A comprehensive test of modern LLMs reveals surprising inconsistencies in their ability to count letters in words. While most models correctly identify 3 r's in "strawberry" and 2 b's in "blueberry," GPT-5 Chat fails the blueberry test 73% of the time, often confidently claiming there are 3 b's. The study tested multiple models, including GPT-5 variants, Claude, and Gemini, across 274 trials. Because some models perform perfectly while others struggle on the same task, tokenization limitations alone don't fully explain these failures.
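For reference, the ground-truth answers the study grades against are trivial to verify programmatically, which is part of what makes the LLM failures notable. A minimal sketch (`count_letter` is an illustrative helper, not code from the study):

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # → 3
print(count_letter("blueberry", "b"))   # → 2
```

A deterministic check like this is exactly what an LLM lacks: it sees tokens rather than characters, so it must recall or infer spellings rather than count them directly.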

10 min read · From minimaxir.com
