What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench

Peter Gostev of Arena.ai offers two lenses on where LLMs still fall short despite impressive benchmark trends. The first is BullshitBench, a set of 155 nonsense questions that tests whether a model pushes back or simply complies. Most GPT and Gemini models accept the nonsense roughly 50% of the time, while Claude models perform best. Notably, enabling reasoning (thinking) mode often increases compliance with nonsense rather than reducing it. The second is Arena's human preference data (more than 5.5M votes), which shows a persistent ~9% dissatisfaction rate even among top-25 models; categories such as creative writing, gaming, law, and finance have improved far less than math and other quantitative tasks. The core argument: standard benchmarks measure narrow, well-defined tasks and miss the broader distribution of real work, so the 'line goes up' narrative gives a misleading picture of the capability gaps that remain.

20m watch time
