What types of real-world model responses do users still hate? We see millions of users' prompts, and we let users 'dislike both' responses on the Arena. We'll show you trends and examples of the tasks that LLMs still suck at despite the relentless hillclimbing.

Speaker info:
- https://x.com/petergostev
- https://www.linkedin.com/in/peter-gostev/

AI Engineer

Peter Gostev from Arena.ai presents two lenses for examining where LLMs still fall short despite impressive benchmark trends. First, BullshitBench: 155 nonsense questions fed to models to see whether they push back or comply. Results show that most GPT and Gemini models accept nonsense ~50% of the time, while Claude models perform best. Notably, enabling reasoning/thinking mode often makes compliance with nonsense worse, not better. Second, Arena's human preference data (5.5M+ votes) reveals a persistent ~9% dissatisfaction rate even among top-25 models, with categories like creative writing, gaming, law, and finance showing far less improvement than math or quantitative tasks. The core argument: standard benchmarks measure narrow, well-defined tasks and miss the broader distribution of real work, so the 'line goes up' narrative is misleading about actual model capability gaps.
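
To make the per-category comparison concrete, here is a minimal sketch of how a dissatisfaction rate could be computed, assuming it is the share of votes in which the user disliked both responses. The vote records, the category names, and the `both_bad` outcome label are hypothetical illustrations, not Arena's actual schema or data.

```python
from collections import defaultdict

# Hypothetical vote records as (category, outcome) pairs.
# The "both_bad" label stands in for a 'dislike both' vote; all values are made up.
votes = [
    ("math", "a_wins"),
    ("math", "b_wins"),
    ("math", "tie"),
    ("math", "both_bad"),
    ("creative_writing", "both_bad"),
    ("creative_writing", "a_wins"),
    ("creative_writing", "both_bad"),
    ("creative_writing", "b_wins"),
]


def both_bad_rate(records):
    """Return the fraction of votes per category where both responses were disliked."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for category, outcome in records:
        totals[category] += 1
        if outcome == "both_bad":
            bad[category] += 1
    return {category: bad[category] / totals[category] for category in totals}


rates = both_bad_rate(votes)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%} of votes disliked both responses")
```

Tracking this rate per category over time, rather than a single aggregate score, is what would show whether some task types improve more slowly than others.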

What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench