It's all fake
A critical breakdown of how major AI benchmarks are gamed or fundamentally flawed. Examples include Terminal-Bench (exploited via curl injection), SWE-bench (test results overridden with conftest files), WebArena (golden answers readable from config files), Fieldwork Arena (a validation function that accepts any AI-looking response), and GAIA (a self-reported leaderboard with no sandbox). Also covers Anthropic's misleading chart axes, Meta's 'claudonomics' token-burn culture, and fake GitHub stars on repos like GStack. The throughline is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
•12m watch time