Robert Youssef @rryssf_
a benchmark isn't a dataset. it's a triplet: dataset, model, judge.

a new paper audited Omni-MATH (olympiad-level math) and found Omni-Judge was wrong in 96.4% of its disagreements with GPT-5 mini. not a few edge cases. nearly every time the judges disagreed, the weaker one was incorrect.

worse: swapping judges changed the actual ranking of frontier models. Claude Sonnet 4.5, DeepSeek v3.2, Gemini 3 Pro, GPT-5, Kimi K2 Thinking. same problems. different judge. different leaderboard order.

and the disagreement rate increases with problem difficulty. meaning the harder the question, the more your benchmark score reflects judge competence instead of model competence.

14.6% of the original dataset also had errors: missing images, broken LaTeX, problems asking for proofs but graded against exact final answers. a PhD mathematician manually cleaned every single entry.

the uncomfortable implication: most benchmarks we use to compare frontier models are partly measuring how bad the evaluator is at grading. we're not hitting model ceilings. we're hitting judge ceilings. and almost nobody reports which judge they used alongside their scores.
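the triplet framing is easy to demo. here's a minimal sketch (nothing here is from the paper: the judges, problems, and model names are all invented) showing the same frozen model outputs producing a different leaderboard depending only on which judge function you plug in:

```python
# a benchmark as a triplet: (dataset, model outputs, judge).
# everything below is invented for illustration -- toy problems,
# toy judges, made-up model names -- the point is only that the
# judge is a free parameter of the final score.
from fractions import Fraction

def strict_judge(reference: str, answer: str) -> bool:
    # exact string match: marks equivalent forms like "0.5" vs "1/2" wrong
    return answer.strip() == reference.strip()

def lenient_judge(reference: str, answer: str) -> bool:
    # numeric-equivalence judge: accepts any form with the same value
    try:
        return Fraction(answer.strip()) == Fraction(reference.strip())
    except ValueError:
        return answer.strip() == reference.strip()

# (problem id, reference answer)
dataset = [("p1", "1/2"), ("p2", "4"), ("p3", "0.25"), ("p4", "9")]

# frozen model outputs: model_a matches the reference formatting,
# model_b is right more often but writes answers in other forms
outputs = {
    "model_a": {"p1": "1/2", "p2": "4", "p3": "0.3", "p4": "8"},
    "model_b": {"p1": "0.5", "p2": "4.0", "p3": "1/4", "p4": "9"},
}

def leaderboard(judge):
    scores = {
        model: sum(judge(ref, answers[pid]) for pid, ref in dataset)
        for model, answers in outputs.items()
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])

print("strict judge: ", leaderboard(strict_judge))   # model_a wins
print("lenient judge:", leaderboard(lenient_judge))  # model_b wins
# same dataset, same outputs: the ranking flipped with the judge
```

swap the judge, flip the ranking. that's the whole failure mode, in ~30 lines.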