Robert Youssef @rryssf_
a benchmark isn't a dataset. it's a triplet: dataset, model, judge.

a new paper audited Omni-MATH (olympiad-level math) and found Omni-Judge was wrong in 96.4% of its disagreements with GPT-5 mini. not a few edge cases. nearly every time the judges disagreed, the weaker one was incorrect.

worse: swapping judges changed the actual ranking of frontier models. Claude Sonnet 4.5, DeepSeek v3.2, Gemini 3 Pro, GPT-5, Kimi K2 Thinking. same problems. different judge. different leaderboard order.

and the disagreement rate increases with problem difficulty. meaning the harder the question, the more your benchmark score reflects judge competence instead of model competence.

14.6% of the original dataset also had errors: missing images, broken LaTeX, problems asking for proofs but graded against exact final answers. a PhD mathematician manually cleaned every single entry.

the uncomfortable implication: most benchmarks we use to compare frontier models are partly measuring how bad the evaluator is at grading. we're not hitting model ceilings. we're hitting judge ceilings. and almost nobody reports which judge they used alongside their scores.
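the triplet framing is easy to demo. here's a minimal sketch (nothing here is from the paper: the judges, problems, and model names are all invented) showing the same frozen model outputs producing a different leaderboard depending only on which judge function you plug in:

```python
# a benchmark as a triplet: (dataset, model outputs, judge).
# everything below is invented for illustration -- toy problems,
# toy judges, made-up model names -- the point is only that the
# judge is a free parameter of the final score.
from fractions import Fraction

def strict_judge(reference: str, answer: str) -> bool:
    # exact string match: marks equivalent forms like "0.5" vs "1/2" wrong
    return answer.strip() == reference.strip()

def lenient_judge(reference: str, answer: str) -> bool:
    # numeric-equivalence judge: accepts any form with the same value
    try:
        return Fraction(answer.strip()) == Fraction(reference.strip())
    except ValueError:
        return answer.strip() == reference.strip()

# (problem id, reference answer)
dataset = [("p1", "1/2"), ("p2", "4"), ("p3", "0.25"), ("p4", "9")]

# frozen model outputs: model_a matches the reference formatting,
# model_b is right more often but writes answers in other forms
outputs = {
    "model_a": {"p1": "1/2", "p2": "4", "p3": "0.3", "p4": "8"},
    "model_b": {"p1": "0.5", "p2": "4.0", "p3": "1/4", "p4": "9"},
}

def leaderboard(judge):
    scores = {
        model: sum(judge(ref, answers[pid]) for pid, ref in dataset)
        for model, answers in outputs.items()
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])

print("strict judge: ", leaderboard(strict_judge))   # model_a wins
print("lenient judge:", leaderboard(lenient_judge))  # model_b wins
# same dataset, same outputs: the ranking flipped with the judge
```

swap the judge, flip the ranking. that's the whole failure mode, in ~30 lines.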