AI evaluation costs have crossed a threshold: for many benchmarks, evaluation now rivals or exceeds the cost of training. The Holistic Agent Leaderboard (HAL) spent $40,000 on 21,730 agent rollouts across 9 models and 9 benchmarks, and a single frontier-model run on GAIA can cost $2,829. Scientific ML benchmarks like The Well require 960 H100-hours per architecture evaluation. Unlike static LLM benchmarks, which can be compressed 100–200× while preserving rankings, agent benchmarks compress only 2–3.5×, and training-in-the-loop benchmarks resist compression entirely. Reliability compounds the problem: statistically credible multi-seed evaluations multiply single-run costs by 8× or more. The result is an accountability barrier in which only frontier labs can afford rigorous independent evaluation, concentrating the social process of AI assessment inside the same organizations that build the models. The post advocates standardized sharing of eval artifacts to reduce redundant re-runs, pointing to the EvalEval Coalition's 'Every Eval Ever' project as a practical solution.
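To make the scale concrete, here is a minimal back-of-envelope sketch (Python, purely illustrative) that turns the figures above into a per-rollout average and a multi-seed cost. Applying the 8× reliability multiplier to the single GAIA run is an assumption made for illustration, not a number reported by the sources cited in the post.

```python
# Back-of-envelope cost arithmetic using the figures quoted above.
# Deriving a per-rollout average and applying the 8x seed multiplier to one
# GAIA run are illustrative assumptions, not reported results.

HAL_TOTAL_USD = 40_000        # HAL's total spend on agent evaluations
HAL_ROLLOUTS = 21_730         # agent rollouts across 9 models x 9 benchmarks
GAIA_SINGLE_RUN_USD = 2_829   # one frontier-model run on GAIA
SEED_MULTIPLIER = 8           # multi-seed reliability factor (lower bound)

cost_per_rollout = HAL_TOTAL_USD / HAL_ROLLOUTS
reliable_gaia_run = GAIA_SINGLE_RUN_USD * SEED_MULTIPLIER

print(f"Average cost per agent rollout: ${cost_per_rollout:.2f}")
print(f"One GAIA eval at 8-seed reliability: ${reliable_gaia_run:,}")
```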
Table of contents
- Making static LLM benchmarks cheaper
- Agent evals are messier
- Some evals are just training
- Reliability is the expensive part
- What this means for ML as a field
- Cost summary across benchmark types
- Stop paying twice for the same eval
- Where this leaves us