AI evaluation costs have crossed a threshold: for many benchmarks, evaluation now rivals or exceeds the cost of training. The Holistic Agent Leaderboard (HAL) spent $40,000 on 21,730 agent rollouts across 9 models and 9 benchmarks, and a single frontier-model run on GAIA can cost $2,829. Scientific ML benchmarks like The Well require 960 H100-hours per architecture evaluation. Unlike static LLM benchmarks, which can be compressed 100–200× while preserving rankings, agent benchmarks compress only 2–3.5×, and training-in-the-loop benchmarks resist compression entirely. Reliability compounds the problem: statistically credible multi-seed evaluations multiply single-run costs by 8× or more. The result is an accountability barrier in which only frontier labs can afford rigorous independent evaluation, concentrating the social process of AI assessment inside the same organizations that build the models. The post advocates standardized sharing of eval artifacts to reduce redundant re-runs, pointing to the EvalEval Coalition's 'Every Eval Ever' project as a practical solution.
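To make the scale concrete, here is a minimal back-of-envelope sketch (Python, purely illustrative) that turns the figures above into a per-rollout average and a multi-seed cost. Applying the 8× reliability multiplier to the single GAIA run is an assumption made for illustration, not a number reported by the sources cited in the post.

```python
# Back-of-envelope cost arithmetic using the figures quoted above.
# Deriving a per-rollout average and applying the 8x seed multiplier to one
# GAIA run are illustrative assumptions, not reported results.

HAL_TOTAL_USD = 40_000        # HAL's total spend on agent evaluations
HAL_ROLLOUTS = 21_730         # agent rollouts across 9 models x 9 benchmarks
GAIA_SINGLE_RUN_USD = 2_829   # one frontier-model run on GAIA
SEED_MULTIPLIER = 8           # multi-seed reliability factor (lower bound)

cost_per_rollout = HAL_TOTAL_USD / HAL_ROLLOUTS
reliable_gaia_run = GAIA_SINGLE_RUN_USD * SEED_MULTIPLIER

print(f"Average cost per agent rollout: ${cost_per_rollout:.2f}")
print(f"One GAIA eval at 8-seed reliability: ${reliable_gaia_run:,}")
```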
Table of contents
- Making static LLM benchmarks cheaper
- Agent evals are messier
- Some evals are just training
- Reliability is the expensive part
- What this means for ML as a field
- Cost summary across benchmark types
- Stop paying twice for the same eval
- Where this leaves us