A preface to an upcoming book on the science of machine learning benchmarks. It argues that while benchmarks have well-documented flaws — gaming, overfitting, bias, labor exploitation — they have undeniably driven ML progress. The author explores why benchmarks worked despite lacking rigorous statistical foundations, focusing on model rankings rather than absolute scores as the true scientific output. The book covers the holdout method, adaptivity problems, cross-validation, and transitions to LLM-era challenges: unknown training data contamination, multi-task aggregation problems (drawing on social choice theory), performativity, and the existential challenge of evaluating models that surpass human evaluators. The goal is to build a proper scientific foundation for benchmarking practice.

Table of contents
- Overview
- Who is this book for?
- Acknowledgments
