AI benchmarks are saturating, but that doesn't mean AI can handle real-world complexity. Researchers are turning to 'open-world evaluations' — long, messy, real-world tasks with small sample sizes, human intervention, and qualitative log analysis — as a complementary signal to traditional benchmarks. This paper defines this emerging class of evaluation and introduces CRUX: Collaborative Research for Updating AI eXpectations.
Table of contents
Open-world evaluations are an important emerging class of AI evaluation
Introducing CRUX: Collaborative Research for Updating AI eXpectations