AI benchmarks are saturating, but that doesn't mean AI can handle real-world complexity. Researchers are turning to 'open-world evaluations' — long, messy, real-world tasks with small sample sizes, human intervention, and qualitative log analysis — as a complementary signal to traditional benchmarks. This paper defines the concept, surveys 10 notable examples from 2025-2026 (including Claude playing Pokemon, Anthropic's C compiler, and Cursor's browser experiment), and introduces CRUX, a 17-researcher collaboration that will regularly conduct such evaluations. In CRUX's first experiment, an AI agent nearly autonomously built and published a breathing exercise app to the iOS App Store for ~$1,000, making only two errors. The result serves as an early warning that app store spam via autonomous agents is imminent — a finding disclosed to Apple before publication. Best practices for running open-world evals are also outlined, including documenting human interventions, investing in log analysis, conducting dry runs, and measuring cost.
Table of contents
Open-world evaluations are an important emerging class of AI evaluation
Introducing CRUX: Collaborative Research for Updating AI eXpectations