Evaluating new AI models takes months because traditional evaluation methods are unreliable. Benchmark evals are often gamed by AI companies and don't reflect real-world performance. Quick "vibe checks" using trick questions or artistic prompts provide limited signal. The only reliable method is testing models on actual work.

10m read time · From seangoedecke.com
Table of contents

- Evals systematically overstate how good frontier models are
- Vibe checks are not reliable
- Evaluating practical use takes time
- Is AI progress stagnating?
- Summary
