Evaluating new AI models takes months because traditional evaluation methods are unreliable. Benchmark evals are often gamed by AI companies and don't reflect real-world performance. Quick "vibe checks" using trick questions or artistic prompts provide limited signal. The only reliable method is testing models on actual work.

10m read time · From seangoedecke.com
Table of contents

- Evals systematically overstate how good frontier models are
- Vibe checks are not reliable
- Evaluating practical use takes time
- Is AI progress stagnating?
- Summary
