Anthropic shipped three quality regressions in Claude Code over six weeks that its own evals failed to catch, a wake-up call for any team deploying AI in production. The core argument is that most AI teams have a measurement problem, not a quality problem. Good evals aren't just test suites: they encode what quality means for a specific product, separate regression testing from capability testing, and treat user complaints as the most valuable input. Practical guidance includes writing 20–50 evals drawn from real production failures, distinguishing pass@k from pass^k when setting reliability requirements, tracking quality, latency, and cost as independent trade-offs, making regression scores a hard release gate rather than a report, and writing the eval before writing the prompt. Bad evals (too narrow, an uncalibrated LLM-as-judge, never updated) create false confidence that is worse than no measurement at all.
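The pass@k versus pass^k distinction and the hard-gate idea are easy to get backwards, so here is a minimal sketch of both; the function names, sample counts, and BASELINE threshold are illustrative assumptions, not taken from the article.

```python
# Sketch: pass@k vs pass^k for a single eval task, plus a hard release gate.
# Names (pass_at_k, pass_hat_k, BASELINE) are illustrative, not from the article.
import sys
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k samples passes), given c passing samples out of n.
    Standard unbiased estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k samples pass), estimated as (c/n) ** k.
    This is the metric that matters when users need it to work every time."""
    return (c / n) ** k

# A task that passes 18 of 20 samples (90%) looks perfect under pass@5
# but fails pass^5 badly; optimizing the wrong one hides exactly the kind
# of reliability regression that shows up as user complaints.
n, c = 20, 18
print(f"pass@5 = {pass_at_k(n, c, 5):.4f}")   # 1.0000
print(f"pass^5 = {pass_hat_k(n, c, 5):.4f}")  # 0.5905

# Hard release gate: fail CI if the regression suite drops below baseline,
# instead of just writing the number into a report.
BASELINE = 0.95
regression_score = pass_hat_k(n, c, 1)  # placeholder for the suite's aggregate score
if regression_score < BASELINE:
    sys.exit("regression eval below baseline; blocking release")
```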