Anthropic shipped three quality regressions in Claude Code over six weeks that its own evals failed to catch, a wake-up call for any team deploying AI in production. The core argument is that most AI teams have a measurement problem, not a quality problem. Good evals aren't just test suites: they encode what quality means for a specific product, separate regression testing from capability testing, and treat user complaints as the most valuable input. Practical guidance includes writing 20–50 evals drawn from real production failures, distinguishing pass@k from pass^k when setting reliability requirements, tracking quality, latency, and cost as independent trade-offs, making regression scores a hard release gate rather than a report, and writing the eval before writing the prompt. Bad evals (too narrow, an uncalibrated LLM-as-judge, never updated) create false confidence that is worse than no measurement at all.
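The pass@k versus pass^k distinction and the hard-gate idea are easy to get backwards, so here is a minimal sketch of both; the function names, sample counts, and BASELINE threshold are illustrative assumptions, not taken from the article.

```python
# Sketch: pass@k vs pass^k for a single eval task, plus a hard release gate.
# Names (pass_at_k, pass_hat_k, BASELINE) are illustrative, not from the article.
import sys
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k samples passes), given c passing samples out of n.
    Standard unbiased estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k samples pass), estimated as (c/n) ** k.
    This is the metric that matters when users need it to work every time."""
    return (c / n) ** k

# A task that passes 18 of 20 samples (90%) looks perfect under pass@5
# but fails pass^5 badly; optimizing the wrong one hides exactly the kind
# of reliability regression that shows up as user complaints.
n, c = 20, 18
print(f"pass@5 = {pass_at_k(n, c, 5):.4f}")   # 1.0000
print(f"pass^5 = {pass_hat_k(n, c, 5):.4f}")  # 0.5905

# Hard release gate: fail CI if the regression suite drops below baseline,
# instead of just writing the number into a report.
BASELINE = 0.95
regression_score = pass_hat_k(n, c, 1)  # placeholder for the suite's aggregate score
if regression_score < BASELINE:
    sys.exit("regression eval below baseline; blocking release")
```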