Most teams approach evals like unit tests and try to cover every possible failure. Phil Hetzel from Braintrust argues that is the wrong frame: enumerate your known failure modes, cover those specifically, and ship. The goal is a flywheel where production traces surface what is going wrong, feed back into offline experimentation, and guide the next improvement.

The session walks through four maturity stages. It starts with vibe checking, where humans document their justifications rather than just giving a thumbs up or down. Next comes LLM-as-judge scoring, built from those justifications and run at scale. Then comes the hard part: tool calls that touch external systems. Context-gathering tools are manageable. CRUD tools are not, because you have to represent the state of the external systems at the exact moment the original trace ran. Two approaches for getting there are timestamp-based queries against a vector database and injecting captured system state directly into the trace.
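To make the two state-reconstruction approaches concrete, here is a minimal Python sketch. It is not taken from the talk or from any Braintrust API: the `Document`, `Trace`, `visible_at`, and `replay_update_ticket` names, the `captured_state` field, and the sample data are all hypothetical, and a plain list stands in for the vector database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Document:
    doc_id: str
    text: str
    indexed_at: datetime  # when the document entered the store


@dataclass
class Trace:
    trace_id: str
    started_at: datetime
    # Hypothetical field: external system state captured when the trace ran.
    captured_state: dict = field(default_factory=dict)


def visible_at(docs, as_of):
    """Approach 1: keep only documents that existed at the trace's timestamp."""
    return [d for d in docs if d.indexed_at <= as_of]


def replay_update_ticket(trace, ticket_id, new_status):
    """Approach 2: run a CRUD-style tool against a copy of the captured state,
    never against the live system."""
    state = {key: dict(value) for key, value in trace.captured_state.items()}
    if ticket_id not in state:
        return {"ok": False, "error": "ticket not found"}, state
    state[ticket_id]["status"] = new_status
    return {"ok": True}, state


# Illustrative data only.
DOCS = [
    Document("a", "Refund policy v1", datetime(2025, 1, 1, tzinfo=timezone.utc)),
    Document("b", "Refund policy v2", datetime(2025, 3, 1, tzinfo=timezone.utc)),
]
TRACE = Trace(
    trace_id="trace-1",
    started_at=datetime(2025, 2, 1, tzinfo=timezone.utc),
    captured_state={"T-1": {"status": "open"}},
)

context_docs = visible_at(DOCS, TRACE.started_at)
result, state_after = replay_update_ticket(TRACE, "T-1", "closed")
```

In this sketch the offline run sees only the document that existed when the trace started, and the update is applied to a copy of the snapshot, so rerunning the eval does not mutate the captured state or any real system.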

Speaker info:
- https://www.linkedin.com/in/philliphetzel/

AI Engineer

A conference talk covering the maturity phases of running evaluations for AI agents. It starts from basic vibe checking with human annotation, progresses to automated LLM-as-judge scoring, and then covers complex agents that interact with external systems through tool calls and CRUD operations. Key concepts include building an eval flywheel from production traces, the difference between evals and unit tests, deterministic vs. LLM-based scoring, and techniques for representing external system state during offline eval runs.
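The contrast between deterministic and LLM-based scoring can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the speaker's implementation: `exact_tool_match`, `build_judge_prompt`, `llm_judge_score`, the prompt wording, and the PASS/FAIL convention are hypothetical, and `fake_judge` is a stub standing in for a real model call.

```python
from typing import Callable, List


def exact_tool_match(expected_tool: str, actual_tool: str) -> float:
    """Deterministic scorer: 1.0 when the agent called the expected tool."""
    return 1.0 if expected_tool == actual_tool else 0.0


def build_judge_prompt(justifications: List[str], output: str) -> str:
    """Turn documented human justifications into grading criteria."""
    criteria = "\n".join(f"- {j}" for j in justifications)
    return (
        "Grade the response against these reviewer criteria.\n"
        f"{criteria}\n"
        "Answer PASS or FAIL.\n"
        "Response to grade:\n"
        f"{output}"
    )


def llm_judge_score(
    output: str,
    justifications: List[str],
    call_model: Callable[[str], str],
) -> float:
    """LLM-based scorer: call_model is any function mapping a prompt to text."""
    verdict = call_model(build_judge_prompt(justifications, output))
    return 1.0 if verdict.strip().upper().startswith("PASS") else 0.0


def fake_judge(prompt: str) -> str:
    """Stub judge for the sketch; a real setup would call a model here."""
    response = prompt.rsplit("Response to grade:", 1)[1]
    return "FAIL" if "cannot help" in response.lower() else "PASS"


JUSTIFICATIONS = ["Reviewer marked refusals of in-scope questions as failures."]

tool_score = exact_tool_match("search_orders", "search_orders")
good_score = llm_judge_score("Your refund was issued today.", JUSTIFICATIONS, fake_judge)
bad_score = llm_judge_score("Sorry, I cannot help with that.", JUSTIFICATIONS, fake_judge)
```

Passing the model call in as a function keeps the scorer testable offline; swapping `fake_judge` for a real client is left to whichever provider is in use.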

The maturity phases of running evals — Phil Hetzel, Braintrust