A conference talk on the maturity phases of running evaluations for AI agents. It starts with basic vibe-checking with human annotation, progresses to automated LLM-as-judge scoring, and then turns to complex agents that interact with external systems via tool calls and CRUD operations. Key concepts include building an eval flywheel from production traces, the difference between evals and unit tests, deterministic vs. LLM-based scoring, and techniques for representing external system state during offline eval runs.

18m watch time
