Agent observability differs fundamentally from traditional software observability because agents are non-deterministic: you cannot know how an agent will behave until it runs. This post explains why debugging agents means debugging reasoning rather than code, introduces three core observability primitives (runs, traces, threads), and shows how these primitives map directly to three levels of agent evaluation: single-step (unit tests for decisions), full-turn (end-to-end trajectory), and multi-turn (context persistence across sessions). Production traces serve triple duty: manual debugging, building offline evaluation datasets from real failures, and powering continuous online evaluation. The key insight is that observability and evaluation are inseparable for agents, because traces are the only source of truth for what an agent actually did.
Table of contents
From debugging code to debugging reasoning
Agent observability ≠ software observability
Agent evaluation ≠ software evaluation
The primitives of agent observability
How this influences agent evaluation
How agent observability powers agent evaluation
What this means for teams building agents