A talk and workshop on agent observability, covering why traditional eval-based testing is insufficient for production AI agents. Key topics:

- Implicit signals (classifiers for user frustration, refusals, task failure) versus explicit signals (error rates, latency, cost)
- Regex patterns as cheap monitoring signals
- Self-diagnostics via a report tool that encourages agents to surface their own anomalous behavior
- A/B experimentation using semantic signals to measure the impact of prompt or model changes
- A demo of the Raindrop platform for production monitoring, with trajectory visualization, alerting, clustering of user intents, and a triage agent that automatically investigates signal spikes
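The "regex patterns as cheap monitoring signals" idea can be sketched roughly as follows. This is an illustrative example, not code from the talk: the signal names and patterns (`refusal`, `user_frustration`) are hypothetical stand-ins for whatever phrases matter in a given product.

```python
import re

# Hypothetical regex-based implicit-signal detectors: cheap patterns that
# flag likely refusals or user frustration in agent transcripts, useful as
# a first-pass filter before running a more expensive LLM classifier.
SIGNAL_PATTERNS = {
    "refusal": re.compile(
        r"\b(I can('|no)t|I'm (unable|not able) to|as an AI)\b",
        re.IGNORECASE,
    ),
    "user_frustration": re.compile(
        r"\b(this is(n'?t| not) working|you('re| are) wrong|useless)\b",
        re.IGNORECASE,
    ),
}

def detect_signals(message: str) -> list[str]:
    """Return the names of all signals whose pattern matches the message."""
    return [name for name, pat in SIGNAL_PATTERNS.items() if pat.search(message)]

print(detect_signals("Sorry, I'm unable to help with that request."))
# -> ['refusal']
```

Regex signals are noisy on their own, but they cost nothing per message, so they work well as a trigger for deeper investigation (e.g. routing matching trajectories to a classifier or a triage agent).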

50m watch time
