A talk and workshop on agent observability, covering why traditional eval-based testing is insufficient for production AI agents. Key topics:

- Implicit signals (classifiers for user frustration, refusals, task failure) versus explicit signals (error rates, latency, cost)
- Regex patterns as cheap monitoring signals
- Self-diagnostics via a report tool that encourages agents to surface their own anomalous behavior
- A/B experimentation using semantic signals to measure the impact of prompt or model changes
- A demo of the Raindrop platform for production monitoring, with trajectory visualization, alerting, clustering of user intents, and a triage agent that automatically investigates signal spikes
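The "regex patterns as cheap monitoring signals" idea can be sketched roughly as follows. This is an illustrative example, not code from the talk: the signal names and patterns (`refusal`, `user_frustration`) are hypothetical stand-ins for whatever phrases matter in a given product.

```python
import re

# Hypothetical regex-based implicit-signal detectors: cheap patterns that
# flag likely refusals or user frustration in agent transcripts, useful as
# a first-pass filter before running a more expensive LLM classifier.
SIGNAL_PATTERNS = {
    "refusal": re.compile(
        r"\b(I can('|no)t|I'm (unable|not able) to|as an AI)\b",
        re.IGNORECASE,
    ),
    "user_frustration": re.compile(
        r"\b(this is(n'?t| not) working|you('re| are) wrong|useless)\b",
        re.IGNORECASE,
    ),
}

def detect_signals(message: str) -> list[str]:
    """Return the names of all signals whose pattern matches the message."""
    return [name for name, pat in SIGNAL_PATTERNS.items() if pat.search(message)]

print(detect_signals("Sorry, I'm unable to help with that request."))
# -> ['refusal']
```

Regex signals are noisy on their own, but they cost nothing per message, so they work well as a trigger for deeper investigation (e.g. routing matching trajectories to a classifier or a triage agent).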

50m watch time
