Traditional SRE metrics like HTTP uptime and latency fail to capture whether AI agents are actually doing useful work. A three-layer SLO framework is proposed: service-level reliability, output validity, and task success. Each layer fails independently and requires separate error budgets, with burn-rate alerting rather than total-burn tracking. Eight named failure modes are defined (model regression, tool failure, retrieval drift, prompt regression, schema drift, provider outage, cost spike, hallucination), each with distinct detection signals and runbook steps. Reliability drills, OpenTelemetry-based traces with agent-specific attributes, and postmortems that produce actionable followups round out the framework. Several traditional SRE practices β five-nines targets, pure synthetic monitoring, latency-only deployment gates β are identified as unsuitable for non-deterministic agent systems.
Table of contents
Why Traditional Reliability Numbers Lie About AgentsThe Three Layers Of An Agent SLOError Budgets That Match The RealityThe Failure Modes Worth NamingDrills That Find The Bugs Before Production DoesObservability That Connects LayersThe Postmortem Discipline That Actually HelpsWhat Does Not Carry Over From Traditional SREWhat This Looks Like When It WorksSort: