DigitalOcean replaced its incident-based availability tracking with an SLI-driven framework after discovering that monthly availability numbers (oscillating between 99.5% and 99.9%) reflected incident declaration patterns rather than actual customer experience. The new framework splits measurement into two planes: a Control Plane (API/orchestration layer, measured by request success rate excluding 4xx errors) and a Data Plane (live resources like Droplets, DOKS, Spaces, measured by either resource-minutes or request success rate depending on product type). Aggregation across regions uses weighted request volume so busier data centers contribute proportionally more signal. The framework also incorporates Prometheus recording rules, multi-window alerting, and error budget policies that now drive engineering priorities. It is being extended to newer products including GPU Droplets and Agentic Inference Cloud.
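As a rough illustration of the two ideas above (a success rate that excludes 4xx errors, and aggregation weighted by request volume), here is a minimal sketch. The `RegionCounts` type, the function names, and the region labels are hypothetical and are not taken from DigitalOcean's implementation. It also assumes that "excluding 4xx" means dropping client errors from both the numerator and the denominator, and that each region's weight is its total request count.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RegionCounts:
    """Hypothetical per-region request counters for one measurement window."""
    region: str
    total: int          # all requests observed
    client_errors: int  # 4xx responses (assumed excluded from the SLI)
    server_errors: int  # 5xx responses (counted as failures)


def success_rate(c: RegionCounts) -> Optional[float]:
    """Success rate with 4xx responses removed from numerator and denominator."""
    valid = c.total - c.client_errors
    if valid <= 0:
        return None  # no eligible traffic, so no signal from this region
    return (valid - c.server_errors) / valid


def weighted_availability(regions: Iterable[RegionCounts]) -> Optional[float]:
    """Average the per-region success rates, weighted by total request volume."""
    weighted_sum = 0.0
    weight_total = 0
    for c in regions:
        rate = success_rate(c)
        if rate is None:
            continue
        weighted_sum += rate * c.total
        weight_total += c.total
    if weight_total == 0:
        return None
    return weighted_sum / weight_total


if __name__ == "__main__":
    sample = [
        RegionCounts("region-a", total=1000, client_errors=0, server_errors=10),
        RegionCounts("region-b", total=100, client_errors=0, server_errors=50),
    ]
    print(weighted_availability(sample))
```

In this sample the busy region sits at 99% and the quiet region at 50%. A plain average of the two would be 74.5%, while the volume-weighted figure stays near 94.5%, which is closer to what most requests actually experienced.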
Table of contents
The Old Methodology
Splitting Our Measurement: Control Plane vs Data Plane
Magnitude matters