Analysis of Cloudflare's February 20, 2026 outage based on their public write-up. The incident was triggered by a bug in new automation meant to replace a dangerous manual process for removing BYOIP prefixes. A client-server mismatch in how a query flag was passed (missing '=true') caused the server to return all prefixes instead of only those pending deletion, leading to active prefixes being deleted. Key observations include: reliability improvements can introduce new failure modes, cleanup operations are inherently risky, in-flight reliability work (Code Orange: Fail Small) wasn't yet ready to speed recovery, and engineers had to perform intensive manual data recovery steps to restore service over roughly six hours.
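The flag mismatch described above can be illustrated with a minimal sketch. This is not Cloudflare's code: the parameter name `pending_delete`, the sample prefixes, and both handler functions are hypothetical, chosen only to show how a flag sent without a value can be read as "not set" by a server that checks for a non-empty string.

```python
from urllib.parse import parse_qs

# Hypothetical data: prefix -> whether it is pending deletion.
PREFIXES = {
    "192.0.2.0/24": False,
    "198.51.100.0/24": False,
    "203.0.113.0/24": True,
}


def list_prefixes_lenient(query_string: str) -> list[str]:
    """Assumed failure mode: an empty flag value is treated as 'flag absent'."""
    params = parse_qs(query_string, keep_blank_values=True)
    value = params.get("pending_delete", [""])[0]
    if value != "":
        return [p for p, pending in PREFIXES.items() if pending]
    # A bare "?pending_delete" has an empty value, so it falls through
    # to the unfiltered listing.
    return list(PREFIXES)


def list_prefixes_strict(query_string: str) -> list[str]:
    """One possible hardening: reject a flag that is present but malformed."""
    params = parse_qs(query_string, keep_blank_values=True)
    if "pending_delete" not in params:
        return list(PREFIXES)
    value = params["pending_delete"][0]
    if value not in ("true", "false"):
        raise ValueError(f"pending_delete must be 'true' or 'false', got {value!r}")
    wanted = value == "true"
    return [p for p, pending in PREFIXES.items() if pending == wanted]


if __name__ == "__main__":
    print(list_prefixes_lenient("pending_delete=true"))  # only the pending prefix
    print(list_prefixes_lenient("pending_delete"))       # every prefix
```

In this sketch, a cleanup job that deletes whatever the lenient handler returns would remove active prefixes as soon as the client drops the `=true`. The strict variant fails loudly instead, which is one way, among others, to keep such a mismatch from silently widening the result set.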
Table of contents
System intended to improve reliability contributed to incident
How do you pass the flag?
Cleanup, but still in use
Reliability work in-flight
People adapt to bring the system back to healthy