Cloudflare has completed its 'Code Orange: Fail Small' engineering initiative, launched in response to major global outages in November and December 2025. Key outcomes include: Snapstone, a new internal system enabling health-mediated progressive rollout with automated rollback for configuration changes; improved failure modes including 'fail stale' and 'fail open/close' strategies; traffic segmentation by customer cohorts to limit blast radius; backup 'break glass' authorization pathways for 18 critical services; and an Engineering Codex — a living rulebook enforced via AI code review at every stage of the development lifecycle. The Codex captures lessons like avoiding unchecked .unwrap() calls in Rust and validating upstream dependencies, turning post-incident learnings into enforced standards. Customer communication improvements include predictable update intervals during incidents and detailed post-mortems.
Table of contents
Safer configuration changes
Reducing the impact of failure
Revised “break glass” and incident management procedures
We have codified our improvements
It’s not just about code: communication is key
This initiative is complete. But our work on resiliency is never done.