Cloudflare declared "Code Orange: Fail Small" following two major outages in November and December 2025. Both incidents were caused by instantaneous global deployment of configuration changes that broke the network. The resilience plan focuses on three areas: implementing controlled rollouts for configuration changes (similar to existing software deployment processes), reviewing and improving failure modes across all systems to handle errors gracefully, and fixing break glass procedures to remove circular dependencies. The goal is to ensure configuration changes pass through testing gates before global deployment, preventing single changes from taking down the entire network.

10m read timeFrom blog.cloudflare.com
Post cover image
Table of contents
What went wrong?We will change how we deploy configuration updates at CloudflareHow will we address failure modes between services?How will we solve emergencies faster?When will we be done?
1 Comment

Sort: