Cloudflare has had its second major outage in two weeks; both were caused by global configuration changes that propagated instantly across its network. The latest incident affected 28% of HTTP traffic for 25 minutes after a killswitch for an internal testing tool unexpectedly triggered HTTP 500 errors. The company acknowledges that implementing staged rollouts for all configuration changes remains its top priority, though this infrastructure work could take months. Past incidents at Meta, AWS, Datadog, and Google Cloud show that global configuration errors are a common cause of large-scale outages, particularly in systems operating at massive scale.
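To make the contrast with an instant global push concrete, here is a minimal sketch of what a staged rollout can look like. It is not Cloudflare's implementation: the function `staged_rollout`, the host names, and the in-memory `fleet` dictionary are all hypothetical, and the health check simply simulates one failing host.

```python
from typing import Callable, Dict, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Apply a change one stage at a time, halting and rolling back on failure."""
    applied: List[str] = []
    for stage in stages:
        # Push the change only to the hosts in the current stage.
        for host in stage:
            apply_change(host)
            applied.append(host)
        # Gate the next stage on the health of the hosts just changed.
        if not all(is_healthy(host) for host in stage):
            # Undo every host touched so far, most recent first.
            for host in reversed(applied):
                rollback(host)
            return False
    return True


# Hypothetical fleet: a canary, then two progressively larger groups.
fleet: Dict[str, str] = {}
stages = [["canary-1"], ["region-a-1", "region-a-2"], ["region-b-1", "region-b-2"]]


def apply_change(host: str) -> None:
    fleet[host] = "new"


def rollback(host: str) -> None:
    fleet[host] = "old"


def is_healthy(host: str) -> bool:
    # Assumption for the demo: one host in the second stage fails.
    return host != "region-a-2"


ok = staged_rollout(stages, apply_change, is_healthy, rollback)
```

In this run the failure in the second stage stops the rollout before the third stage is touched, which is the property a global, instantaneous configuration change lacks.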

7m read time · From blog.pragmaticengineer.com
