Deja vu: a large Cloudflare outage caused by an instantly rolled-out global config change – two weeks after a similar problem

Pragmatic Engineer is a blog authored by Gergely Orosz, covering topics related to software engineering, career growth, and technology leadership. With a focus on practical advice and real-world experiences, Pragmatic Engineer offers insights for developers and tech professionals. Developers can learn about software design principles, career advancement strategies, and industry trends through Pragmatic Engineer's articles and essays.

The Pragmatic Engineer

Cloudflare experienced its second major outage in two weeks. Both incidents were caused by global configuration changes that propagated instantly across its network. The latest incident affected 28% of HTTP traffic for 25 minutes, when a killswitch for an internal testing tool unexpectedly triggered HTTP 500 errors. The company acknowledges that implementing staged rollouts for all configuration changes remains its top priority, though this infrastructure work could take months. Historical examples from Meta, AWS, Datadog, and Google Cloud show that global configuration errors are a common cause of large-scale outages, particularly in systems operating at massive scale.
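To make the contrast with an instant global push concrete, here is a minimal sketch of what a staged rollout with automatic rollback could look like. It is an illustration only and not a description of Cloudflare's systems: the `staged_rollout` function, the in-memory `fleet` dictionary, the `is_healthy` probe, and the stage fractions are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Dict, Sequence

Config = Dict[str, object]


def staged_rollout(
    fleet: Dict[str, Config],
    new_config: Config,
    is_healthy: Callable[[str, Config], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
) -> bool:
    """Apply new_config to growing fractions of the fleet.

    Returns True if every stage passed its health check, or False if a
    stage failed and the servers touched so far were rolled back.
    """
    servers = sorted(fleet)
    # Snapshot the current config so a failed stage can be undone.
    previous = {name: fleet[name] for name in servers}
    done = 0  # number of servers already running new_config

    for fraction in stage_fractions:
        target = max(1, int(len(servers) * fraction))
        for name in servers[done:target]:
            fleet[name] = new_config
        done = max(done, target)

        # A real system would wait and watch metrics here before
        # continuing; this sketch checks health immediately.
        if not all(is_healthy(name, fleet[name]) for name in servers[:done]):
            for name in servers[:done]:
                fleet[name] = previous[name]
            return False

    return True


if __name__ == "__main__":
    # Hypothetical fleet of 200 edge servers and a config that breaks them.
    demo_fleet = {f"edge-{i:03d}": {"killswitch": False} for i in range(200)}
    ok = staged_rollout(
        demo_fleet,
        {"killswitch": True},
        lambda name, cfg: not cfg["killswitch"],
    )
    print("rollout succeeded:", ok)
```

In this sketch a bad change is caught at the first stage, after reaching only a couple of servers, instead of reaching the whole fleet at once. The trade-off is speed: each stage adds waiting time, which is one reason urgent changes are often pushed through faster, global paths.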

The Pulse: Cloudflare’s latest outage proves dangers of global configuration changes (again)

Global configuration errors often trigger large outages