Cloudflare experienced a 6-hour outage affecting major services such as ChatGPT, Spotify, and Reddit. A database permissions change in ClickHouse caused the Bot Management module to receive more features than its 200-feature limit, which triggered system panics across edge nodes. The root cause took 2.5 hours to identify partly

10m read time
From blog.pragmaticengineer.com
Table of contents
What happened this time with Cloudflare?
Why so long to find the root cause?
How did the postmortem come so fast?
Learnings
