A database permissions change caused Cloudflare's Bot Management feature file to double in size with duplicate entries, exceeding hardcoded limits in the proxy system and triggering widespread 5xx errors across the network for over 3 hours. The issue affected core CDN services, Workers KV, Access, and Turnstile, with intermittent failures making diagnosis difficult as the system alternated between good and bad configuration files every five minutes. The team initially suspected a DDoS attack before identifying the root cause: a ClickHouse query that didn't filter by database name began returning duplicate column metadata after permission changes. Recovery involved stopping automated file generation, manually deploying a known-good configuration, and restarting affected services.
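The failure chain can be sketched in Python under a few stated assumptions. Column metadata is modeled as rows of (database, table, name, type). A permission change makes the same table visible in a second database, called "r0" here as an assumption. The feature file must stay under a fixed size limit. The names `ColumnRow`, `feature_names`, `build_feature_file`, `load_or_keep`, `FeatureFileError`, the table name `http_requests_features` and `MAX_FEATURES = 200` are hypothetical illustrations, not Cloudflare's code. The sketch shows two things. First, filtering by database name removes the duplicate rows. Second, validating a candidate file and keeping the last known-good one contains a bad file instead of propagating it.

```python
from dataclasses import dataclass

# Hypothetical limit: assume the consumer preallocates room for this many features.
MAX_FEATURES = 200


@dataclass(frozen=True)
class ColumnRow:
    """One row of column metadata (hypothetical shape)."""
    database: str
    table: str
    name: str
    type: str


class FeatureFileError(ValueError):
    """Raised when a candidate feature file fails validation."""


def feature_names(rows, table, database=None):
    """Return feature (column) names for `table`.

    With database=None this mimics a query that filters only by table name:
    every database exposing a table with that name contributes rows, so
    names can repeat once a second database becomes visible.
    """
    return [
        r.name
        for r in rows
        if r.table == table and (database is None or r.database == database)
    ]


def build_feature_file(names, limit=MAX_FEATURES):
    """Validate a candidate feature list before it is published."""
    if len(names) != len(set(names)):
        raise FeatureFileError("duplicate feature names")
    if len(names) > limit:
        raise FeatureFileError(f"{len(names)} features exceeds limit of {limit}")
    return list(names)


def load_or_keep(candidate, last_good):
    """Reject an invalid candidate and keep serving the last known-good list."""
    try:
        return build_feature_file(candidate)
    except FeatureFileError:
        return last_good


if __name__ == "__main__":
    cols = ["f1", "f2", "f3"]
    rows = [ColumnRow("default", "http_requests_features", c, "Float64") for c in cols]
    # Assumed effect of the permission change: the same table is now also visible in "r0".
    rows += [ColumnRow("r0", "http_requests_features", c, "Float64") for c in cols]

    unfiltered = feature_names(rows, "http_requests_features")
    filtered = feature_names(rows, "http_requests_features", database="default")
    print(len(unfiltered), len(filtered))  # prints: 6 3

    last_good = build_feature_file(filtered)
    # The doubled list is rejected; the known-good list keeps being served.
    assert load_or_keep(unfiltered, last_good) == filtered
```

The design choice in this sketch is to treat a duplicated or oversized file as invalid input at generation or load time and fall back, rather than letting a hardcoded limit fail further downstream. It automates the same idea as the manual recovery step of deploying a known-good configuration.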
Table of contents
The outage
How Cloudflare processes requests, and how this went wrong today
Remediation and follow-up steps