Cloudflare's billing pipeline started missing daily deadlines after the partitioning key of a ClickHouse table was changed from (day) to (namespace, day) to enable per-tenant retention. Standard metrics showed nothing wrong: I/O, memory, and parts read per query all looked normal. Using ClickHouse's trace_log and flame graphs, engineers discovered that over half of query duration was spent waiting on a single exclusive mutex (MergeTreeData) that every query planner thread had to acquire to copy the full parts list. Three fixes were applied in sequence: (1) switching from an exclusive to a shared lock, which eliminated the contention immediately; (2) deferring the full vector copy by maintaining a shared cache, which reduced copy overhead; and (3) replacing a linear scan over all parts with a binary search on the namespace partition key, which cut query durations by 50% and broke the correlation between part count and latency. The fixes were contributed upstream and merged into ClickHouse 25.11. At peak, the cluster reached 160k parts per replica, yet query durations remained stable.
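To illustrate the third fix, here is a minimal Python sketch of the general idea: when parts are kept sorted by (namespace, day), the parts for one namespace form a contiguous range that two binary searches can locate, instead of a scan over every part. This is not ClickHouse's code (which is C++); the names Part, PartsIndex, select_linear, and select_binary are hypothetical and exist only for this example.

```python
from bisect import bisect_left, bisect_right
from typing import List, NamedTuple


class Part(NamedTuple):
    # Hypothetical stand-in for a data part; field order defines sort order.
    namespace: str
    day: str
    name: str


class PartsIndex:
    """Parts kept sorted by (namespace, day, name), an assumption of this sketch."""

    def __init__(self, parts: List[Part]) -> None:
        self._parts = sorted(parts)
        # Parallel list of namespace keys, built once, used for binary search.
        self._namespaces = [p.namespace for p in self._parts]

    def select_linear(self, namespace: str) -> List[Part]:
        # O(n): visits every part, so cost grows with the total part count.
        return [p for p in self._parts if p.namespace == namespace]

    def select_binary(self, namespace: str) -> List[Part]:
        # O(log n + k): find the contiguous range holding this namespace.
        lo = bisect_left(self._namespaces, namespace)
        hi = bisect_right(self._namespaces, namespace)
        return self._parts[lo:hi]
```

Both methods return the same parts; only the binary-search version keeps lookup cost nearly independent of the total number of parts, which matches the summary's point about breaking the link between part count and latency.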
Table of contents
The setup: a petabyte-scale analytics platform
The problem: one retention policy to rule them all
The solution: a new partitioning scheme
The mystery: when billing starts to break
The investigation: hunting bottlenecks with flame graphs
The fixes: a trio of patches
An uneasy truce