Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse

Cloudflare's billing pipeline started missing daily deadlines after the partitioning key of a ClickHouse table was changed from (day) to (namespace, day) to enable per-tenant retention. Standard metrics showed nothing wrong: I/O, memory, and parts read per query all looked normal. Using ClickHouse's trace_log and flame graphs, engineers discovered that more than half of query duration was spent waiting on a single exclusive mutex (MergeTreeData) that every query planner thread had to acquire in order to copy the full parts list. Three fixes were applied in sequence: (1) switching from an exclusive lock to a shared lock, which eliminated the contention immediately; (2) deferring the full vector copy by maintaining a shared cache, which reduced copy overhead; and (3) replacing a linear scan over all parts with a binary search on the namespace partition key, which cut query durations by 50% and broke the correlation between part count and latency. The fixes were contributed upstream and merged into ClickHouse 25.11. At peak, the cluster reached 160k parts per replica, yet query durations have remained stable.
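The third fix can be illustrated with a small sketch. ClickHouse itself is written in C++, and this Python version is not its implementation; the `Part` tuple, its fields, and both function names are hypothetical, chosen only to show why a binary search over parts sorted by namespace stops depending on the total part count while a linear scan does not.

```python
from typing import List, NamedTuple


class Part(NamedTuple):
    # Hypothetical stand-in for a data part; not ClickHouse's real structure.
    namespace: str
    day: str


def linear_select(parts: List[Part], namespace: str) -> List[Part]:
    """Inspect every part: cost grows with the total number of parts."""
    return [p for p in parts if p.namespace == namespace]


def _bound(parts: List[Part], namespace: str, upper: bool) -> int:
    """Return the first index whose namespace is >= (or > if upper) the target."""
    lo, hi = 0, len(parts)
    while lo < hi:
        mid = (lo + hi) // 2
        ns = parts[mid].namespace
        if ns < namespace or (upper and ns == namespace):
            lo = mid + 1
        else:
            hi = mid
    return lo


def binary_select(parts: List[Part], namespace: str) -> List[Part]:
    """Assumes `parts` is sorted by namespace: two O(log n) searches plus a slice."""
    start = _bound(parts, namespace, upper=False)
    end = _bound(parts, namespace, upper=True)
    return parts[start:end]


# Illustrative data only: tuples sort by namespace first, then day.
PARTS = sorted(
    [
        Part("acme", "2024-05-01"),
        Part("zeta", "2024-05-01"),
        Part("beta", "2024-05-02"),
        Part("acme", "2024-05-02"),
        Part("beta", "2024-05-01"),
    ]
)

if __name__ == "__main__":
    print(binary_select(PARTS, "beta"))
```

Both functions return the same parts for a given namespace. The difference is that `binary_select` touches only a logarithmic number of entries before slicing out the matching range, which is the property that breaks the link between part count and latency.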

#clickhouse

May 14•11m read time•From blog.cloudflare.com

Table of contents

The setup: a petabyte-scale analytics platform
The problem: one retention policy to rule them all
The solution: a new partitioning scheme
The mystery: when billing starts to break
The investigation: hunting bottlenecks with flame graphs
The fixes: a trio of patches
An uneasy truce
