ClickHouse Cloud introduces index sharding (distributed index analysis) for SharedMergeTree tables, distributing primary and secondary index analysis across replicas using consistent hashing. Previously, every replica loaded the full index into working memory — at petabyte scale this could mean 100–400 GiB per replica. With index sharding, each replica owns only a 1/N slice of the index, capping total fleet-wide index memory at the size of the index itself regardless of replica count. Benchmarks on a 50 billion row table show up to 7.7× faster index analysis for bloom filter lookups, 7.2× for vector search, and 4.3× for primary key range queries. The feature activates automatically when tables exceed configurable thresholds (default: 10 parts and 1 GB of index data). It is currently available in private preview on ClickHouse Cloud.
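The "1/N slice" property comes from consistent hashing: index entries are mapped onto a hash ring, and each replica owns the keys that fall on its ring segments, so adding or removing a replica only reshuffles a small fraction of the index. The actual ClickHouse implementation is not shown in this excerpt; the following is a minimal, hypothetical sketch of the idea, with made-up replica and part names for illustration.

```python
import hashlib
from bisect import bisect
from collections import Counter

def h64(key: str) -> int:
    """Stable 64-bit hash of a string key."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Toy consistent-hash ring: each replica is placed at many
    virtual-node positions; a key is owned by the first replica
    clockwise from the key's hash."""
    def __init__(self, replicas, vnodes=100):
        self.ring = sorted(
            (h64(f"{r}#vnode{i}"), r) for r in replicas for i in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    def owner(self, key: str) -> str:
        idx = bisect(self.keys, h64(key)) % len(self.ring)
        return self.ring[idx][1]

# Hypothetical example: shard index analysis for 1000 parts over 3 replicas.
parts = [f"part_{i}" for i in range(1000)]
ring3 = ConsistentHashRing(["replica-1", "replica-2", "replica-3"])
load = Counter(ring3.owner(p) for p in parts)   # each replica gets roughly 1/3

# Adding a fourth replica moves only ~1/4 of the parts to the newcomer;
# assignments among the original three are largely preserved.
ring4 = ConsistentHashRing(["replica-1", "replica-2", "replica-3", "replica-4"])
moved = sum(1 for p in parts if ring3.owner(p) != ring4.owner(p))
```

This is the property the post relies on for elastic scaling: total index memory across the fleet stays bounded by the index size, and replica membership changes invalidate only a proportional slice of the assignment, not the whole mapping.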

14 min read · From clickhouse.com
Table of contents

- How index analysis works
- The bottleneck: indexes don't scale horizontally alongside replicas
- Index sharding and the core concepts
- What happens when a replica is added to the service?
- How does ClickHouse protect itself from failure to analyze?
- How does index sharding reduce memory usage on replicas?
- How does index sharding increase analysis performance?
- What kind of workloads will this benefit the most?
