Building high-performance full-text search for object storage

ClickHouse engineers detail the redesigned full-text search index built to perform efficiently on object storage. The new design uses a block-based dictionary with front-coding compression, a sparse in-memory index for fast block lookup, and three posting list representations (Roaring Bitmaps, VarInt-encoded, and embedded) chosen adaptively by cardinality. Sequential access patterns replace the old FST-based random-read approach, enabling efficient merges and queries on remote storage. Query execution supports three modes: direct read (posting lists resolve queries without touching the text column), direct read with hint (posting lists narrow candidates before LIKE evaluation), and fallback granule-skipping. Configuration covers tokenizers (splitByNonAlpha, splitByString, ngrams, sparseGrams, array) and a preprocessor pipeline for lowercasing, accent removal, and HTML stripping. Benchmarks show over 7x speedup on array-tag queries. Upcoming work includes JSON column indexing, phrase search, and faster regex evaluation.
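To make the dictionary layout in the summary concrete, here is a minimal sketch of front-coding inside fixed-size blocks, with a sparse index holding the first term of each block. It is an illustration only, not ClickHouse's implementation: the function names, the block size of 4, and the in-memory tuple representation are all assumptions made for this example.

```python
from bisect import bisect_right


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_terms):
    """Store each term as (shared_prefix_length, suffix) relative to the previous term."""
    encoded = []
    prev = ""
    for term in sorted_terms:
        k = common_prefix_len(prev, term)
        encoded.append((k, term[k:]))
        prev = term
    return encoded


def front_decode(encoded):
    """Rebuild the terms by walking the block sequentially from its start."""
    terms = []
    prev = ""
    for k, suffix in encoded:
        term = prev[:k] + suffix
        terms.append(term)
        prev = term
    return terms


def build_blocks(sorted_terms, block_size=4):
    """Split the sorted dictionary into blocks (block_size is a hypothetical value).

    The sparse index keeps only the first term of every block, so it stays
    small enough to hold in memory.
    """
    sparse_index = []
    blocks = []
    for i in range(0, len(sorted_terms), block_size):
        chunk = sorted_terms[i:i + block_size]
        sparse_index.append(chunk[0])
        blocks.append(front_encode(chunk))
    return sparse_index, blocks


def contains(sparse_index, blocks, term):
    """Binary-search the sparse index, then decode only the one candidate block."""
    pos = bisect_right(sparse_index, term) - 1
    if pos < 0:
        return False
    return term in front_decode(blocks[pos])


terms = sorted(["click", "clickhouse", "cloud", "column",
                "index", "indexing", "merge", "query"])
sparse_index, blocks = build_blocks(terms)
```

In this sketch a lookup touches one block and reads it front to back, which mirrors the sequential access pattern the summary contrasts with FST-based random reads.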

#backend

#clickhouse

#full-text-search

Mar 24 • 32m read time • From clickhouse.com

Table of contents

How the query engine uses the text index

Defining the text index

Using the text index in queries

What next

What this means for ClickHouse Cloud users
