ClickHouse engineers detail the redesigned full-text search index built to perform efficiently on object storage. The new design uses a block-based dictionary with front-coding compression, a sparse in-memory index for fast block lookup, and three posting list representations (Roaring Bitmaps, VarInt-encoded, and embedded) chosen adaptively by cardinality. Sequential access patterns replace the old FST-based random-read approach, enabling efficient merges and queries on remote storage. Query execution supports three modes: direct read (posting lists resolve queries without touching the text column), direct read with hint (posting lists narrow candidates before LIKE evaluation), and fallback granule-skipping. Configuration covers tokenizers (splitByNonAlpha, splitByString, ngrams, sparseGrams, array) and a preprocessor pipeline for lowercasing, accent removal, and HTML stripping. Benchmarks show over 7x speedup on array-tag queries. Upcoming work includes JSON column indexing, phrase search, and faster regex evaluation.
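To make the front-coding idea mentioned above concrete, here is a minimal sketch of the general technique, not ClickHouse's actual on-disk format. Each term in a sorted dictionary block is stored as the length of the prefix it shares with the previous term, plus the remaining suffix. The function names `front_encode` and `front_decode` are hypothetical and chosen for illustration only.

```python
from os.path import commonprefix  # generic longest-common-prefix helper; works on any strings


def front_encode(sorted_terms):
    """Encode a sorted block of terms as (shared_prefix_len, suffix) pairs.

    Illustrative only: a real index would serialize these pairs into bytes.
    """
    encoded = []
    prev = ""
    for term in sorted_terms:
        shared = len(commonprefix([prev, term]))
        encoded.append((shared, term[shared:]))
        prev = term
    return encoded


def front_decode(encoded):
    """Rebuild the original terms by scanning the block sequentially."""
    terms = []
    prev = ""
    for shared, suffix in encoded:
        term = prev[:shared] + suffix
        terms.append(term)
        prev = term
    return terms


block = ["click", "clickhouse", "clickstream", "cloud"]
encoded = front_encode(block)
print(encoded)
print(front_decode(encoded) == block)
```

Because every term depends on the one before it, a block has to be decoded from its start. That is one reason a design like this pairs small blocks with a sparse index that points at block boundaries, so a lookup reads one block sequentially instead of making scattered random reads.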

32m read time. From clickhouse.com
Table of contents
How the query engine uses the text index
Defining the text index
Using the text index in queries
What next
What this means for ClickHouse Cloud users
