Cursor's engineering team details how they built a local regex search index to replace slow ripgrep invocations in large monorepos. The post covers the evolution of text indexing techniques: classic trigram inverted indexes (as used in google/codesearch and zoekt), suffix arrays (livegrep), probabilistic Bloom-filter-augmented trigram indexes (GitHub's Project Blackbird), and sparse n-grams with frequency-weighted hashing (used in ClickHouse and GitHub's new Code Search). Cursor chose sparse n-grams with a character-pair frequency table derived from terabytes of open-source code. The index is stored locally on the user's machine as mmap'd lookup tables backed by on-disk posting lists, and it is kept fresh by layering user and agent changes on top of a Git commit baseline. The result is dramatically faster regex search for AI agents working in large enterprise codebases.
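
To make the first technique in that list concrete, here is a minimal sketch of a classic trigram inverted index, restricted to literal substring queries. It is not Cursor's implementation and not the code from google/codesearch or zoekt; the names `trigrams`, `build_index`, and `search` are hypothetical, and compiling a full regex into a trigram query is deliberately left out.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, docs, literal):
    """Return the sorted ids of documents that contain the literal string."""
    grams = trigrams(literal)
    if not grams:
        # Queries shorter than 3 characters yield no trigrams,
        # so the index cannot filter: every document is a candidate.
        candidates = set(docs)
    else:
        # A document can only match if it contains every trigram of the query.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # The index only narrows the candidate set; confirm with a real scan.
    return sorted(d for d in candidates if literal in docs[d])
```

The index acts purely as a filter: intersecting posting lists can produce false positives (all trigrams present but not adjacent), which is why each candidate is still scanned before it is reported.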

Table of contents
The classic algorithm
Suffix Arrays: a detour
Trigram Queries with Probabilistic Masks
Sparse N-grams: Smarter Trigram Selection
All this, in your machine
Conclusions