How we're building indexes for regular expression search so agents can find text in large monorepos without the 15-second ripgrep waits.

Cursor's engineering team details how they built a local regex search index to replace slow ripgrep invocations in large monorepos. The post covers the evolution of text indexing techniques: classic trigram inverted indexes (as used in google/codesearch and zoekt), suffix arrays (livegrep), probabilistic bloom-filter-augmented trigram indexes (GitHub's Project Blackbird), and sparse n-grams with frequency-weighted hashing (used in ClickHouse and GitHub's new Code Search). Cursor chose sparse n-grams with a character-pair frequency table derived from terabytes of open-source code, stored locally on the user's machine using mmap'd lookup tables backed by on-disk posting lists. The index is kept fresh by layering user/agent changes on top of a Git commit baseline. The result is dramatically faster regex search for AI agents working in large enterprise codebases.
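To make the starting point concrete, here is a minimal sketch of a classic trigram inverted index, the technique the summary attributes to google/codesearch and zoekt. It is an illustration under stated assumptions, not Cursor's implementation: the `TrigramIndex` class, its method names, and the in-memory dictionaries are all hypothetical, and the caller is assumed to supply a literal that every match of the regex must contain (real tools derive this from the regex automatically).

```python
import re
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return every overlapping 3-character substring of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Hypothetical in-memory trigram index: trigram -> set of document ids."""

    def __init__(self) -> None:
        self.postings: dict[str, set[str]] = defaultdict(set)
        self.docs: dict[str, str] = {}

    def add(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def candidates(self, literal: str) -> set[str]:
        """Documents containing every trigram of literal (may include false positives)."""
        grams = trigrams(literal)
        if not grams:
            # Literal shorter than 3 characters: the index cannot narrow anything.
            return set(self.docs)
        return set.intersection(*(self.postings.get(g, set()) for g in grams))

    def search(self, pattern: str, required_literal: str) -> list[str]:
        """Filter by the index first, then confirm each candidate with the real regex."""
        regex = re.compile(pattern)
        return sorted(
            doc_id
            for doc_id in self.candidates(required_literal)
            if regex.search(self.docs[doc_id])
        )


index = TrigramIndex()
index.add("a.py", "def parse_config(path):")
index.add("b.py", "def parse_args(argv):")
index.add("c.py", "print('hello')")
matches = index.search(r"parse_\w+\(path", "parse_")
```

The index only narrows the set of files; the regex still runs on each candidate, so false positives from the trigram intersection are harmless. The speedup comes from never opening files such as `c.py` at all.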

Fast regex search: indexing text for agent tools · Cursor

# Trigram Queries with Probabilistic Masks

# Sparse N-grams: Smarter Trigram Selection
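As a rough illustration of the idea in this heading, the sketch below extracts sparse n-grams using the boundary rule commonly documented for this technique: every character pair gets a weight, and a substring qualifies when the pairs at its two ends weigh strictly more than every pair inside it. The function names are hypothetical, and CRC32 is only a stand-in weight; the post describes a character-pair frequency table, where rarer pairs would presumably receive higher weights. Cursor's actual selection rule may differ.

```python
import zlib
from typing import Callable


def pair_weight(pair: str) -> int:
    """Stand-in weight for a character pair (assumption: a real system
    would look this up in a frequency table instead of hashing)."""
    return zlib.crc32(pair.encode("utf-8"))


def sparse_ngrams(text: str, weight: Callable[[str], int] = pair_weight) -> list[str]:
    """Return all substrings whose two border pairs outweigh every inner pair."""
    # weights[i] belongs to the pair text[i:i+2].
    weights = [weight(text[i:i + 2]) for i in range(len(text) - 1)]
    grams = []
    for i in range(len(weights)):
        inner_max = float("-inf")  # largest weight strictly between pair i and pair j
        for j in range(i + 1, len(weights)):
            if weights[i] > inner_max and weights[j] > inner_max:
                # Pairs i..j together cover the characters text[i : j + 2].
                grams.append(text[i:j + 2])
            inner_max = max(inner_max, weights[j])
            if inner_max >= weights[i]:
                # Any longer span would contain a pair at least as heavy
                # as the left border, so nothing further can qualify.
                break
    return grams


# Fixed toy weights so the result is easy to follow by hand.
toy_weights = {"ab": 5, "bc": 1, "cd": 2, "de": 4}
grams = sparse_ngrams("abcde", weight=toy_weights.__getitem__)
# grams == ["abc", "abcd", "abcde", "bcd", "cde"]
```

Every plain trigram always qualifies, because two adjacent pairs have nothing between them. The longer grams appear only where heavy pairs bracket lighter ones, which is what lets a query look up one long, selective n-gram instead of intersecting many common trigrams.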