TurboQuant is a data-oblivious vector quantization method that compresses AI vectors (KV caches, embeddings, attention keys) to 2–4 bits per coordinate with near-optimal distortion and zero per-block metadata overhead. The core insight is that a random rotation transforms any input vector's coordinates into an approximately Gaussian distribution, allowing a single Lloyd–Max codebook, precomputed once, to be reused for every input. A QJL residual pass corrects the inner-product bias introduced by MSE-optimal quantization. The result is provably within a constant factor of Shannon's rate-distortion lower bound, matches full-precision Needle-in-a-Haystack recall on Llama-3.1-8B at 2 bits, and quantizes 100K vectors 4–6 orders of magnitude faster than Product Quantization or RaBitQ.
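As a rough illustration of the rotate-then-quantize idea (a minimal sketch, not the paper's implementation), here is a NumPy toy at 2 bits: apply a random orthogonal rotation, rescale so each coordinate looks standard-Gaussian, and snap every coordinate to the classic 4-level Lloyd–Max codebook for N(0,1) (levels ±0.4528, ±1.510 and thresholds 0, ±0.9816, from Max, 1960). All function names are hypothetical, the per-vector norm is kept as side information purely so this toy can reconstruct, and the QJL residual pass is omitted.

```python
import numpy as np

# Classic 4-level (2-bit) Lloyd-Max codebook for a standard Gaussian
# (reconstruction levels and decision thresholds from Max, 1960).
LEVELS = np.array([-1.510, -0.4528, 0.4528, 1.510])
THRESHOLDS = np.array([-0.9816, 0.0, 0.9816])

def random_rotation(d, seed=0):
    # Haar-distributed orthogonal matrix via QR of a Gaussian matrix.
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix so the distribution is uniform

def quantize(x, R):
    # Rotate, then rescale: coordinates of a rotated unit vector are
    # approximately N(0, 1/d), so multiplying by sqrt(d) makes them ~N(0, 1).
    d = x.shape[0]
    y = (R @ x) * (np.sqrt(d) / np.linalg.norm(x))
    codes = np.searchsorted(THRESHOLDS, y)  # 2-bit cell index per coordinate
    return codes, np.linalg.norm(x)         # norm stored only for this toy

def dequantize(codes, norm, R):
    d = len(codes)
    y_hat = LEVELS[codes] * (norm / np.sqrt(d))
    return R.T @ y_hat  # undo the rotation

d = 128
R = random_rotation(d)
x = np.random.default_rng(1).standard_normal(d)
codes, norm = quantize(x, R)
x_hat = dequantize(codes, norm, R)
print("relative MSE:", np.mean((x - x_hat) ** 2) / np.mean(x ** 2))
```

For a unit-variance Gaussian the 4-level Lloyd–Max quantizer's distortion is roughly 0.1175, so the printed relative MSE should land near that value. The key point of the sketch is that the codebook is fixed once and shared across all inputs; only the rotation makes that possible.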

Table of contents
- Eight ideas the rest of the page is built on.
- What is vector quantization, really?
- The adversarial coordinate, and why production systems pay a tax
- Multiply by a random rotation. Watch the spike dissolve.
- Coordinates of random unit vectors are nearly Gaussian.
- Lloyd–Max: the optimal partition of a known distribution.
- Putting it together: TurboQuant-MSE.
- MSE-optimal quantizers underestimate inner products.
- If the bias is a known number, multiply it out.
- How close is TurboQuant to the theoretical best?
- Concrete wins in LLM inference and vector search.
