TurboQuant is a data-oblivious vector quantization method that compresses AI vectors (KV caches, embeddings, attention keys) to 2–4 bits per coordinate with near-optimal distortion and zero per-block metadata overhead. The core insight is that a random rotation transforms any input vector's coordinates into an approximately Gaussian distribution, allowing a single Lloyd–Max codebook, precomputed once, to be reused for every input. A QJL residual pass corrects the inner-product bias introduced by MSE-optimal quantization. The result is provably within a constant factor of Shannon's rate-distortion lower bound, matches full-precision Needle-in-a-Haystack recall on Llama-3.1-8B at 2 bits, and quantizes 100K vectors 4–6 orders of magnitude faster than Product Quantization or RaBitQ.
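As a rough illustration of the rotate-then-quantize idea (a minimal sketch, not the paper's implementation), here is a NumPy toy at 2 bits: apply a random orthogonal rotation, rescale so each coordinate looks standard-Gaussian, and snap every coordinate to the classic 4-level Lloyd–Max codebook for N(0,1) (levels ±0.4528, ±1.510 and thresholds 0, ±0.9816, from Max, 1960). All function names are hypothetical, the per-vector norm is kept as side information purely so this toy can reconstruct, and the QJL residual pass is omitted.

```python
import numpy as np

# Classic 4-level (2-bit) Lloyd-Max codebook for a standard Gaussian
# (reconstruction levels and decision thresholds from Max, 1960).
LEVELS = np.array([-1.510, -0.4528, 0.4528, 1.510])
THRESHOLDS = np.array([-0.9816, 0.0, 0.9816])

def random_rotation(d, seed=0):
    # Haar-distributed orthogonal matrix via QR of a Gaussian matrix.
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix so the distribution is uniform

def quantize(x, R):
    # Rotate, then rescale: coordinates of a rotated unit vector are
    # approximately N(0, 1/d), so multiplying by sqrt(d) makes them ~N(0, 1).
    d = x.shape[0]
    y = (R @ x) * (np.sqrt(d) / np.linalg.norm(x))
    codes = np.searchsorted(THRESHOLDS, y)  # 2-bit cell index per coordinate
    return codes, np.linalg.norm(x)         # norm stored only for this toy

def dequantize(codes, norm, R):
    d = len(codes)
    y_hat = LEVELS[codes] * (norm / np.sqrt(d))
    return R.T @ y_hat  # undo the rotation

d = 128
R = random_rotation(d)
x = np.random.default_rng(1).standard_normal(d)
codes, norm = quantize(x, R)
x_hat = dequantize(codes, norm, R)
print("relative MSE:", np.mean((x - x_hat) ** 2) / np.mean(x ** 2))
```

For a unit-variance Gaussian the 4-level Lloyd–Max quantizer's distortion is roughly 0.1175, so the printed relative MSE should land near that value. The key point of the sketch is that the codebook is fixed once and shared across all inputs; only the rotation makes that possible.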

Table of contents
- Eight ideas the rest of the page is built on.
- What is vector quantization, really?
- The adversarial coordinate, and why production systems pay a tax
- Multiply by a random rotation. Watch the spike dissolve.
- Coordinates of random unit vectors are nearly Gaussian.
- Lloyd–Max: the optimal partition of a known distribution.
- Putting it together: TurboQuant-MSE.
- MSE-optimal quantizers underestimate inner products.
- If the bias is a known number, multiply it out.
- How close is TurboQuant to the theoretical best?
- Concrete wins in LLM inference and vector search.
