TurboQuant is a new algorithmic suite from Google that compresses the KV cache in large language models and vector search engines down to 3 bits, without accuracy loss and without retraining the model. It uses a two-stage process: PolarQuant maps vectors to polar coordinates, eliminating the memory overhead of storing quantization constants, and QJL (Quantized Johnson-Lindenstrauss) applies 1-bit compression that removes the residual bias introduced in the first stage. Together, these techniques yield unbiased attention-score estimators with strong theoretical guarantees, achieving distortion close to the theoretical lower bound.
Table of contents
- Introduction
- TurboQuant in a Nutshell
- Inside the KV Compression Process
- Final Considerations
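To make the 1-bit stage concrete, here is a minimal NumPy sketch of the idea behind QJL-style quantization, under assumptions of my own: the function names and parameters are illustrative, not the paper's actual implementation. A key vector is projected with random Gaussian directions and stored as one sign bit per projection plus its norm, while queries stay in full precision; the Gaussian identity E[sign(g·k)(g·q)] = sqrt(2/π)·⟨k, q⟩/‖k‖ then gives an unbiased inner-product estimator.

```python
import numpy as np

def quantize_key(key, proj):
    """1-bit code: keep only the signs of the projected key, plus its norm."""
    return np.sign(proj @ key), np.linalg.norm(key)

def estimate_dot(sign_bits, key_norm, query, proj):
    """Unbiased estimate of <key, query> recovered from the 1-bit code."""
    m = proj.shape[0]
    return np.sqrt(np.pi / 2) * key_norm * (sign_bits @ (proj @ query)) / m

rng = np.random.default_rng(0)
d, m = 64, 20000                             # vector dimension, bits per key
key = rng.standard_normal(d)
query = key + 0.3 * rng.standard_normal(d)   # partially aligned query
proj = rng.standard_normal((m, d))           # shared random projection

bits, norm = quantize_key(key, proj)
est = estimate_dot(bits, norm, query, proj)
true = key @ query
print(true, est)  # the estimate concentrates around the true dot product
```

With m projections the estimator's standard deviation shrinks like 1/sqrt(m), which is why the sketch uses a large m to make the estimate visibly close to the exact dot product; a practical system would use far fewer bits per key.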