Google Research introduces TurboQuant, a theoretically grounded quantization algorithm for compressing large language model KV caches and vector search indices. TurboQuant combines two sub-algorithms: PolarQuant, which converts vectors to polar coordinates to eliminate quantization overhead, and QJL (Quantized Johnson-Lindenstrauss), a 1-bit error-correction step with zero memory overhead. Together they achieve a more than 6x reduction in KV-cache memory with no accuracy loss and no fine-tuning, plus up to an 8x speedup in attention computation over unquantized 32-bit keys on H100 GPUs. Benchmarks on Gemma and Mistral across LongBench, Needle In A Haystack, and other long-context tasks show near-lossless performance. TurboQuant also outperforms state-of-the-art vector search baselines (PQ, RaBitQ) in recall without dataset-specific tuning.
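The QJL component builds on 1-bit random-projection sketches in the Johnson-Lindenstrauss family. The sketch below is not the paper's algorithm, only a minimal illustration of the underlying idea (names and parameters are hypothetical): project vectors with a random Gaussian matrix, keep just the sign bits, and recover the angle between two vectors from the sign-mismatch rate, using the classic identity P[sign mismatch] = angle / pi.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 64, 4096                      # original dimension, sketch dimension (illustrative)
S = rng.standard_normal((m, d))      # random Gaussian projection matrix

def one_bit_sketch(v):
    """Quantize a vector to 1 bit per projected coordinate: keep only signs."""
    return np.sign(S @ v)            # m sign bits instead of d float32 values

def estimate_angle(bits_a, bits_b):
    """Estimate the angle between the original vectors from sign disagreement,
    via P[sign mismatch] = angle / pi for a random Gaussian hyperplane."""
    mismatch_rate = np.mean(bits_a != bits_b)
    return mismatch_rate * np.pi

a = rng.standard_normal(d)
b = rng.standard_normal(d)
true_angle = np.arccos(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
approx_angle = estimate_angle(one_bit_sketch(a), one_bit_sketch(b))
print(true_angle, approx_angle)      # the two values should be close for large m
```

With m = 4096 bits the mismatch-rate estimator concentrates tightly around the true angle; the paper's contribution is an unbiased, zero-memory-overhead correction on top of a coarser base quantizer, which this toy sketch does not attempt to reproduce.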

7m read time · From research.google