Google Research has unveiled TurboQuant, a quantization algorithm that compresses LLM Key-Value caches by up to 6x using a two-step approach: a randomized Hadamard transform to normalize value distributions, followed by the Quantized Johnson-Lindenstrauss (QJL) transform to remove bias. At 3.5-bit compression, it matches 16-bit precision accuracy on benchmarks like LongBench and Needle in a Haystack across Gemma and Mistral models, with no retraining required. The practical impact is significant: a Llama 70B model with a 1M-token context window's KV cache shrinks from 328GB to ~72GB, enabling single H100 deployment. Community benchmarks suggest more realistic real-world gains of 30-40% in memory reduction and speed rather than the theoretical 6x maximum.

4m read timeFrom infoq.com
Post cover image

Sort: