

InfoQ

Google Research has unveiled TurboQuant, a quantization algorithm that compresses LLM Key-Value (KV) caches by up to 6x. It works in two steps: a randomized Hadamard transform normalizes the distribution of values, and the Quantized Johnson-Lindenstrauss (QJL) transform then removes the bias that quantization introduces. At 3.5-bit compression it matches 16-bit accuracy on benchmarks such as LongBench and Needle in a Haystack across Gemma and Mistral models, with no retraining required. The practical impact is significant: for a Llama 70B model with a 1M-token context window, the KV cache shrinks from 328 GB to about 72 GB (roughly 4.6x), enabling deployment on a single H100. Community benchmarks suggest that real-world gains are closer to 30-40% in memory reduction and speed than to the theoretical 6x maximum.
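To give a feel for why a randomized Hadamard transform helps before quantization, here is a minimal pure-Python sketch. It is not TurboQuant itself: the uniform min-max quantizer is a stand-in chosen for illustration, the QJL bias-correction step is omitted entirely, and all function names (`fwht`, `rotate`, `unrotate`, `quantize_dequantize`) are hypothetical. It only shows that rotating a vector with one large outlier spreads its energy across coordinates, so a low-bit quantizer wastes less of its range.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def random_signs(n, seed=0):
    """Random +/-1 diagonal that makes the Hadamard transform 'randomized'."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rotate(x, signs):
    """Apply the sign flips, then the Hadamard transform."""
    return fwht([s * v for s, v in zip(signs, x)])


def unrotate(y, signs):
    """Invert rotate(): the orthonormal Hadamard matrix is its own inverse."""
    return [s * v for s, v in zip(signs, fwht(y))]


def quantize_dequantize(x, bits):
    """Stand-in uniform min-max quantizer (an assumption, not TurboQuant's)."""
    levels = 2 ** bits - 1
    lo, hi = min(x), max(x)
    step = (hi - lo) / levels or 1.0
    return [lo + round((v - lo) / step) * step for v in x]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# A 128-dimensional vector of Gaussian noise with one large outlier.
rng = random.Random(42)
x = [rng.gauss(0.0, 1.0) for _ in range(128)]
x[0] = 50.0

signs = random_signs(len(x), seed=1)
rotated = rotate(x, signs)

# Quantize directly vs. quantize in the rotated space and rotate back.
direct = quantize_dequantize(x, bits=4)
via_rotation = unrotate(quantize_dequantize(rotated, bits=4), signs)

mse_direct = mse(x, direct)
mse_rotated = mse(x, via_rotation)
print(f"max |x| before rotation: {max(abs(v) for v in x):.2f}")
print(f"max |x| after rotation:  {max(abs(v) for v in rotated):.2f}")
print(f"4-bit MSE, direct:       {mse_direct:.4f}")
print(f"4-bit MSE, with rotation:{mse_rotated:.4f}")
```

Because the rotation is orthonormal, it preserves vector norms and inner products, which is why it can be applied to keys and values without changing what attention computes before quantization error is introduced.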
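The 328 GB and 72 GB figures can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes the publicly documented Llama 70B attention layout (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) and takes "1M tokens" to mean 1,000,000; neither assumption is stated in the summary above. It also ignores any per-block metadata such as scales, so real footprints may be somewhat larger.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_value):
    """Size of a KV cache in bytes: one K and one V tensor per layer."""
    values = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return values * bits_per_value / 8


# Assumed Llama 70B configuration (not given in the article summary).
LLAMA_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)
TOKENS = 1_000_000

fp16_gb = kv_cache_bytes(**LLAMA_70B, num_tokens=TOKENS, bits_per_value=16) / 1e9
q35_gb = kv_cache_bytes(**LLAMA_70B, num_tokens=TOKENS, bits_per_value=3.5) / 1e9

print(f"16-bit KV cache:  {fp16_gb:.2f} GB")           # 327.68 GB
print(f"3.5-bit KV cache: {q35_gb:.2f} GB")            # 71.68 GB
print(f"compression:      {fp16_gb / q35_gb:.2f}x")    # 4.57x
```

Under these assumptions the ratio at 3.5 bits is 16 / 3.5, or about 4.6x, which is consistent with the quoted 328 GB to 72 GB reduction and sits below the headline "up to 6x" figure.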

Google’s TurboQuant Compression May Support Faster Inference, Same Accuracy on Less Capable Hardware