Google's TurboQuant is a new vector quantization algorithm for compressing LLM key-value (KV) caches, claiming up to 6x memory reduction with no negative impact on inference times. It combines two algorithms — PolarQuant (which applies a polar coordinates transformation to minimize quantization error) and QJL (a quantized Johnson-Lindenstrauss transform). The resulting format uses 3-bit values, theoretically 25% smaller than NVIDIA's NVFP4 approach. However, the author notes that Google's benchmarking data is vague, lacks direct comparisons with NVFP4, and the claims may be overhyped — suggesting TurboQuant is likely an incremental improvement rather than a breakthrough, with independent benchmarking still needed.

6m read timeFrom hackaday.com
Post cover image
Table of contents
Key-Value CacheTurbo QuantizationJudging On Merits

Sort: