Google’s TurboQuant Compression May Support Faster Inference, Same Accuracy on Less Capable Hardware
Google Research has unveiled TurboQuant, a quantization algorithm that compresses LLM key-value (KV) caches by up to 6x using a two-step approach: a randomized Hadamard transform to normalize value distributions, followed by the Quantized Johnson-Lindenstrauss (QJL) transform to remove bias. At 3.5-bit compression, it matches 16-bit accuracy.
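To make the two-step idea concrete, here is a minimal sketch in Python of how a randomized Hadamard transform can flatten a vector's value distribution before low-bit quantization. This is an illustrative assumption of the general technique, not Google's implementation: the function names, the 4-bit uniform rounding scheme, and the omission of the QJL debiasing step are all simplifications.

```python
import numpy as np

def randomized_hadamard(x: np.ndarray, signs: np.ndarray) -> np.ndarray:
    """Random sign flips followed by an orthonormal fast Walsh-Hadamard
    transform. Assumes len(x) is a power of two."""
    y = (x * signs).copy()
    n, h = len(y), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[i:i + h].copy()
            b = y[i + h:i + 2 * h].copy()
            y[i:i + h] = a + b
            y[i + h:i + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(n)

def quantize(y: np.ndarray, bits: int = 4):
    """Uniform scalar quantization of the rotated vector (illustrative only)."""
    levels = 2 ** bits - 1
    scale = np.abs(y).max() / (levels / 2) + 1e-12
    q = np.clip(np.round(y / scale), -(levels // 2), levels // 2).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float, signs: np.ndarray) -> np.ndarray:
    """Undo quantization, then invert the rotation; the orthonormal Hadamard
    transform is its own inverse, and the sign flips are undone by reapplying them."""
    y = q.astype(np.float32) * scale
    return randomized_hadamard(y, np.ones_like(signs)) * signs

# Usage: quantize a 128-dim vector to 4 bits and check the reconstruction error.
rng = np.random.default_rng(0)
x = rng.normal(size=128).astype(np.float32)
signs = rng.choice([-1.0, 1.0], size=128).astype(np.float32)
q, scale = quantize(randomized_hadamard(x, signs), bits=4)
x_hat = dequantize(q, scale, signs)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

The rotation spreads each vector's energy roughly evenly across coordinates, which is what lets a simple scalar quantizer work well; in TurboQuant's description, the QJL-based second step then removes the bias that such rounding introduces.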