A comprehensive benchmarking study comparing TurboQuant KV-cache quantization variants against FP8 and BF16 baselines across four large models (30B–200B+ parameters) and five benchmarks. Key findings: FP8 remains the best default for KV-cache quantization, doubling KV-cache capacity relative to BF16 with negligible accuracy loss and no throughput penalty. TurboQuant k8v4 offers only marginal gains over FP8 while degrading throughput and latency. TurboQuant 4bit-nc is viable for memory-constrained deployments with moderate accuracy tradeoffs. Aggressive variants (k3v4-nc, 3bit-nc) show up to 20-point accuracy drops on reasoning tasks and substantial performance degradation, making them unsuitable for production. The study covers latency, throughput, TPOT, and TTFT metrics under various serving conditions on H100 GPUs using vLLM.
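For readers who want to apply the study's main recommendation, a minimal sketch of enabling FP8 KV-cache quantization in vLLM is shown below. The model name is a placeholder, and the exact flag values depend on your vLLM version and hardware; on H100-class GPUs, `fp8` selects a hardware-supported FP8 format.

```shell
# Serve a model with the KV cache stored in FP8 instead of BF16,
# roughly doubling the number of tokens the cache can hold.
# (Model name is illustrative; substitute your own checkpoint.)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --kv-cache-dtype fp8

# The same option is available via the Python API:
#   from vllm import LLM
#   llm = LLM(model="...", kv_cache_dtype="fp8")
```

Since the study found FP8 incurs no measurable throughput penalty, this single flag is typically the lowest-risk way to increase effective KV-cache capacity before considering more aggressive quantization.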

10 min read · From vllm.ai
Table of contents

- Introduction
- Experimental Setup
- Accuracy Results
- Performance Results
- Key Findings and Recommendations
