A comprehensive benchmarking study comparing TurboQuant KV-cache quantization variants against FP8 and BF16 baselines across four large models (30B–200B+ parameters) and five benchmarks. Key findings: FP8 remains the best default for KV-cache quantization, doubling KV-cache capacity relative to BF16 with negligible accuracy loss and no throughput penalty. TurboQuant k8v4 offers only marginal gains over FP8 while degrading throughput and latency. TurboQuant 4bit-nc is viable for memory-constrained deployments with moderate accuracy tradeoffs. Aggressive variants (k3v4-nc, 3bit-nc) show up to 20-point accuracy drops on reasoning tasks and substantial performance degradation, making them unsuitable for production. The study covers latency, throughput, TPOT, and TTFT metrics under various serving conditions on H100 GPUs using vLLM.
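For readers who want to apply the study's main recommendation, a minimal sketch of enabling FP8 KV-cache quantization in vLLM is shown below. The model name is a placeholder, and the exact flag values depend on your vLLM version and hardware; on H100-class GPUs, `fp8` selects a hardware-supported FP8 format.

```shell
# Serve a model with the KV cache stored in FP8 instead of BF16,
# roughly doubling the number of tokens the cache can hold.
# (Model name is illustrative; substitute your own checkpoint.)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --kv-cache-dtype fp8

# The same option is available via the Python API:
#   from vllm import LLM
#   llm = LLM(model="...", kv_cache_dtype="fp8")
```

Since the study found FP8 incurs no measurable throughput penalty, this single flag is typically the lowest-risk way to increase effective KV-cache capacity before considering more aggressive quantization.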

10 min read · From vllm.ai
Table of contents

- Introduction
- Experimental Setup
- Accuracy Results
- Performance Results
- Key Findings and Recommendations
