A technical deep-dive into FP8 KV-cache and attention quantization in vLLM, covering bugs found and fixed, performance benchmarks, and accuracy evaluations. Key findings: a critical accuracy regression on Hopper GPUs at long contexts (needle-in-a-haystack accuracy fell from 91% to 13%) was traced to precision loss in Tensor Core FP32 accumulation and fixed with two-level accumulation. After the improvements, FP8's inter-token latency (ITL) slope is 54% of BF16's for Llama-3.1-8B (break-even at ~7k tokens), and FP8 delivers 14.9% higher output throughput under load. For hybrid models with sliding-window attention, a new --kv-cache-dtype-skip-layers flag avoids quantizing layers that don't benefit. Accuracy benchmarks across Llama, Qwen3, and MoE models show 94–99% recovery of baseline scores. Caveats: prefill regresses on models with head_dim=256, and models using non-standard attention backends (e.g., FlashMLA) may need calibrated scales.
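To make the accumulation issue concrete, here is a minimal sketch of why a narrow accumulator loses precision over long sums and how two-level accumulation helps. It is an analogy, not the vLLM or Tensor Core implementation: IEEE half precision stands in for the reduced-precision accumulator, a Python float stands in for the wide one, and the names `to_half`, `naive_sum`, `two_level_sum`, and the chunk size of 128 are all assumptions made for illustration.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (round-half-to-even).

    Half precision is only a stand-in here for a narrow hardware accumulator.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def naive_sum(values):
    """Accumulate every element in the narrow format."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + v)  # each partial result is rounded to half
    return acc


def two_level_sum(values, chunk=128):
    """Accumulate short chunks in the narrow format, then add each chunk's
    partial sum into a wide accumulator (a Python float in this sketch)."""
    total = 0.0
    for start in range(0, len(values), chunk):
        partial = 0.0
        for v in values[start:start + chunk]:
            partial = to_half(partial + v)  # stays small, so stays exact here
        total += partial  # promotion step: wide accumulation
    return total


values = [1.0] * 8192
naive = naive_sum(values)
two_level = two_level_sum(values)
print(naive, two_level)
```

In this toy case the naive sum stalls once the running total is large enough that adding 1.0 rounds back to the same value, while the chunked version keeps each narrow partial sum small and exact. The longer the reduction, the larger the gap, which mirrors why the regression in the post appears at long contexts.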

17 min read. From vllm.ai
Table of contents
Introduction
The Problems We Found
Kernel and vLLM Improvements
Performance Benchmarking
Accuracy Benchmarking
When to Avoid FP8 KV-Cache
