A technical deep-dive into FP8 KV-cache and attention quantization in vLLM, covering bugs found and fixed, performance benchmarks, and accuracy evaluations. Key findings: a critical accuracy regression on Hopper GPUs at long contexts (91% → 13% on needle-in-a-haystack) was traced to FP32 accumulation precision loss in Tensor Cores and fixed via two-level accumulation. After the improvements, FP8's inter-token latency (ITL) grows with context length at 54% of BF16's slope for Llama-3.1-8B (break-even at ~7k tokens), and FP8 delivers 14.9% higher output throughput under load. For hybrid models with sliding-window attention, a new --kv-cache-dtype-skip-layers flag skips quantization for layers that don't benefit from it. Accuracy benchmarks across Llama, Qwen3, and MoE models show 94–99% recovery of baseline scores. Caveats: prefill regresses on head_dim=256 models, and models using non-standard attention backends (e.g., FlashMLA) may need calibrated scales.
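To make the idea of two-level accumulation concrete, here is a minimal pure-Python sketch. It is not the vLLM kernel: it uses IEEE half precision (via `struct`'s `e` format) as a stand-in for a limited-precision hardware accumulator, and the function names and the chunk size of 32 are assumptions chosen for illustration. The point is only that summing short chunks in low precision and promoting each partial sum to a wider accumulator keeps the running total from stalling as it grows.

```python
import struct


def round_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value.

    Stand-in for a limited-precision accumulator (an assumption for this toy).
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_single_level(values):
    """Accumulate everything in the low-precision accumulator."""
    acc = 0.0
    for v in values:
        acc = round_fp16(acc + v)  # every add is rounded to low precision
    return acc


def sum_two_level(values, chunk=32):
    """Accumulate short chunks in low precision, then add each partial sum
    into a high-precision (Python float, i.e. float64) outer accumulator."""
    total = 0.0
    for start in range(0, len(values), chunk):
        partial = 0.0
        for v in values[start:start + chunk]:
            partial = round_fp16(partial + v)  # inner level: low precision
        total += partial  # outer level: high precision
    return total


values = [0.01] * 8192  # many small terms, like a long-context reduction
exact = sum(values)
single = sum_single_level(values)
two_level = sum_two_level(values)

print(f"exact        = {exact:.4f}")
print(f"single-level = {single:.4f}")
print(f"two-level    = {two_level:.4f}")
```

In the single-level version, once the running sum is large enough that each new term is below half the accumulator's spacing, further additions are rounded away and the sum stops growing. The two-level version keeps each low-precision partial sum small, so the rounding error per chunk stays bounded.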
Table of contents
Introduction
The Problems We Found
Kernel and vLLM Improvements
Performance Benchmarking
Accuracy Benchmarking
When to Avoid FP8 KV-Cache