Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each

vLLM

A technical deep dive into FP8 KV-cache and attention quantization in vLLM, covering the bugs found and fixed, performance benchmarks, and accuracy evaluations. Key findings: a critical accuracy regression on Hopper GPUs at long contexts (needle-in-a-haystack accuracy falling from 91% to 13%) was traced to precision loss in the Tensor Cores' FP32 accumulation and fixed with two-level accumulation. After the improvements, FP8's ITL slope (the growth of inter-token latency with context length) for Llama-3.1-8B is 54% of BF16's, with break-even at ~7k tokens, and FP8 delivers 14.9% higher output throughput under load. For hybrid models with sliding-window attention, a new --kv-cache-dtype-skip-layers flag skips quantization for layers that don't benefit from it. Accuracy benchmarks across Llama, Qwen3, and MoE models show 94–99% recovery of baseline scores. Caveats: prefill regresses on models with head_dim=256, and models that use non-standard attention backends (e.g., FlashMLA) may need calibrated scales.
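To make the two-level accumulation idea concrete, here is a minimal, illustrative sketch in plain Python. It is not vLLM's kernel code. It assumes a low-precision accumulator can be mimicked by rounding to IEEE half precision after every add (via the standard `struct` module), and it uses a Python float as a stand-in for the wider outer accumulator. The helper names (`to_fp16`, `naive_sum`, `two_level_sum`) and the chunk size of 128 are hypothetical choices made for this example only.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value.

    Assumption: half precision stands in for a reduced-precision
    hardware accumulator; real Tensor Core behavior differs.
    """
    return struct.unpack("e", struct.pack("e", x))[0]


def naive_sum(values):
    """Accumulate every term in one low-precision accumulator."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + v)  # rounding error is applied on every add
    return acc


def two_level_sum(values, chunk=128):
    """Accumulate short chunks in low precision, then fold each
    partial sum into a wider accumulator (a Python float here)."""
    total = 0.0
    for start in range(0, len(values), chunk):
        partial = 0.0
        for v in values[start:start + chunk]:
            partial = to_fp16(partial + v)  # inner, low-precision level
        total += partial  # outer, higher-precision level
    return total


# Many small, equal contributions, loosely like a long reduction
# over a long context.
values = [1.0] * 8192
exact = float(len(values))
naive = naive_sum(values)
two_level = two_level_sum(values)
print(f"naive={naive} two_level={two_level} exact={exact}")
```

In this toy setup the single low-precision accumulator stops growing once the running sum is large enough that adding 1.0 rounds away, while the chunked version keeps every partial sum small enough to stay exact. The same intuition plausibly explains why the error only becomes visible at long contexts, where the reduction runs over many more terms.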