vLLM now supports the DeepSeek V4 models (V4-Pro at 1.6T parameters and V4-Flash at 285B), both with context windows of up to 1 million tokens. The post explains DeepSeek V4's new attention mechanism, which combines shared key-value vectors, compressed KV caches (c4a at 4x compression and c128a at 128x), sparse attention, and a short sliding window for locality; together these yield an 8.7x KV cache reduction versus DeepSeek V3.2. The vLLM implementation handles the heterogeneous attention types with a unified 256-token logical block size, compressor state managed as sliding-window KV, and page-size bucketing into three pools. GPU efficiency comes from three kernel fusions (delivering 1.4–20x speedups), multi-stream parallelism (a 5–6% latency reduction), and CUDA graph integration.
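
As a rough illustration of what running one of these models might look like, here is a minimal sketch using vLLM's offline Python API. The model id "deepseek-ai/DeepSeek-V4-Flash", the 1M-token max_model_len, and the parallelism settings are assumptions for illustration, not values taken from the post; substitute the actual released checkpoint and a context length your hardware can hold.

```python
# Minimal sketch: serving a long-context model with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical HF repo id
    max_model_len=1_000_000,                # up to 1M tokens per the post; lower if memory-bound
    tensor_parallel_size=8,                 # adjust to your GPU count
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the attention changes in DeepSeek V4."], params)
print(outputs[0].outputs[0].text)
```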

Table of contents
- Running DeepSeek V4 on vLLM
- DeepSeek V4's Attention Mechanism Explained
- vLLM's Implementation of DeepSeek V4
- Planned Work
- Acknowledgments
- Appendix: The Math behind DeepSeek V4's Attention Mechanism
