vLLM now supports the DeepSeek V4 models (V4-Pro at 1.6T parameters and V4-Flash at 285B), both with context windows of up to 1 million tokens. The post explains DeepSeek V4's new attention mechanism, which combines shared key-value vectors, compressed KV caches (c4a at 4x compression and c128a at 128x), sparse attention, and a short sliding window for locality; together these yield an 8.7x KV cache reduction versus DeepSeek V3.2. The vLLM implementation handles the heterogeneous attention types with a unified 256-token logical block size, compressor state managed as sliding-window KV, and page-size bucketing into three pools. GPU efficiency comes from three kernel fusions (delivering 1.4–20x speedups), multi-stream parallelism (a 5–6% latency reduction), and CUDA graph integration.
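
As a rough illustration of what running one of these models might look like, here is a minimal sketch using vLLM's offline Python API. The model id "deepseek-ai/DeepSeek-V4-Flash", the 1M-token max_model_len, and the parallelism settings are assumptions for illustration, not values taken from the post; substitute the actual released checkpoint and a context length your hardware can hold.

```python
# Minimal sketch: serving a long-context model with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical HF repo id
    max_model_len=1_000_000,                # up to 1M tokens per the post; lower if memory-bound
    tensor_parallel_size=8,                 # adjust to your GPU count
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the attention changes in DeepSeek V4."], params)
print(outputs[0].outputs[0].text)
```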

Table of contents
- Running DeepSeek V4 on vLLM
- DeepSeek V4's Attention Mechanism Explained
- vLLM's Implementation of DeepSeek V4
- Planned Work
- Acknowledgments
- Appendix: The Math behind DeepSeek V4's Attention Mechanism
