vLLM 0.11.0 introduced KV cache offloading to CPU memory, significantly improving LLM inference throughput by avoiding expensive recomputation. The feature uses an asynchronous connector API with DMA-based transfers, achieving 2-22x faster time-to-first-token for cached prompts and up to 9x throughput gains for concurrent requests. Version 0.12.0 improved performance by 4-5x through memory layout optimization, consolidating fragmented per-layer blocks into contiguous 0.5-2MB physical blocks. Benchmarks on H100 GPUs show DMA outperforms custom CUDA kernels for end-to-end throughput despite slightly higher latency, as it minimizes interference with model computation.
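As a sketch of how the offloading connector described above might be enabled, the command below follows vLLM's `--kv-transfer-config` convention; the connector name and extra-config keys are assumptions and may differ across versions, so check the release docs before use:

```shell
# Hedged sketch: serve a model with KV-cache offloading to CPU memory.
# "OffloadingConnector" and the extra-config keys are assumed names
# based on vLLM's KVTransferConfig JSON format, not verified here.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-transfer-config '{
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"num_cpu_blocks": 10000}
  }'
```

The size of the CPU-side pool (here `num_cpu_blocks`) bounds how many evicted KV blocks can be retained for reuse instead of being recomputed.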

From blog.vllm.ai · 14 min read
Table of contents

- Motivation
- The New Offloading Connector
- Benefits of CPU Offloading via the Offloading Connector
- Evaluating GPU-CPU Transfer Techniques
- Changing vLLM's Memory Layout
- End-to-end Evaluation of Copy Methods
- Evaluation Setup and Benchmark Code
