vLLM 0.11.0 introduced KV cache offloading to CPU memory, improving LLM inference throughput by avoiding expensive recomputation of previously processed prompts. The feature uses an asynchronous connector API with DMA-based transfers, achieving 2-22x faster time-to-first-token for cached prompts and up to 9x higher throughput for concurrent requests. Version 0.12.0 improved performance by a further 4-5x via a memory layout optimization that consolidates fragmented per-layer blocks into contiguous 0.5-2MB physical blocks. Benchmarks on H100 GPUs show that DMA outperforms custom CUDA copy kernels in end-to-end throughput despite slightly higher transfer latency, because it minimizes interference with model computation.
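To make the summary concrete, here is a minimal sketch of how CPU offloading might be enabled through vLLM's KV-transfer configuration. The connector name, extra-config keys, and the `num_cpu_blocks` value shown are illustrative assumptions and may differ between vLLM versions; consult the docs for the release you run.

```python
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Illustrative configuration (field names follow vLLM's KV-transfer config;
# the connector name and extra-config keys are assumptions for this sketch).
kv_config = KVTransferConfig(
    kv_connector="OffloadingConnector",
    kv_role="kv_both",
    # Size of the CPU-side KV block pool; tune to available host memory.
    kv_connector_extra_config={"num_cpu_blocks": 8192},
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=kv_config,
)

prompt = "Summarize the history of the Roman Empire."
params = SamplingParams(max_tokens=64)

# The first request pays the full prefill cost. If the prompt's KV blocks
# are later evicted from GPU memory and offloaded to CPU, a repeated request
# can reload them over DMA instead of recomputing the prefill.
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```

Note that offloading is transparent to the caller: requests are submitted as usual, and the engine decides per block whether to recompute, reuse from GPU, or fetch from the CPU pool.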
Table of contents
- Motivation
- The New Offloading Connector
- Benefits of CPU Offloading via the Offloading Connector
- Evaluating GPU-CPU Transfer Techniques
- Changing vLLM’s Memory Layout
- End-to-end Evaluation of Copy Methods
- Evaluation Setup and Benchmark Code