vLLM 0.11.0 introduced KV cache offloading to CPU memory, improving LLM inference throughput by avoiding expensive recomputation of previously processed prompts. The feature uses an asynchronous connector API with DMA-based transfers, achieving 2-22x faster time-to-first-token for cached prompts and up to 9x higher throughput for concurrent requests. Version 0.12.0 improved performance by a further 4-5x via a memory layout optimization that consolidates fragmented per-layer blocks into contiguous 0.5-2MB physical blocks. Benchmarks on H100 GPUs show that DMA outperforms custom CUDA copy kernels in end-to-end throughput despite slightly higher transfer latency, because it minimizes interference with model computation.
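To make the summary concrete, here is a minimal sketch of how CPU offloading might be enabled through vLLM's KV-transfer configuration. The connector name, extra-config keys, and the `num_cpu_blocks` value shown are illustrative assumptions and may differ between vLLM versions; consult the docs for the release you run.

```python
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Illustrative configuration (field names follow vLLM's KV-transfer config;
# the connector name and extra-config keys are assumptions for this sketch).
kv_config = KVTransferConfig(
    kv_connector="OffloadingConnector",
    kv_role="kv_both",
    # Size of the CPU-side KV block pool; tune to available host memory.
    kv_connector_extra_config={"num_cpu_blocks": 8192},
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=kv_config,
)

prompt = "Summarize the history of the Roman Empire."
params = SamplingParams(max_tokens=64)

# The first request pays the full prefill cost. If the prompt's KV blocks
# are later evicted from GPU memory and offloaded to CPU, a repeated request
# can reload them over DMA instead of recomputing the prefill.
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```

Note that offloading is transparent to the caller: requests are submitted as usual, and the engine decides per block whether to recompute, reuse from GPU, or fetch from the CPU pool.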
Table of contents
- Motivation
- The New Offloading Connector
- Benefits of CPU Offloading via the Offloading Connector
- Evaluating GPU-CPU Transfer Techniques
- Changing vLLM’s Memory Layout
- End-to-end Evaluation of Copy Methods
- Evaluation Setup and Benchmark Code