vLLM 0.11.0 introduced KV cache offloading to CPU memory, significantly improving LLM inference throughput by avoiding expensive recomputation of cached prefixes. The feature uses an asynchronous connector API with DMA-based transfers, achieving 2-22x faster time-to-first-token for cached prompts and up to 9x higher throughput for concurrent requests.
Table of contents

- Motivation
- The New Offloading Connector
- Benefits of CPU Offloading via the Offloading Connector
- Evaluating GPU-CPU Transfer Techniques
- Changing vLLM's Memory Layout
- End-to-end Evaluation of Copy Methods
- Evaluation Setup and Benchmark Code