vLLM 0.11.0 introduced KV cache offloading to CPU memory, significantly improving LLM inference throughput by avoiding expensive recomputation. The feature uses an asynchronous connector API with DMA-based transfers, achieving 2-22x faster time-to-first-token for cached prompts and up to 9x throughput gains for concurrent requests.
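As a rough illustration of how such a connector is enabled, the sketch below shows a configuration in the style of vLLM's `KVTransferConfig` API. The connector name, role value, and extra-config keys here are assumptions and may differ between vLLM releases; consult the release notes for the exact fields.

```
# Hypothetical sketch: enabling CPU KV-cache offloading in vLLM.
# Field names below are assumptions, not confirmed by this article.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="OffloadingConnector",  # CPU-offloading connector (assumed name)
        kv_role="kv_both",                   # both save KV blocks to and load them from CPU
        kv_connector_extra_config={
            "num_cpu_blocks": 8192,          # size of the CPU-side KV block pool (assumed key)
        },
    ),
)
```

Because the transfers run asynchronously over the connector API, saving and loading KV blocks can overlap with model execution rather than stalling the forward pass.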

From blog.vllm.ai
Table of contents

- Motivation
- The New Offloading Connector
- Benefits of CPU Offloading via the Offloading Connector
- Evaluating GPU-CPU Transfer Techniques
- Changing vLLM’s Memory Layout
- End-to-end Evaluation of Copy Methods
- Evaluation Setup and Benchmark Code
