PegaFlow is an external KV cache service for vLLM, implemented as a standalone Rust process that moves KV cache lifetime outside the inference engine. It pools cache across local instances and remote nodes using a three-level hierarchy: pinned host DRAM, RDMA-accessible remote memory, and SSD via io_uring. Key results include 2.15x faster vLLM startup, 56% higher throughput for multi-instance Qwen3-8B sharing one host cache, 72% higher throughput for DeepSeek-V3.2 MLA with TP8 via logical KV deduplication, and 194 GB/s average remote-read throughput over RDMA. PegaFlow integrates through vLLM's existing kv_transfer_config connector interface, with no changes to vLLM source code. It also uses HyperLogLog to estimate the theoretical hit-rate ceiling, so operators can diagnose cache efficiency.
Table of contents
Why KV cache needs a process boundary
Faster restarts with external cache ownership
Rust data path and tail-latency stability
Pooling cache across instances and nodes
Results
Three-level cache hierarchy
Measuring distance from the theoretical hit-rate ceiling
Integrating with vLLM through the external connector
Quick start
Public reference benchmark
Try PegaFlow
Acknowledgements