TL;DR: In collaboration with Novita AI, PegaFlow integrates with vLLM as an external KV cache service for LLM inference, implemented as a standalone Rust process.

PegaFlow is an external KV cache service for vLLM, implemented as a standalone Rust process that moves KV cache lifetime outside the inference engine. It pools cache across local instances and remote nodes using a three-level hierarchy: pinned host DRAM, RDMA-accessible remote memory, and SSD accessed via io_uring. Key results include 2.15x faster vLLM startup, 56% higher throughput for multiple Qwen3-8B instances sharing one host cache, 72% higher throughput for DeepSeek-V3.2 MLA with TP8 through logical KV deduplication, and 194 GB/s average remote-read throughput over RDMA. Integration uses vLLM's existing kv_transfer_config connector interface and requires no changes to vLLM source code. PegaFlow also provides a HyperLogLog-based estimate of the theoretical hit-rate ceiling, so operators can see how far the observed hit rate falls short of what the workload allows.
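To make the hit-rate ceiling idea concrete, here is a minimal Python sketch. It is not PegaFlow's implementation (which is written in Rust), and it rests on one assumption that the text does not spell out: the ceiling is taken to be the hit rate of an unbounded cache, where only the first lookup of each distinct KV block misses, so ceiling = 1 - distinct / total. The `HyperLogLog` class and the `hit_rate_ceiling` function are illustrative names, not PegaFlow APIs.

```python
import hashlib
import math
from typing import Iterable


class HyperLogLog:
    """Small HyperLogLog cardinality estimator (illustrative, not PegaFlow code)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: bytes) -> None:
        # 64-bit hash: top p bits pick a register, the rest feed the rank.
        h = int.from_bytes(hashlib.blake2b(item, digest_size=8).digest(), "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self) -> float:
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw


def hit_rate_ceiling(block_hashes: Iterable[bytes]) -> float:
    """Estimate 1 - distinct/total over a stream of KV block lookups.

    Assumption: with an unbounded cache, each distinct block misses exactly once.
    """
    hll = HyperLogLog()
    total = 0
    for block_hash in block_hashes:
        hll.add(block_hash)
        total += 1
    if total == 0:
        return 0.0
    distinct = min(hll.count(), total)  # the estimate cannot exceed the stream length
    return max(0.0, 1.0 - distinct / total)


if __name__ == "__main__":
    # Synthetic stream: 20,000 lookups over 5,000 distinct blocks.
    lookups = (f"block-{i % 5000}".encode() for i in range(20000))
    print(f"estimated ceiling: {hit_rate_ceiling(lookups):.3f}")
```

An operator would compare this estimate against the hit rate the cache actually reports. A large gap points at capacity or eviction problems, while a small gap means the workload itself has little reuse to exploit. The appeal of HyperLogLog here is that the sketch uses a fixed number of registers regardless of how many blocks pass through it.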

vLLM x Novita AI: PegaFlow for Production-Grade External KV Cache

Faster restarts with external cache ownership

Rust data path and tail-latency stability

Measuring distance from the theoretical hit-rate ceiling

Integrating with vLLM through the external connector
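
As a rough illustration of connector-based integration, the sketch below assembles a `vllm serve` command that passes a `--kv-transfer-config` JSON payload. The `kv_connector`, `kv_connector_module_path`, and `kv_role` fields are part of vLLM's KV transfer configuration, but the connector class name and module path shown here are hypothetical placeholders, since the actual PegaFlow values are not given in this text. The helper `build_serve_command` is likewise an illustrative name.

```python
import json
import shlex


def build_serve_command(model: str, connector: str, module_path: str) -> str:
    """Build a `vllm serve` command line that loads an external KV connector."""
    kv_transfer_config = {
        "kv_connector": connector,                # connector class name
        "kv_connector_module_path": module_path,  # Python module that defines it
        "kv_role": "kv_both",                     # instance both saves and loads KV
    }
    args = [
        "vllm", "serve", shlex.quote(model),
        "--kv-transfer-config", shlex.quote(json.dumps(kv_transfer_config)),
    ]
    return " ".join(args)


if __name__ == "__main__":
    # Hypothetical names: substitute the values from PegaFlow's documentation.
    print(build_serve_command(
        model="Qwen/Qwen3-8B",
        connector="ExamplePegaFlowConnector",    # placeholder, not a confirmed name
        module_path="example_pegaflow.connector",  # placeholder, not a confirmed path
    ))
```

Because the connector is loaded by module path at startup, a setup along these lines leaves the vLLM installation itself unchanged, which matches the no-source-modification claim above.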