LLM inference has two fundamentally different phases: prefill (compute-bound, 90-95% GPU utilization) and decode (memory-bound, 20-40% GPU utilization). Running both on the same GPU pool wastes resources and causes latency interference. Disaggregated inference solves this by routing each phase to separate, purpose-sized GPU pools connected via high-speed networking. The KV-cache produced during prefill is transferred over RDMA to decode workers, replacing queuing delays with predictable transfer latency. Production deployments at Perplexity, Meta, and LinkedIn report 2-6x throughput gains and 15-40% infrastructure cost reductions. The approach is not universally beneficial — short prompts, high prefix cache hit rates, small GPU clusters (<16 GPUs), and slow networks can make disaggregation counterproductive. A practical decision framework covers prefill-to-decode time ratio, KV-cache transfer size, prefix cache hit rate, GPU count, and network capabilities. Tools like vLLM, SGLang, TensorRT-LLM, and NVIDIA Dynamo all support disaggregated serving natively.
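To make the decision framework concrete, here is a minimal back-of-the-envelope sketch in Python. It estimates the KV-cache that must move per request and checks the criteria listed above. The model dimensions, link speed, and every threshold except the &lt;16 GPU cutoff are illustrative assumptions for the example, not values prescribed by the deployments cited here.

```python
# Illustrative sketch only: rough numbers for the disaggregation decision.
# Model shape, link speed, and thresholds below are assumptions, not recommendations.

def kv_cache_bytes(prompt_tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Per token, each layer stores a K and a V vector of kv_heads * head_dim elements."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * prompt_tokens


def should_disaggregate(prefill_ms: float, decode_ms: float,
                        prefix_hit_rate: float, gpu_count: int,
                        kv_bytes: int, link_gbps: float) -> bool:
    """Heuristic mirroring the framework: split prefill and decode only when prefill
    is a meaningful share of request time, the prefix cache rarely absorbs it, the
    cluster is large enough to partition, and the KV transfer stays cheap."""
    transfer_ms = kv_bytes * 8 / (link_gbps * 1e9) * 1e3
    return (prefill_ms / decode_ms > 0.3          # assumed ratio threshold
            and prefix_hit_rate < 0.6             # assumed hit-rate threshold
            and gpu_count >= 16                   # small clusters rarely benefit
            and transfer_ms < 0.2 * prefill_ms)   # transfer must stay a small tax


# Example: 8k-token prompt on a hypothetical 70B-class model over a 400 Gbps RDMA link.
kv = kv_cache_bytes(prompt_tokens=8192, layers=80, kv_heads=8, head_dim=128)
print(f"KV-cache to move: {kv / 2**30:.2f} GiB")
print(should_disaggregate(prefill_ms=1200, decode_ms=3000, prefix_hit_rate=0.2,
                          gpu_count=32, kv_bytes=kv, link_gbps=400))
```

With these assumed numbers the transfer takes roughly 54 ms against 1.2 s of prefill, so the heuristic says disaggregation is worth evaluating; shrink the prompt or the cluster and it flips to no.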
Table of contents
- The two phases of inference are not the same workload
- What monolithic serving actually costs you
- Splitting the inference path in two
- The tax you pay: moving the KV-cache
- What the production stack looks like today
- When disaggregation makes things worse
- The cost arithmetic
- Should you disaggregate? A decision framework
- What comes next