LLM inference has two fundamentally different phases: prefill (compute-bound, 90-95% GPU utilization) and decode (memory-bound, 20-40% GPU utilization). Running both on the same GPU pool wastes resources and causes latency interference. Disaggregated inference solves this by routing each phase to separate, purpose-sized GPU pools connected via high-speed networking. The KV-cache produced during prefill is transferred over RDMA to decode workers, replacing queuing delays with predictable transfer latency. Production deployments at Perplexity, Meta, and LinkedIn report 2-6x throughput gains and 15-40% infrastructure cost reductions. The approach is not universally beneficial — short prompts, high prefix cache hit rates, small GPU clusters (<16 GPUs), and slow networks can make disaggregation counterproductive. A practical decision framework covers prefill-to-decode time ratio, KV-cache transfer size, prefix cache hit rate, GPU count, and network capabilities. Tools like vLLM, SGLang, TensorRT-LLM, and NVIDIA Dynamo all support disaggregated serving natively.
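To make the decision framework concrete, here is a minimal back-of-the-envelope sketch in Python. It estimates the KV-cache that must move per request and checks the criteria listed above. The model dimensions, link speed, and every threshold except the &lt;16 GPU cutoff are illustrative assumptions for the example, not values prescribed by the deployments cited here.

```python
# Illustrative sketch only: rough numbers for the disaggregation decision.
# Model shape, link speed, and thresholds below are assumptions, not recommendations.

def kv_cache_bytes(prompt_tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Per token, each layer stores a K and a V vector of kv_heads * head_dim elements."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * prompt_tokens


def should_disaggregate(prefill_ms: float, decode_ms: float,
                        prefix_hit_rate: float, gpu_count: int,
                        kv_bytes: int, link_gbps: float) -> bool:
    """Heuristic mirroring the framework: split prefill and decode only when prefill
    is a meaningful share of request time, the prefix cache rarely absorbs it, the
    cluster is large enough to partition, and the KV transfer stays cheap."""
    transfer_ms = kv_bytes * 8 / (link_gbps * 1e9) * 1e3
    return (prefill_ms / decode_ms > 0.3          # assumed ratio threshold
            and prefix_hit_rate < 0.6             # assumed hit-rate threshold
            and gpu_count >= 16                   # small clusters rarely benefit
            and transfer_ms < 0.2 * prefill_ms)   # transfer must stay a small tax


# Example: 8k-token prompt on a hypothetical 70B-class model over a 400 Gbps RDMA link.
kv = kv_cache_bytes(prompt_tokens=8192, layers=80, kv_heads=8, head_dim=128)
print(f"KV-cache to move: {kv / 2**30:.2f} GiB")
print(should_disaggregate(prefill_ms=1200, decode_ms=3000, prefix_hit_rate=0.2,
                          gpu_count=32, kv_bytes=kv, link_gbps=400))
```

With these assumed numbers the transfer takes roughly 54 ms against 1.2 s of prefill, so the heuristic says disaggregation is worth evaluating; shrink the prompt or the cluster and it flips to no.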
Table of contents
- The two phases of inference are not the same workload
- What monolithic serving actually costs you
- Splitting the inference path in two
- The tax you pay: moving the KV-cache
- What the production stack looks like today
- When disaggregation makes things worse
- The cost arithmetic
- Should you disaggregate? A decision framework
- What comes next