Prefill and decode phases in LLM inference compete for the same GPU resources, causing inter-token latency (ITL) spikes under load. This post demonstrates how to implement Prefill-Decode (PD) disaggregation on a single 8-GPU AMD Instinct MI300X node using AMD's MORI-IO RDMA-based KV cache connector in vLLM. By dedicating 4 GPUs to prefill and 4 to decode, the setup achieves 2.5x higher goodput compared to standard collocated serving, with ITL violations eliminated entirely. Two transfer modes are covered: read mode (decode pulls KV cache serially) and write mode (prefill pushes KV cache concurrently, reducing TTFT overhead). Benchmarks use Qwen3-235B-A22B-FP8 at 8 req/s with 2000-token prompts and 1000-token outputs. The post includes architecture details, trade-off analysis, setup instructions, and full reproducible configurations.
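To make the headline metric concrete, here is a minimal sketch of how goodput could be computed from per-request latency measurements, along with the token load implied by the benchmark settings (8 req/s, 2000-token prompts, 1000-token outputs). It assumes the common definition of goodput as the rate of completed requests that meet both a TTFT and an ITL target. The summary above does not state the post's exact SLO definition, so the per-token ITL check, the threshold values, and the names `RequestStats`, `meets_slo`, `goodput`, and `offered_load` are all illustrative assumptions, not taken from vLLM or the post.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RequestStats:
    """Latency measurements for one completed request (hypothetical container)."""
    ttft_s: float        # time to first token, in seconds
    itl_s: List[float]   # inter-token latencies, in seconds


def meets_slo(req: RequestStats, ttft_slo_s: float, itl_slo_s: float) -> bool:
    # Assumption: a request passes only if TTFT and every single ITL sample
    # are within their targets. A percentile-based check is another option.
    return req.ttft_s <= ttft_slo_s and all(x <= itl_slo_s for x in req.itl_s)


def goodput(requests: List[RequestStats], duration_s: float,
            ttft_slo_s: float, itl_slo_s: float) -> float:
    """Requests per second that satisfied both latency targets."""
    ok = sum(1 for r in requests if meets_slo(r, ttft_slo_s, itl_slo_s))
    return ok / duration_s


def offered_load(req_per_s: float, prompt_tokens: int,
                 output_tokens: int) -> Tuple[float, float]:
    """Return (prefill tokens/s, decode tokens/s) implied by the request rate."""
    return req_per_s * prompt_tokens, req_per_s * output_tokens


if __name__ == "__main__":
    # Benchmark settings quoted in the summary: 8 req/s, 2000 in, 1000 out.
    prefill_tps, decode_tps = offered_load(8, 2000, 1000)
    print(f"prefill load: {prefill_tps:.0f} tok/s, decode load: {decode_tps:.0f} tok/s")

    # Made-up measurements and thresholds, for illustration only.
    sample = [
        RequestStats(ttft_s=0.5, itl_s=[0.02, 0.03]),  # meets both targets
        RequestStats(ttft_s=0.5, itl_s=[0.02, 0.20]),  # one ITL spike
        RequestStats(ttft_s=3.0, itl_s=[0.02]),        # TTFT too slow
    ]
    print(goodput(sample, duration_s=1.0, ttft_slo_s=1.0, itl_slo_s=0.05))
```

Under this definition a single ITL spike disqualifies a request, which is why removing prefill interference from the decode GPUs can raise goodput even when raw throughput is unchanged.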

18 min read, from vllm.ai
Table of contents
Introduction
Key Highlights
The Misconception: "Disaggregation is Only for Datacenter Clusters"
The Architecture: Serving with PD Disaggregation
Results: 2.5x Higher Goodput
Understanding the Trade-offs
How to Set It Up
Experimental Details
Conclusions and Way Forward
Appendix: Reproducible Configurations
Acknowledgements
References
Disclaimer
