Next-Level Inference: Why Your Single-Node vLLM Setup Needs Prefill-Decode Disaggregation

Prefill and decode phases in LLM inference compete for the same GPU resources, causing inter-token latency (ITL) spikes under load. This post demonstrates how to implement Prefill-Decode (PD) disaggregation on a single 8-GPU AMD Instinct MI300X node using AMD's MORI-IO RDMA-based KV cache connector in vLLM. By dedicating 4 GPUs to prefill and 4 to decode, the setup achieves 2.5x higher goodput than standard collocated serving and eliminates ITL violations entirely. Two transfer modes are covered: read mode (decode pulls the KV cache serially) and write mode (prefill pushes the KV cache concurrently, reducing time-to-first-token (TTFT) overhead). Benchmarks use Qwen3-235B-A22B-FP8 at 8 req/s with 2000-token prompts and 1000-token outputs. The post includes architecture details, trade-off analysis, setup instructions, and full reproducible configurations.
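As a rough illustration of the goodput metric mentioned above, here is a minimal sketch in Python. It assumes goodput means the number of completed requests per second that meet both a TTFT and an ITL service-level objective (SLO), and that a request violates the ITL SLO if any single inter-token gap exceeds the threshold. The summary does not state the post's exact definition or thresholds, so `RequestStats`, `meets_slo`, `goodput`, and all numbers below are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestStats:
    """Per-request latency measurements (hypothetical structure)."""
    ttft_s: float        # time to first token, in seconds
    itl_s: List[float]   # gaps between consecutive output tokens, in seconds


def meets_slo(req: RequestStats, ttft_slo_s: float, itl_slo_s: float) -> bool:
    # Assumption: a request is "good" only if TTFT and every ITL gap
    # stay within their thresholds.
    return req.ttft_s <= ttft_slo_s and all(gap <= itl_slo_s for gap in req.itl_s)


def goodput(requests: List[RequestStats], duration_s: float,
            ttft_slo_s: float, itl_slo_s: float) -> float:
    # Goodput = SLO-compliant requests completed per second of wall time.
    good = sum(1 for r in requests if meets_slo(r, ttft_slo_s, itl_slo_s))
    return good / duration_s


# Made-up measurements for three requests observed over a 2-second window.
sample = [
    RequestStats(ttft_s=0.4, itl_s=[0.02, 0.03]),  # meets both SLOs
    RequestStats(ttft_s=0.5, itl_s=[0.02, 0.25]),  # one ITL spike: violation
    RequestStats(ttft_s=1.5, itl_s=[0.02]),        # TTFT too slow: violation
]
result = goodput(sample, duration_s=2.0, ttft_slo_s=1.0, itl_slo_s=0.1)
print(result)
```

Under this definition, a single ITL spike caused by a prefill running next to an in-flight decode is enough to disqualify a request, which is why separating the two phases can raise goodput even when raw throughput is unchanged.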

#ai-inference

#vllm

May 10 • 18m read time • From vllm.ai

Table of contents

Introduction
Key Highlights
The Misconception: "Disaggregation is Only for Datacenter Clusters"
The Architecture: Serving with PD Disaggregation
Results: 2.5x Higher Goodput
Understanding the Trade-offs
How to Set It Up
Experimental Details
Conclusions and Way Forward
Appendix: Reproducible Configurations
Acknowledgements
References
Disclaimer
