vLLM now offers 7 attention backends on AMD ROCm, moving beyond simple porting to hardware-aware co-design. The flagship ROCM_AITER_FA backend uses 3-path routing (Prefill, Extend, Decode) with specialized kernels for each workload type, a preshuffled KV cache layout aligned to AMD CDNA architecture, and batch reordering—delivering 2.7–4.4x higher throughput than the legacy ROCM_ATTN backend on MHA models like Qwen3-235B. For DeepSeek's MLA architecture, ROCM_AITER_MLA and ROCM_AITER_TRITON_MLA use a hand-tuned assembly decode kernel (mla_decode_fwd) that achieves 1.2–1.5x higher TPS than the Triton baseline. Benchmarks cover MI300X, MI325X, and MI355X GPUs. The recommended setup is simply setting VLLM_ROCM_USE_AITER=1 and letting vLLM auto-select the optimal backend.
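The recommended setup can be sketched as a launch command; this is a minimal example assuming the standard `vllm serve` CLI, and the model name and parallelism flags are illustrative, not taken from the post:

```shell
# Enable the AITER-backed attention paths on ROCm; vLLM then
# auto-selects the optimal backend (e.g. ROCM_AITER_FA for MHA
# models, ROCM_AITER_MLA for MLA models such as DeepSeek).
export VLLM_ROCM_USE_AITER=1

# Illustrative launch; model and --tensor-parallel-size are assumptions.
vllm serve Qwen/Qwen3-235B-A22B --tensor-parallel-size 8
```

No per-backend flag is needed in the common case; setting the single environment variable is the documented entry point.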

15 min read · From blog.vllm.ai
Table of contents

- Introduction
- The Challenge: Mixed Workloads in Every Batch
- Other MHA Backends
- The ROCM_AITER_FA Backend: Kernel Orchestration for AMD
- The AITER MLA Backends: Optimized for DeepSeek
- Performance Benchmarks
- The Collaboration: vLLM + AITER
- Get Started
- Conclusion
- Acknowledgements
- Resources
- Disclaimer
