vLLM now offers 7 attention backends on AMD ROCm, moving beyond simple porting to hardware-aware co-design. The flagship ROCM_AITER_FA backend uses 3-path routing (Prefill, Extend, Decode) with specialized kernels for each workload type, a preshuffled KV cache layout aligned to AMD CDNA architecture, and batch reordering—delivering 2.7–4.4x higher throughput than the legacy ROCM_ATTN backend on MHA models like Qwen3-235B. For DeepSeek's MLA architecture, ROCM_AITER_MLA and ROCM_AITER_TRITON_MLA use a hand-tuned assembly decode kernel (mla_decode_fwd) that achieves 1.2–1.5x higher TPS than the Triton baseline. Benchmarks cover MI300X, MI325X, and MI355X GPUs. The recommended setup is simply setting VLLM_ROCM_USE_AITER=1 and letting vLLM auto-select the optimal backend.
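The recommended setup can be sketched as a launch command; this is a minimal example assuming the standard `vllm serve` CLI, and the model name and parallelism flags are illustrative, not taken from the post:

```shell
# Enable the AITER-backed attention paths on ROCm; vLLM then
# auto-selects the optimal backend (e.g. ROCM_AITER_FA for MHA
# models, ROCM_AITER_MLA for MLA models such as DeepSeek).
export VLLM_ROCM_USE_AITER=1

# Illustrative launch; model and --tensor-parallel-size are assumptions.
vllm serve Qwen/Qwen3-235B-A22B --tensor-parallel-size 8
```

No per-backend flag is needed in the common case; setting the single environment variable is the documented entry point.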

15 min read · From blog.vllm.ai
Table of contents

- Introduction
- The Challenge: Mixed Workloads in Every Batch
- Other MHA Backends
- The ROCM_AITER_FA Backend: Kernel Orchestration for AMD
- The AITER MLA Backends: Optimized for DeepSeek
- Performance Benchmarks
- The Collaboration: vLLM + AITER
- Get Started
- Conclusion
- Acknowledgements
- Resources
- Disclaimer
