vLLM now offers 7 attention backends on AMD ROCm, moving beyond simple porting to hardware-aware co-design. The flagship ROCM_AITER_FA backend uses 3-path routing (Prefill, Extend, Decode) with specialized kernels for each workload type, a preshuffled KV cache layout aligned to AMD CDNA architecture, and batch…
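The 3-path routing idea can be illustrated with a minimal sketch. This is not vLLM's actual implementation; the `Request` fields and `route` function are hypothetical, assuming requests are classified by how many new tokens they add and whether they already have cached context:

```python
from dataclasses import dataclass
from enum import Enum

class Path(Enum):
    PREFILL = "prefill"  # fresh request, no KV cache yet
    EXTEND = "extend"    # multiple new tokens appended to an existing cache
    DECODE = "decode"    # single-token autoregressive step

@dataclass
class Request:
    num_new_tokens: int     # tokens to process this step (hypothetical field)
    num_cached_tokens: int  # tokens already in the KV cache (hypothetical field)

def route(req: Request) -> Path:
    """Pick one of the three kernel paths for a request in a mixed batch."""
    if req.num_new_tokens == 1:
        return Path.DECODE
    if req.num_cached_tokens == 0:
        return Path.PREFILL
    return Path.EXTEND

# A mixed batch: one fresh prompt, one chunked continuation, one decode step.
batch = [Request(128, 0), Request(16, 512), Request(1, 640)]
print([route(r).value for r in batch])  # → ['prefill', 'extend', 'decode']
```

Dispatching each group to a kernel specialized for its shape (long query/no cache, short query/long cache, or single query) is the co-design point the post highlights.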

15 min read · From blog.vllm.ai
Table of contents

- Introduction
- The Challenge: Mixed Workloads in Every Batch
- Other MHA Backends
- The ROCM_AITER_FA Backend: Kernel Orchestration for AMD
- The AITER MLA Backends: Optimized for DeepSeek
- Performance Benchmarks
- The Collaboration: vLLM + AITER
- Get Started
- Conclusion
- Acknowledgements
- Resources
- Disclaimer
