IBM Research, Red Hat, and AMD teams developed a Triton-based attention backend for vLLM that achieves performance portability across NVIDIA, AMD, and Intel GPUs using a single kernel implementation. The backend is the default on AMD ROCm and serves as a fallback on other platforms. Key technical contributions include Q block grouping for tile size optimization, parallel tiled softmax (3D kernel) for decode workloads, and persistent kernels to enable efficient CUDA graph reuse. Benchmarks show the Triton backend reaches 100.7% of FlashAttention 3 performance on H100 and delivers a 5.8× speedup on AMD MI300, all with roughly 800 lines of code versus FlashAttention 3's ~70,000. A preview of paged attention implemented in Helion (a higher-level DSL from PyTorch) is also discussed.
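
To give a flavor of what a paged attention decode kernel looks like in Triton, here is a minimal single-head, single-query-token-per-sequence sketch. It is not the vLLM kernel: the function name, the block-table layout, and the tile sizes are assumptions made for clarity, and it omits the post's actual optimizations (Q block grouping, parallel tiled softmax, persistent kernels). It only illustrates the basic pattern of walking a block table over a paged KV cache with an online softmax.

```python
# Illustrative sketch only -- not the vLLM Triton backend kernel.
import triton
import triton.language as tl


@triton.jit
def paged_attention_decode_kernel(
    q_ptr,            # [num_seqs, HEAD_DIM] query for the current decode step
    k_cache_ptr,      # [num_blocks, BLOCK_SIZE, HEAD_DIM] paged key cache
    v_cache_ptr,      # [num_blocks, BLOCK_SIZE, HEAD_DIM] paged value cache
    block_table_ptr,  # [num_seqs, max_blocks_per_seq] logical -> physical block ids
    seq_lens_ptr,     # [num_seqs] context length of each sequence
    out_ptr,          # [num_seqs, HEAD_DIM] attention output
    scale,
    max_blocks_per_seq,
    BLOCK_SIZE: tl.constexpr,
    HEAD_DIM: tl.constexpr,
):
    seq_idx = tl.program_id(0)
    d = tl.arange(0, HEAD_DIM)
    offs = tl.arange(0, BLOCK_SIZE)

    q = tl.load(q_ptr + seq_idx * HEAD_DIM + d)       # query vector for this sequence
    seq_len = tl.load(seq_lens_ptr + seq_idx)

    # Running online-softmax statistics (FlashAttention-style).
    m_i = tl.full([1], float("-inf"), dtype=tl.float32)
    l_i = tl.zeros([1], dtype=tl.float32)
    acc = tl.zeros([HEAD_DIM], dtype=tl.float32)

    num_blocks = tl.cdiv(seq_len, BLOCK_SIZE)
    for b in range(0, num_blocks):
        # Translate the logical block index into a physical block of the paged cache.
        phys_block = tl.load(block_table_ptr + seq_idx * max_blocks_per_seq + b)
        mask = (b * BLOCK_SIZE + offs) < seq_len

        k = tl.load(
            k_cache_ptr + phys_block * BLOCK_SIZE * HEAD_DIM
            + offs[:, None] * HEAD_DIM + d[None, :],
            mask=mask[:, None], other=0.0,
        )
        v = tl.load(
            v_cache_ptr + phys_block * BLOCK_SIZE * HEAD_DIM
            + offs[:, None] * HEAD_DIM + d[None, :],
            mask=mask[:, None], other=0.0,
        )

        # Scaled dot-product scores for this tile of keys (multiply-and-sum instead
        # of tl.dot, since tl.dot needs tile dims >= 16 -- the reason the real kernel
        # groups queries into Q blocks).
        scores = tl.sum(q[None, :] * k, axis=1) * scale
        scores = tl.where(mask, scores, float("-inf"))

        # Online softmax update.
        m_new = tl.maximum(m_i, tl.max(scores, axis=0))
        alpha = tl.exp(m_i - m_new)
        p = tl.exp(scores - m_new)
        l_i = l_i * alpha + tl.sum(p, axis=0)
        acc = acc * alpha + tl.sum(p[:, None] * v, axis=0)
        m_i = m_new

    tl.store(out_ptr + seq_idx * HEAD_DIM + d, acc / l_i)
```

A launch grid of `(num_seqs,)` gives one program per sequence. The post's kernel goes further: it also parallelizes across heads and KV tiles (the parallel tiled softmax) and uses a fixed-size persistent launch grid so a captured CUDA graph can be reused across batch shapes.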

9 min read · From blog.vllm.ai
Table of contents
Why Triton Helps vLLM
The Triton Attention Backend in vLLM
When the Triton Attention Backend Is Used
Writing a High-Performance Portable Paged Attention Kernel in Triton
Reminder: What the Paged Attention Kernel Does
Optimizing Tile Sizes for tl.dot Using Q Blocks
Adding Parallelization With Parallel Tiled Softmax
CUDA Graphs, Launch Grids, and GPU Execution Waves
From Variable Launch Grids to Persistent Kernels
Benchmarking Results
Preview: Paged Attention in Helion
Conclusion
Acknowledgments
