IBM Research, Red Hat, and AMD teams developed a Triton-based attention backend for vLLM that achieves performance portability across NVIDIA, AMD, and Intel GPUs using a single kernel implementation. The backend is the default on AMD ROCm and serves as a fallback on other platforms. Key technical contributions include Q block grouping for tile size optimization, parallel tiled softmax (3D kernel) for decode workloads, and persistent kernels to enable efficient CUDA graph reuse. Benchmarks show the Triton backend reaches 100.7% of FlashAttention 3 performance on H100 and delivers a 5.8× speedup on AMD MI300, all with roughly 800 lines of code versus FlashAttention 3's ~70,000. A preview of paged attention implemented in Helion (a higher-level DSL from PyTorch) is also discussed.
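
To give a flavor of what a paged attention decode kernel looks like in Triton, here is a minimal single-head, single-query-token-per-sequence sketch. It is not the vLLM kernel: the function name, the block-table layout, and the tile sizes are assumptions made for clarity, and it omits the post's actual optimizations (Q block grouping, parallel tiled softmax, persistent kernels). It only illustrates the basic pattern of walking a block table over a paged KV cache with an online softmax.

```python
# Illustrative sketch only -- not the vLLM Triton backend kernel.
import triton
import triton.language as tl


@triton.jit
def paged_attention_decode_kernel(
    q_ptr,            # [num_seqs, HEAD_DIM] query for the current decode step
    k_cache_ptr,      # [num_blocks, BLOCK_SIZE, HEAD_DIM] paged key cache
    v_cache_ptr,      # [num_blocks, BLOCK_SIZE, HEAD_DIM] paged value cache
    block_table_ptr,  # [num_seqs, max_blocks_per_seq] logical -> physical block ids
    seq_lens_ptr,     # [num_seqs] context length of each sequence
    out_ptr,          # [num_seqs, HEAD_DIM] attention output
    scale,
    max_blocks_per_seq,
    BLOCK_SIZE: tl.constexpr,
    HEAD_DIM: tl.constexpr,
):
    seq_idx = tl.program_id(0)
    d = tl.arange(0, HEAD_DIM)
    offs = tl.arange(0, BLOCK_SIZE)

    q = tl.load(q_ptr + seq_idx * HEAD_DIM + d)       # query vector for this sequence
    seq_len = tl.load(seq_lens_ptr + seq_idx)

    # Running online-softmax statistics (FlashAttention-style).
    m_i = tl.full([1], float("-inf"), dtype=tl.float32)
    l_i = tl.zeros([1], dtype=tl.float32)
    acc = tl.zeros([HEAD_DIM], dtype=tl.float32)

    num_blocks = tl.cdiv(seq_len, BLOCK_SIZE)
    for b in range(0, num_blocks):
        # Translate the logical block index into a physical block of the paged cache.
        phys_block = tl.load(block_table_ptr + seq_idx * max_blocks_per_seq + b)
        mask = (b * BLOCK_SIZE + offs) < seq_len

        k = tl.load(
            k_cache_ptr + phys_block * BLOCK_SIZE * HEAD_DIM
            + offs[:, None] * HEAD_DIM + d[None, :],
            mask=mask[:, None], other=0.0,
        )
        v = tl.load(
            v_cache_ptr + phys_block * BLOCK_SIZE * HEAD_DIM
            + offs[:, None] * HEAD_DIM + d[None, :],
            mask=mask[:, None], other=0.0,
        )

        # Scaled dot-product scores for this tile of keys (multiply-and-sum instead
        # of tl.dot, since tl.dot needs tile dims >= 16 -- the reason the real kernel
        # groups queries into Q blocks).
        scores = tl.sum(q[None, :] * k, axis=1) * scale
        scores = tl.where(mask, scores, float("-inf"))

        # Online softmax update.
        m_new = tl.maximum(m_i, tl.max(scores, axis=0))
        alpha = tl.exp(m_i - m_new)
        p = tl.exp(scores - m_new)
        l_i = l_i * alpha + tl.sum(p, axis=0)
        acc = acc * alpha + tl.sum(p[:, None] * v, axis=0)
        m_i = m_new

    tl.store(out_ptr + seq_idx * HEAD_DIM + d, acc / l_i)
```

A launch grid of `(num_seqs,)` gives one program per sequence. The post's kernel goes further: it also parallelizes across heads and KV tiles (the parallel tiled softmax) and uses a fixed-size persistent launch grid so a captured CUDA graph can be reused across batch shapes.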

9 min read · From blog.vllm.ai
Table of contents
Why Triton Helps vLLM
The Triton Attention Backend in vLLM
When the Triton Attention Backend Is Used
Writing a High-Performance Portable Paged Attention Kernel in Triton
Reminder: What the Paged Attention Kernel Does
Optimizing Tile Sizes for tl.dot Using Q Blocks
Adding Parallelization With Parallel Tiled Softmax
CUDA Graphs, Launch Grids, and GPU Execution Waves
From Variable Launch Grids to Persistent Kernels
Benchmarking Results
Preview: Paged Attention in Helion
Conclusion
Acknowledgments
