Teams from IBM Research, Red Hat, and AMD developed a Triton-based attention backend for vLLM that achieves performance portability across NVIDIA, AMD, and Intel GPUs with a single kernel implementation. The backend is the default on AMD ROCm and serves as a fallback on other platforms. Key technical contributions include Q block grouping to optimize tile sizes, a parallel tiled softmax (a 3D kernel) for decode workloads, and persistent kernels that enable efficient CUDA graph reuse. Benchmarks show the Triton backend reaching 100.7% of FlashAttention 3 performance on NVIDIA H100 and delivering a 5.8× speedup on AMD MI300, all in roughly 800 lines of code versus FlashAttention 3's roughly 70,000. The post closes with a preview of paged attention implemented in Helion, a higher-level DSL from PyTorch.
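Because the backend is the default only on ROCm, users on other platforms have to opt in explicitly. As a minimal sketch: vLLM exposes the VLLM_ATTENTION_BACKEND environment variable for overriding its backend choice, though the exact backend name is version-dependent (the value below is an assumption based on recent v1 builds) and the model used here is just an example.

```python
import os

# Assumption: vLLM reads VLLM_ATTENTION_BACKEND at engine start-up to
# override its default attention backend. "TRITON_ATTN_VLLM_V1" is the
# name used by recent v1 builds; it may differ in other versions.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"

from vllm import LLM, SamplingParams

# The backend choice is orthogonal to the model; any supported model works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(
    ["Triton kernels are"],
    SamplingParams(max_tokens=16),
)
print(outputs[0].outputs[0].text)
```

Setting the variable before the engine is constructed is what matters; vLLM resolves the attention backend once when the model is loaded.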
Table of contents
Why Triton Helps vLLM
The Triton Attention Backend in vLLM
When the Triton Attention Backend Is Used
Writing a High-Performance Portable Paged Attention Kernel in Triton
Reminder: What the Paged Attention Kernel Does
Optimizing Tile Sizes for tl.dot Using Q Blocks
Adding Parallelization With Parallel Tiled Softmax
CUDA Graphs, Launch Grids, and GPU Execution Waves
From Variable Launch Grids to Persistent Kernels
Benchmarking Results
Preview: Paged Attention in Helion
Conclusion
Acknowledgments