IBM Research, Red Hat, and AMD teams developed a Triton-based attention backend for vLLM that achieves performance portability across NVIDIA, AMD, and Intel GPUs using a single kernel implementation. The backend is the default on AMD ROCm and serves as a fallback on other platforms. Key technical contributions include Q block tiling to optimize tile sizes for tl.dot, parallelization via a parallel tiled softmax, and a move from variable launch grids to persistent kernels for better interaction with CUDA graphs.
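To make the summary concrete, below is a minimal, CPU-only Python sketch of the computation a paged attention kernel performs: gathering a sequence's keys and values from a block-structured KV cache via a block table, then running scaled dot-product attention for one query vector. All names (`BLOCK_SIZE`, `block_table`, `kv_cache_k`) are illustrative assumptions, not vLLM's actual API; the real kernel is written in Triton and runs tiled on the GPU.

```python
import math

BLOCK_SIZE = 4  # tokens stored per physical KV-cache block (assumed for illustration)
HEAD_DIM = 4    # size of each key/value/query vector

def paged_attention(q, kv_cache_k, kv_cache_v, block_table, seq_len):
    """Attention for one query over a paged KV cache.

    q:            query vector of length HEAD_DIM for the current decode step
    kv_cache_k/v: list of physical blocks; each block is BLOCK_SIZE vectors
    block_table:  maps logical block index -> physical block index
    seq_len:      number of valid tokens in the sequence
    """
    # Gather this sequence's keys/values: logical position -> physical slot.
    ks, vs = [], []
    for pos in range(seq_len):
        phys = block_table[pos // BLOCK_SIZE]
        off = pos % BLOCK_SIZE
        ks.append(kv_cache_k[phys][off])
        vs.append(kv_cache_v[phys][off])

    # Scaled dot-product scores, then a numerically stable softmax.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(HEAD_DIM) for k in ks]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of values is the attention output.
    return [sum(w * v[d] for w, v in zip(weights, vs)) for d in range(HEAD_DIM)]
```

The indirection through `block_table` is the essence of paged attention: the KV cache is allocated in fixed-size blocks rather than one contiguous buffer per sequence, and the kernel resolves logical token positions to physical blocks at read time.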

From blog.vllm.ai (9 min read)
Table of contents
Why Triton Helps vLLM
The Triton Attention Backend in vLLM
When the Triton Attention Backend Is Used
Writing a High-Performance Portable Paged Attention Kernel in Triton
Reminder: What the Paged Attention Kernel Does
Optimizing Tile Sizes for tl.dot Using Q Blocks
Adding Parallelization With Parallel Tiled Softmax
CUDA Graphs, Launch Grids, and GPU Execution Waves
From Variable Launch Grids to Persistent Kernels
Benchmarking Results
Preview: Paged Attention in Helion
Conclusion
Acknowledgments
