Tuning Flash Attention for Peak Performance in NVIDIA CUDA Tile

A deep dive into implementing and optimizing Flash Attention with NVIDIA's cuTile Python library on Blackwell GPUs. It covers the full kernel implementation, including online softmax, causal masking, and grouped-query attention. The core of the post is a “trap and rescue” optimization journey: naively increasing the tile size from 64×64 to 256×128 degrades performance by 18-43%, but applying fast math (flush_to_zero, approximate division), K-loop splitting for causal masks, block ID remapping, and autotuning recovers the loss and exceeds the baseline by up to 1.66x. Each optimization step is backed by Nsight Compute profiling data showing register usage, occupancy, and compute/memory throughput.
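As a rough illustration of the online softmax idea mentioned above, here is a minimal pure-Python sketch. It is not the post's cuTile kernel and uses no cuTile API: the function names, the scalar "values", and the `block` parameter are all assumptions made for illustration. It processes scores block by block while keeping a running maximum, a running normalizer, and a rescaled accumulator, which is the general trick that lets attention avoid materializing the full score row.

```python
import math

def online_softmax_weighted_sum(scores, values, block=2):
    """Hypothetical helper: softmax(scores) . values, computed one block at a time."""
    m = -math.inf  # running maximum of the scores seen so far
    l = 0.0        # running sum of exp(score - m)
    acc = 0.0      # running weighted sum of values, scaled consistently with l
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        # Rescale earlier partial results to the new maximum.
        # On the first block m is -inf, so the factor is exp(-inf) = 0.0.
        scale = math.exp(m - m_new)
        p = [math.exp(s - m_new) for s in s_blk]
        l = l * scale + sum(p)
        acc = acc * scale + sum(pi * vi for pi, vi in zip(p, v_blk))
        m = m_new
    return acc / l

def reference_softmax_weighted_sum(scores, values):
    """Two-pass reference that sees all scores at once."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)
```

Under these assumptions, both functions should agree to floating-point tolerance for any block size. In a real kernel the same update would be applied to tiles of K and V rather than to Python lists of scalars.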

#cuda

Mar 04 • 21m read time • From developer.nvidia.com

Table of contents

What is attention?
Understanding online softmax
Causal attention and grouped-query attention
Part 1: The flash attention kernel in CUDA Tile
Launching the kernel: Host-side code
Part 2: The “trap and rescue” optimization journey
1. The trap of larger tiles
2. The rescue with fast math
3. K-loop split
4. ProgramId remapping
5. Autotuning
Summary: The optimization stack
Getting started
