TLX Block Attention is a Triton kernel for NVIDIA Blackwell GPUs that uses compile-time knowledge of block-diagonal attention patterns to remove algorithmic overhead present in general-purpose attention implementations. Because every Q tile attends to exactly one K/V tile, the kernel drops multi-tile iteration, online softmax correction, logsumexp tensor storage, and the Di preprocessing kernel. The forward pass uses a warp-specialized 5-stage pipeline and the backward pass a 7-stage pipeline, both with asymmetric register allocation across hardware units. On B200 GPUs, the kernel achieves ~1.85× forward and ~2.50× backward speedup over Flash Attention v2. Fusing rotary embeddings into the backward epilogue yields a 3.54× combined speedup and reduces the number of BF16 quantization points from 2 to 1, improving gradient accuracy. The kernel is open-sourced and targets production recommendation and ads ranking workloads with batch sizes of 1152, sequences up to ~4k tokens, and ~70% attention sparsity.
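The single-tile property described above can be illustrated with a minimal sketch in plain Python. This is not the Triton kernel and not the authors' code: the function names (`attend`, `block_diagonal_attention`, `masked_full_attention`) and the toy sizes are assumptions chosen for illustration. It shows that when each query block attends only to its own key/value block, one ordinary softmax per block gives the same result as full attention under a block-diagonal mask, so no running maximum or rescaling across tiles is needed.

```python
import math


def softmax(row):
    # Ordinary single-pass softmax; masked entries (-inf) become exactly 0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q_rows, k_rows, v_rows, mask=None):
    # Reference attention: softmax(Q K^T / sqrt(d)) V, with an optional boolean mask.
    scale = 1.0 / math.sqrt(len(q_rows[0]))
    out = []
    for i, q in enumerate(q_rows):
        scores = []
        for j, k in enumerate(k_rows):
            s = sum(a * b for a, b in zip(q, k)) * scale
            if mask is not None and not mask[i][j]:
                s = float("-inf")
            scores.append(s)
        p = softmax(scores)
        width = len(v_rows[0])
        out.append([sum(p[j] * v_rows[j][c] for j in range(len(v_rows)))
                    for c in range(width)])
    return out


def block_diagonal_attention(q, k, v, block):
    # Each Q block sees exactly one K/V block, so a single softmax per block
    # is enough (hypothetical helper, assuming equal-sized blocks).
    out = []
    for start in range(0, len(q), block):
        end = start + block
        out.extend(attend(q[start:end], k[start:end], v[start:end]))
    return out


def masked_full_attention(q, k, v, block):
    # General-purpose formulation: full score matrix plus a block-diagonal mask.
    n = len(q)
    mask = [[i // block == j // block for j in range(n)] for i in range(n)]
    return attend(q, k, v, mask)


def toy_inputs(n=8, d=4):
    # Deterministic toy data; the sizes are arbitrary.
    q = [[math.sin(i * d + c) for c in range(d)] for i in range(n)]
    k = [[math.cos(i * d + c) for c in range(d)] for i in range(n)]
    v = [[math.sin(0.5 * (i * d + c)) for c in range(d)] for i in range(n)]
    return q, k, v
```

Calling both functions on `toy_inputs()` with `block=4` should give matching outputs up to floating-point tolerance; the block-diagonal version simply never touches the off-diagonal tiles.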

17m read time · From pytorch.org
Table of contents
1. Introduction
2. Why Block Attention?
3. Kernel Architecture: A Warp-Specialized Pipeline
4. The Backward Pass: Gradients Without the Logsumexp Tensor
5. Scheduling for Variable-Length Sequences
6. Fused Rotary Backward: Higher Precision at Higher Speed
7. Performance Results
8. Applicability
9. Conclusion
Acknowledgements
References
