TLX Block Attention: A Warp-Specialized Blackwell Kernel for Fixed-Block Sparse Self-Attention – PyTorch

TLX Block Attention is a Triton kernel for NVIDIA Blackwell GPUs that uses compile-time knowledge of block-diagonal attention patterns to remove the algorithmic overhead of general-purpose attention implementations. Because every Q tile attends to exactly one K/V tile, the kernel needs no multi-tile iteration, no online softmax correction, no logsumexp tensor storage, and no Di preprocessing kernel. The forward pass runs as a warp-specialized 5-stage pipeline and the backward pass as a 7-stage pipeline, both with asymmetric register allocation across hardware units. On B200 GPUs, the kernel achieves ~1.85× forward and ~2.50× backward speedup over Flash Attention v2. Fusing rotary embeddings into the backward epilogue yields a 3.54× combined speedup and reduces the number of BF16 quantization points from 2 to 1, which improves gradient accuracy. The kernel is open-sourced and targets production recommendation/ads ranking workloads with batch sizes of 1152, sequences of up to ~4k tokens, and ~70% attention sparsity.
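The claim that each Q tile attends to exactly one K/V tile can be illustrated with a small NumPy sketch. This is not the TLX kernel or its API; the function names, the block size of 4, and the tensor shapes are all assumptions made for illustration. It only shows that attention under a block-diagonal mask equals independent per-block attention, where each block needs a single plain softmax and no running max or sum carried across tiles.

```python
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def masked_full_attention(q, k, v, block):
    # Reference: dense scores with a block-diagonal mask applied.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    ids = np.arange(n) // block            # block index of each token
    mask = ids[:, None] == ids[None, :]    # True only inside the same block
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def block_attention(q, k, v, block):
    # Hypothetical per-block version: one Q tile, one K/V tile, one softmax.
    # No cross-tile state is kept, so no online softmax correction is needed.
    n, d = q.shape
    out = np.empty_like(v)
    for start in range(0, n, block):
        s = slice(start, start + block)
        scores = q[s] @ k[s].T / np.sqrt(d)
        out[s] = softmax(scores) @ v[s]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
reference = masked_full_attention(q, k, v, block=4)
result = block_attention(q, k, v, block=4)
max_abs_diff = float(np.abs(reference - result).max())
```

The per-block version never materializes the off-diagonal scores, which is where the work saved by a fixed-block sparse pattern comes from in this toy setting.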

#pytorch

Today • 17m read time • From pytorch.org

Table of contents

1. Introduction
2. Why Block Attention?
3. Kernel Architecture: A Warp-Specialized Pipeline
4. The Backward Pass: Gradients Without the Logsumexp Tensor
5. Scheduling for Variable-Length Sequences
6. Fused Rotary Backward: Higher Precision at Higher Speed
7. Performance Results
8. Applicability
9. Conclusion
Acknowledgements
References
