FlexAttention now supports a FlashAttention-4 (FA4) backend on Hopper and Blackwell GPUs, delivering 1.2×–3.2× speedups over the existing Triton implementation on compute-bound workloads. The integration required extending FA4 with score-modification hooks and block-sparse iteration in both the forward and backward passes.
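For context, here is a minimal sketch of the kind of FlexAttention call this backend targets, using the public `flex_attention` API: the `score_mod` callable is where score-modification hooks plug in, and `create_block_mask` produces the block-sparse structure that both passes iterate over. The ALiBi-style slope, tensor shapes, and dtype below are illustrative choices, not taken from the post.

```python
import torch
from torch.nn.attention.flex_attention import (
    create_block_mask,
    flex_attention,
)

# Illustrative ALiBi-style bias; the per-head slope here is arbitrary.
def alibi_bias(score, b, h, q_idx, kv_idx):
    return score - (q_idx - kv_idx) * torch.exp2(-(h + 1.0))

# Causal mask_mod; create_block_mask turns it into the block-sparse
# metadata the attention kernel iterates over.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=4096, KV_LEN=4096)

# Backend selection (the Triton template vs. a kernel like FA4)
# happens when the call is traced under torch.compile.
flex = torch.compile(flex_attention)
out = flex(q, k, v, score_mod=alibi_bias, block_mask=block_mask)
```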

Table of contents

- Blackwell: bigger tensor cores, bigger problems
- FlashAttention-4 as the foundation
- Inductor → CuTeDSL: the glue layer
- Flexifying FlashAttention-4
- Results
- Correctness and benchmark methodology
- Future work
- Thanks
- Further reading / links
