A comprehensive walkthrough of implementing Flash Attention for the NVIDIA RTX 5090 in CUDA C++, progressing through five optimization versions. Starting with a basic implementation achieving 68% of theoretical peak performance, the author systematically applies optimizations including shared memory swizzling to eliminate bank conflicts, multi-stage pipelining, and wider ldmatrix.x4 loads for K and V.
Table of contents

- Flash Attention algorithm
- Version 1 - Basic implementation
- Version 2 - Shared memory swizzling
- Version 3 - 2-stage pipelining
- Version 4 - ldmatrix.x4 for K and V
- Version 5 - better pipelining
- What's next?
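Before diving into the CUDA versions, it may help to see the core Flash Attention idea in isolation. The sketch below is a minimal NumPy illustration (not the author's CUDA code) of tiled attention with an online softmax: K and V are processed in tiles while a running row max and normalizer are maintained, so the full N x N score matrix is never materialized. The function name and tile size are illustrative choices.

```python
import numpy as np

def flash_attention(Q, K, V, tile=2):
    """Tiled attention with online softmax, the core Flash Attention idea.

    Processes K/V in tiles of `tile` rows, keeping a running row-wise max
    and softmax denominator so only one tile of scores exists at a time.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)            # running (unnormalized) output
    m = np.full(N, -np.inf)         # running row max of the scores
    l = np.zeros(N)                 # running softmax denominator
    for j in range(0, N, tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        S = (Q @ Kj.T) * scale                  # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))    # updated row max
        p = np.exp(S - m_new[:, None])          # tile probabilities (unnormalized)
        corr = np.exp(m - m_new)                # rescale older accumulators
        l = l * corr + p.sum(axis=1)
        O = O * corr[:, None] + p @ Vj
        m = m_new
    return O / l[:, None]
```

The result matches the naive `softmax(Q K^T / sqrt(d)) V` computed all at once; the CUDA versions discussed in the post apply the same recurrence per thread-block tile, which is where shared memory layout and pipelining start to matter.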