Stanford researchers introduce HipKittens, a C++ programming framework for writing high-performance AMD GPU kernels that match or exceed AMD's hand-optimized assembly implementations. The framework uses tile-based abstractions that generalize across GPU architectures, achieving state-of-the-art performance on attention, GEMM, and other AI kernels with significantly less code (about 500 lines, compared with raw assembly). The work addresses the software gap that keeps AMD GPUs from competing with NVIDIA on AI workloads, and demonstrates that peak AMD performance is achievable without programming in raw assembly.
Table of contents
Building towards multi-silicon AI systems
Climbing out of the CUDA moat: Introducing HipKittens
Multi-silicon AI is coming!