The post provides a detailed, iterative approach to optimizing a CUDA matrix multiplication (matmul) kernel until its performance comes close to NVIDIA's cuBLAS library. It covers topics like coalescing global memory accesses, utilizing shared memory, occupancy optimization, and achieving higher arithmetic intensity.
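As a reference point for the optimizations that follow, a minimal naive SGEMM kernel, with one thread per output element, might look like the sketch below. The kernel name and the row-major layout are illustrative assumptions, not necessarily the post's exact code:

```cuda
// Naive SGEMM sketch (illustrative): computes C = alpha * A @ B + beta * C
// for row-major A (M x K), B (K x N), C (M x N). Each thread computes one
// element of C, reading a full row of A and column of B from global memory.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
  const int x = blockIdx.x * blockDim.x + threadIdx.x;
  const int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < M && y < N) {
    float tmp = 0.0f;
    // Dot product of row x of A with column y of B.
    for (int i = 0; i < K; ++i)
      tmp += A[x * K + i] * B[i * N + y];
    C[x * N + y] = alpha * tmp + beta * C[x * N + y];
  }
}
```

Launched with, say, a 32×32 thread block and a grid of `⌈M/32⌉ × ⌈N/32⌉` blocks, every result element gets its own thread; the later kernels keep this same interface while restructuring how memory is accessed.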
Table of contents

- Kernel 1: Naive Implementation
- Kernel 2: Global Memory Coalescing
- Kernel 3: Shared Memory Cache-Blocking
- Kernel 4: 1D Blocktiling for Calculating Multiple Results per Thread
- Kernel 5: Increasing Arithmetic Intensity via 2D Blocktiling
- Kernel 6: Vectorize SMEM and GMEM Accesses
- Kernel 9: Autotuning
- Kernel 10: Warptiling
- Work in Progress: Kernel 11
- Conclusion
- Further Resources and References

I skipped kernels 7 and 8, which I wrote while figuring out how to best eliminate shared memory bank conflicts. They eliminate the conflicts but were overall still slower, so I won't cover them here.