The post provides a detailed, iterative approach to optimizing a CUDA matrix multiplication (matmul) kernel until its performance comes close to NVIDIA's cuBLAS library. It covers topics like coalescing global memory accesses, utilizing shared memory, occupancy optimization, and achieving higher arithmetic intensity.
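As a reference point for the optimizations that follow, a minimal naive SGEMM kernel, with one thread per output element, might look like the sketch below. The kernel name and the row-major layout are illustrative assumptions, not necessarily the post's exact code:

```cuda
// Naive SGEMM sketch (illustrative): computes C = alpha * A @ B + beta * C
// for row-major A (M x K), B (K x N), C (M x N). Each thread computes one
// element of C, reading a full row of A and column of B from global memory.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
  const int x = blockIdx.x * blockDim.x + threadIdx.x;
  const int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < M && y < N) {
    float tmp = 0.0f;
    // Dot product of row x of A with column y of B.
    for (int i = 0; i < K; ++i)
      tmp += A[x * K + i] * B[i * N + y];
    C[x * N + y] = alpha * tmp + beta * C[x * N + y];
  }
}
```

Launched with, say, a 32×32 thread block and a grid of `⌈M/32⌉ × ⌈N/32⌉` blocks, every result element gets its own thread; the later kernels keep this same interface while restructuring how memory is accessed.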
Table of contents

- Kernel 1: Naive Implementation
- Kernel 2: Global Memory Coalescing
- Kernel 3: Shared Memory Cache-Blocking
- Kernel 4: 1D Blocktiling for Calculating Multiple Results per Thread
- Kernel 5: Increasing Arithmetic Intensity via 2D Blocktiling
- Kernel 6: Vectorize SMEM and GMEM Accesses
- Kernel 9: Autotuning
- Kernel 10: Warptiling
- Work in Progress: Kernel 11
- Conclusion
- Further Resources and References

I skipped kernels 7 and 8, which I wrote while figuring out how to best eliminate shared memory bank conflicts. They eliminate the conflicts but were overall still slower, so I won't cover them here.