CUTLASS 3.x introduces a hierarchical system of composable abstractions for GEMM kernel design on NVIDIA GPUs. The redesign features five layers (Atom, Tiled MMA/Copy, Collective, Kernel, and Device) that allow developers to build highly customizable GEMM implementations while maximizing code reuse. Key improvements include better support for Hopper and Blackwell architectures, enhanced code readability, and advanced features like warp-specialization schemes, tile scheduling, and fusion operations through Epilogue Visitor Trees.
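
To make the layering concrete, here is a minimal CPU-only sketch of how five nested template layers might compose. Every name in it (`FmaAtom`, `TiledMma`, `CollectiveMainloop`, `ScaleEpilogue`, `GemmKernel`, `GemmDevice`, `ToyGemm`) is hypothetical and chosen for illustration; none of this is the CUTLASS API, and it assumes row-major matrices whose dimensions are multiples of the tile shape.

```cpp
#include <cassert>

// Toy illustration only: hypothetical names, NOT the CUTLASS API.

// Layer 1, Atom: the smallest unit of work, here one scalar multiply-add.
struct FmaAtom {
    static void mma(float& d, float a, float b) { d += a * b; }
};

// Layer 2, Tiled MMA: repeats the atom over a TM x TN tile for one k step.
template <class Atom, int TM, int TN>
struct TiledMma {
    static constexpr int kTileM = TM;
    static constexpr int kTileN = TN;
    static void step(float (&acc)[TM][TN], const float (&a)[TM], const float (&b)[TN]) {
        for (int i = 0; i < TM; ++i)
            for (int j = 0; j < TN; ++j)
                Atom::mma(acc[i][j], a[i], b[j]);
    }
};

// Layer 3, Collective (mainloop): walks the K dimension and feeds the tiled MMA.
template <class Tiled>
struct CollectiveMainloop {
    using TiledOp = Tiled;
    static void run(float (&acc)[Tiled::kTileM][Tiled::kTileN],
                    const float* A, const float* B, int m0, int n0, int N, int K) {
        for (int k = 0; k < K; ++k) {
            float a[Tiled::kTileM];
            float b[Tiled::kTileN];
            for (int i = 0; i < Tiled::kTileM; ++i) a[i] = A[(m0 + i) * K + k];
            for (int j = 0; j < Tiled::kTileN; ++j) b[j] = B[k * N + (n0 + j)];
            Tiled::step(acc, a, b);
        }
    }
};

// Layer 3, Collective (epilogue): scales the accumulators and writes them to C.
template <class Tiled>
struct ScaleEpilogue {
    static void store(const float (&acc)[Tiled::kTileM][Tiled::kTileN],
                      float alpha, float* C, int m0, int n0, int N) {
        for (int i = 0; i < Tiled::kTileM; ++i)
            for (int j = 0; j < Tiled::kTileN; ++j)
                C[(m0 + i) * N + (n0 + j)] = alpha * acc[i][j];
    }
};

// Layer 4, Kernel: pairs a mainloop with an epilogue to produce one output tile.
template <class Mainloop, class Epilogue>
struct GemmKernel {
    using Tiled = typename Mainloop::TiledOp;
    static void run_tile(const float* A, const float* B, float* C,
                         float alpha, int m0, int n0, int N, int K) {
        float acc[Tiled::kTileM][Tiled::kTileN] = {};  // zero-initialized accumulators
        Mainloop::run(acc, A, B, m0, n0, N, K);
        Epilogue::store(acc, alpha, C, m0, n0, N);
    }
};

// Layer 5, Device: host-facing entry point; the loops stand in for a grid launch.
template <class Kernel>
struct GemmDevice {
    static void run(int M, int N, int K, float alpha,
                    const float* A, const float* B, float* C) {
        using Tiled = typename Kernel::Tiled;
        assert(M % Tiled::kTileM == 0 && N % Tiled::kTileN == 0);
        for (int m0 = 0; m0 < M; m0 += Tiled::kTileM)
            for (int n0 = 0; n0 < N; n0 += Tiled::kTileN)
                Kernel::run_tile(A, B, C, alpha, m0, n0, N, K);
    }
};

// Composing the layers bottom-up, with a 1 x 2 output tile.
using ToyTiledMma = TiledMma<FmaAtom, 1, 2>;
using ToyGemm = GemmDevice<GemmKernel<CollectiveMainloop<ToyTiledMma>, ScaleEpilogue<ToyTiledMma>>>;
```

Calling `ToyGemm::run(M, N, K, alpha, A, B, C)` from a host function computes `C = alpha * A * B`. Swapping the tile shape or the epilogue type changes behavior without touching the other layers, which is the kind of reuse the hierarchy is meant to enable.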

12 min read. From developer.nvidia.com
Table of contents
A new conceptual GEMM hierarchy in CUTLASS 3.x
Collective layer: Mainloop
The collective builder
Collective layer: Epilogue
Kernel layer
Tile scheduling
Device layer
Summary
