GEMM optimization on GPUs is a modular problem. Performant implementations need to specify hyperparameters such as tile shapes, math and copy instructions…



CUTLASS 3.x introduces a hierarchical system of composable abstractions for GEMM kernel design on NVIDIA GPUs. The redesign features five layers (Atom, Tiled MMA/Copy, Collective, Kernel, and Device) that allow developers to build highly customizable GEMM implementations while maximizing code reuse. Key improvements include better support for Hopper and Blackwell architectures, enhanced code readability, and advanced features like warp-specialization schemes, tile scheduling, and fusion operations through Epilogue Visitor Trees.
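To make the layering concrete, here is a minimal CPU-only sketch of how five such layers might compose through template parameters. Every name in it (`FmaAtom`, `TiledMma`, `CollectiveMainloop`, `GemmKernel`, `DeviceGemm`, `LinearEpilogue`, `example_gemm`) is a hypothetical stand-in chosen for illustration, not a CUTLASS type, and the scalar loops only mimic the roles of the layers rather than any real GPU instruction or launch mechanism.

```cpp
#include <cassert>
#include <vector>

// NOTE: all types below are hypothetical illustrations, not CUTLASS APIs.

// Atom: the smallest unit of math, here one scalar multiply-accumulate.
struct FmaAtom {
    static void mma(float& d, float a, float b) { d += a * b; }
};

// Tiled MMA: repeats the atom over a TileM x TileN tile for a single k index.
template <class Atom, int TileM, int TileN>
struct TiledMma {
    static constexpr int kTileM = TileM;
    static constexpr int kTileN = TileN;
    static void mma(float (&acc)[TileM][TileN], const float* a, const float* b) {
        for (int i = 0; i < TileM; ++i)
            for (int j = 0; j < TileN; ++j)
                Atom::mma(acc[i][j], a[i], b[j]);
    }
};

// Collective: the mainloop for one output tile. It walks K, stages operand
// fragments (zero-padding out-of-range rows/columns), and calls the tiled MMA.
template <class Tiled>
struct CollectiveMainloop {
    static void run(float (&acc)[Tiled::kTileM][Tiled::kTileN],
                    const float* A, const float* B,
                    int M, int N, int K, int m0, int n0) {
        float a_frag[Tiled::kTileM];
        float b_frag[Tiled::kTileN];
        for (int k = 0; k < K; ++k) {
            for (int i = 0; i < Tiled::kTileM; ++i)
                a_frag[i] = (m0 + i < M) ? A[(m0 + i) * K + k] : 0.0f;
            for (int j = 0; j < Tiled::kTileN; ++j)
                b_frag[j] = (n0 + j < N) ? B[k * N + (n0 + j)] : 0.0f;
            Tiled::mma(acc, a_frag, b_frag);
        }
    }
};

// Epilogue: applied per element after the mainloop, C = alpha * acc + beta * C.
struct LinearEpilogue {
    float alpha;
    float beta;
    float operator()(float acc, float c) const { return alpha * acc + beta * c; }
};

// Kernel: combines a mainloop and an epilogue to produce one output tile.
template <class Tiled, class Epilogue>
struct GemmKernel {
    using TiledT = Tiled;
    using EpilogueT = Epilogue;
    using Mainloop = CollectiveMainloop<Tiled>;
    static void run_tile(const float* A, const float* B, float* C,
                         int M, int N, int K, int m0, int n0, Epilogue epi) {
        float acc[Tiled::kTileM][Tiled::kTileN] = {};  // zero-initialized accumulators
        Mainloop::run(acc, A, B, M, N, K, m0, n0);
        for (int i = 0; i < Tiled::kTileM; ++i)
            for (int j = 0; j < Tiled::kTileN; ++j)
                if (m0 + i < M && n0 + j < N) {
                    float& c = C[(m0 + i) * N + (n0 + j)];
                    c = epi(acc[i][j], c);
                }
    }
};

// Device: the host-facing entry point. A plain loop over tiles stands in for
// a kernel launch over a grid.
template <class Kernel>
struct DeviceGemm {
    static void run(const float* A, const float* B, float* C,
                    int M, int N, int K, typename Kernel::EpilogueT epi) {
        assert(M > 0 && N > 0 && K > 0);
        for (int m0 = 0; m0 < M; m0 += Kernel::TiledT::kTileM)
            for (int n0 = 0; n0 < N; n0 += Kernel::TiledT::kTileN)
                Kernel::run_tile(A, B, C, M, N, K, m0, n0, epi);
    }
};

// Usage: a row-major 3x2 by 2x3 product with 2x2 tiles, so edge tiles are exercised.
inline std::vector<float> example_gemm() {
    using Tiled = TiledMma<FmaAtom, 2, 2>;
    using Kernel = GemmKernel<Tiled, LinearEpilogue>;
    using Gemm = DeviceGemm<Kernel>;
    const int M = 3, N = 3, K = 2;
    std::vector<float> A = {1, 2, 3, 4, 5, 6};
    std::vector<float> B = {1, 0, 1, 0, 1, 1};
    std::vector<float> C(M * N, 0.0f);
    Gemm::run(A.data(), B.data(), C.data(), M, N, K, LinearEpilogue{1.0f, 0.0f});
    return C;
}
```

The point of the sketch is the reuse pattern: changing the tile shape or the epilogue means changing one template argument, while the other layers stay untouched.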

CUTLASS 3.x: Orthogonal, Reusable, and Composable Abstractions for GEMM Kernel Design

A new conceptual GEMM hierarchy in CUTLASS 3.x