Generating State-of-the-Art GEMMs with TorchInductor’s CuteDSL backend – PyTorch

TorchInductor now supports CuteDSL (NVGEMM) as a fourth autotuning backend for matrix multiplications, alongside Triton, CUTLASS C++, and cuBLAS. CuteDSL is a Python-based DSL built on the same abstractions as CUTLASS C++, but it compiles through a custom Python-to-MLIR compiler, which gives compile times comparable to Triton while exposing full control over the thread and memory hierarchy. The backend queries NVIDIA's cutlass_api for compatible kernel configurations, uses nvMatmulHeuristics to narrow hundreds of candidates to a handful, and then benchmarks only the top-ranked ones. On an NVIDIA B200 GPU, NVGEMM delivers kernel-level speedups of up to 1.78x over the existing backends on BF16, MXFP8, and NVFP4 GEMMs, particularly for decode-regime (small-M) shapes. End-to-end vLLM inference on Llama and Qwen3 models shows latency reductions of 2–6.5%. The backend is purely additive: unsupported operations fall back to the other backends automatically. Future work includes epilogue fusion, parallel precompilation, exportable config caches, and eventual replacement of the CUTLASS C++ backend.
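The "enumerate, rank with a heuristic, benchmark only the finalists" flow described above can be illustrated with a small, library-free sketch. Everything in it is an assumption made for illustration: `KernelConfig`, `enumerate_candidates`, `heuristic_cost`, `shortlist`, and `autotune` are hypothetical names, the padding-waste cost is a toy stand-in for whatever nvMatmulHeuristics actually computes, and no real cutlass_api or TorchInductor call is used.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class KernelConfig:
    # Hypothetical tile shape; real kernel configurations carry many more fields.
    tile_m: int
    tile_n: int
    tile_k: int


def enumerate_candidates() -> List[KernelConfig]:
    # Stand-in for querying a kernel library for compatible configurations.
    return [
        KernelConfig(tm, tn, tk)
        for tm in (64, 128, 256)
        for tn in (64, 128, 256)
        for tk in (32, 64)
    ]


def heuristic_cost(cfg: KernelConfig, m: int, n: int, k: int) -> float:
    # Toy analytical model (an assumption): ratio of padded work to useful work.
    def padded(dim: int, tile: int) -> int:
        return -(-dim // tile) * tile  # round up to a multiple of the tile

    work = padded(m, cfg.tile_m) * padded(n, cfg.tile_n) * padded(k, cfg.tile_k)
    return work / (m * n * k)


def shortlist(cands: List[KernelConfig], m: int, n: int, k: int, top_k: int) -> List[KernelConfig]:
    # Rank every candidate with the cheap heuristic and keep only the best few.
    return sorted(cands, key=lambda c: heuristic_cost(c, m, n, k))[:top_k]


def benchmark(fn: Callable[[], None], repeats: int = 5) -> float:
    # Best-of-N wall-clock timing; a GPU benchmark would also need device synchronization.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def autotune(
    m: int, n: int, k: int,
    make_kernel: Callable[[KernelConfig], Callable[[], None]],
    top_k: int = 3,
) -> Tuple[KernelConfig, int]:
    # Only the shortlisted configurations are ever built and timed.
    finalists = shortlist(enumerate_candidates(), m, n, k, top_k)
    timings = {cfg: benchmark(make_kernel(cfg)) for cfg in finalists}
    return min(timings, key=timings.get), len(timings)
```

For a decode-like shape such as `autotune(8, 4096, 4096, make_kernel)`, this toy heuristic prefers the smallest `tile_m`, because larger tiles along M waste more padded work. The point of the design is that the expensive step (compiling and timing a kernel) runs `top_k` times rather than once per candidate.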

Apr 07 • 14m read time • From pytorch.org

Table of contents

Introduction
Strategy: Why We Target GEMMs
Background: How TorchInductor Generates GEMMs
Architecture of the CuteDSL Backend
Results
CuteDSL Backend Supported Features
How You Can Try It
Future Work
Conclusion
