TorchInductor now supports CuteDSL (NVGEMM) as a fourth autotuning backend for matrix multiplications, alongside Triton, CUTLASS C++, and cuBLAS. CuteDSL is a Python-based DSL built on the same abstractions as CUTLASS C++, but it compiles through a custom Python-to-MLIR compiler, which gives compile times comparable to Triton's while still exposing full control over the thread and memory hierarchies. The backend queries NVIDIA's cutlass_api for compatible kernel configurations, uses nvMatmulHeuristics to narrow hundreds of candidates to a handful, and then benchmarks only the top-ranked ones. On an NVIDIA B200 GPU, NVGEMM delivers kernel-level speedups of up to 1.78x over the existing backends on BF16, MXFP8, and NVFP4 GEMMs, particularly for decode-regime (small-M) shapes. End-to-end vLLM inference on Llama and Qwen3 models shows latency reductions of 2–6.5%. The backend is purely additive: unsupported operations fall back to the other backends automatically. Future work includes epilogue fusion, parallel precompilation, exportable config caches, and eventual replacement of the CUTLASS C++ backend.
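The candidate-selection flow described above (enumerate compatible configurations, rank them with a heuristic, benchmark only a shortlist) can be illustrated with a minimal pure-Python sketch. Everything in it is an assumption made for illustration: `KernelConfig`, `heuristic_score`, and `select_kernel` are hypothetical names, the padding-waste score is a toy stand-in for nvMatmulHeuristics, and none of it reflects the actual cutlass_api or TorchInductor interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class KernelConfig:
    # Hypothetical stand-in for one kernel configuration (not a cutlass_api type).
    name: str
    tile_m: int
    tile_n: int


def heuristic_score(cfg: KernelConfig, m: int, n: int) -> float:
    # Toy heuristic (an assumption, not nvMatmulHeuristics): the fraction of the
    # padded output that is useful work. Small-M shapes favour small tile_m.
    padded_m = -(-m // cfg.tile_m) * cfg.tile_m  # round m up to a tile multiple
    padded_n = -(-n // cfg.tile_n) * cfg.tile_n  # round n up to a tile multiple
    return (m * n) / (padded_m * padded_n)


def select_kernel(
    candidates: List[KernelConfig],
    m: int,
    n: int,
    benchmark: Callable[[KernelConfig], float],
    top_k: int = 3,
) -> Tuple[KernelConfig, List[KernelConfig]]:
    # 1. Rank every candidate with the cheap heuristic (no GPU time spent).
    ranked = sorted(candidates, key=lambda c: heuristic_score(c, m, n), reverse=True)
    # 2. Keep only the top-ranked few.
    shortlist = ranked[:top_k]
    # 3. Run the expensive benchmark on the shortlist only.
    timings = {cfg: benchmark(cfg) for cfg in shortlist}
    # 4. Return the fastest measured candidate and the shortlist that was timed.
    best = min(timings, key=timings.get)
    return best, shortlist
```

Here `benchmark` is any callable that returns a measured latency for a configuration. The point of the design is that its cost is paid `top_k` times rather than once per candidate, which is what keeps autotuning time manageable when hundreds of configurations are compatible.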

14m read time · From pytorch.org
Table of contents
Introduction
Strategy: Why We Target GEMMs
Background: How TorchInductor Generates GEMMs
Architecture of the CuteDSL Backend
Results
CuteDSL Backend Supported Features
How You Can Try It
Future Work
Conclusion
