This post covers how to use the emerging Newton-Schulz-based optimizer Muon for large-scale LLM training on the NVIDIA GB300 NVL72 system. It introduces the key enabling technologies: a layer-wise distributed optimizer (versus the element-wise approach used for AdamW), distributed Newton-Schulz iterations (duplicated, distributed, and blockwise modes), and upcoming optimizations such as communication hiding, load balancing, and fused kernels. Performance benchmarks show near-parity with AdamW on models such as Kimi K2 and Qwen3 30B. The implementation is integrated into Megatron Core and available in the NeMo Emerging Optimizers repo.
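For context on what "Newton-Schulz-based" means in practice, the sketch below shows the quintic Newton-Schulz iteration that approximately orthogonalizes a gradient matrix, the core step of a Muon update. This is a minimal PyTorch illustration using the coefficients from the public Muon reference implementation; the function name and structure are illustrative and are not the Megatron Core API.

```python
import torch

def newton_schulz_orthogonalize(grad: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D gradient with quintic Newton-Schulz iterations."""
    assert grad.ndim == 2
    # Quintic iteration coefficients from the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = grad.bfloat16()
    # Work in the wide orientation so X @ X.T is the smaller Gram matrix.
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    # Scale so the spectral norm is at most 1 (Frobenius norm upper-bounds it).
    X = X / (X.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(grad.dtype)
```

The orthogonalized matrix replaces the raw momentum update for each 2D parameter; the duplicated, distributed, and blockwise modes described later in the post differ in how this iteration is partitioned across GPUs.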
Table of contents
- Muon training performance on NVIDIA GB300 NVL72
- Enabling technologies for large-scale Muon training
- What other optimizers does NVIDIA support for research?
- Reproducing Muon training results
- Get started with emerging optimizers for LLM training