This post covers how to use the emerging Newton-Schulz-based optimizer Muon for large-scale LLM training on the NVIDIA GB300 NVL72 system. It introduces the key enabling technologies: a layer-wise distributed optimizer (versus the element-wise approach used for AdamW), distributed Newton-Schulz iterations (duplicated, distributed, and blockwise modes), and upcoming optimizations such as communication hiding, load balancing, and fused kernels. Performance benchmarks show near-parity with AdamW on models such as Kimi K2 and Qwen3 30B. The implementation is integrated into Megatron Core and available in the NeMo Emerging Optimizers repo.
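For context on what "Newton-Schulz-based" means in practice, the sketch below shows the quintic Newton-Schulz iteration that approximately orthogonalizes a gradient matrix, the core step of a Muon update. This is a minimal PyTorch illustration using the coefficients from the public Muon reference implementation; the function name and structure are illustrative and are not the Megatron Core API.

```python
import torch

def newton_schulz_orthogonalize(grad: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D gradient with quintic Newton-Schulz iterations."""
    assert grad.ndim == 2
    # Quintic iteration coefficients from the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = grad.bfloat16()
    # Work in the wide orientation so X @ X.T is the smaller Gram matrix.
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    # Scale so the spectral norm is at most 1 (Frobenius norm upper-bounds it).
    X = X / (X.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(grad.dtype)
```

The orthogonalized matrix replaces the raw momentum update for each 2D parameter; the duplicated, distributed, and blockwise modes described later in the post differ in how this iteration is partitioned across GPUs.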
Table of contents
- Muon training performance on NVIDIA GB300 NVL72
- Enabling technologies for large-scale Muon training
- What other optimizers does NVIDIA support for research?
- Reproducing Muon training results
- Get started with emerging optimizers for LLM training