The Muon optimizer has demonstrated remarkable performance in accelerating machine learning model training, often outperforming the widely used AdamW optimizer. In this video, we cover the basic concept of how Muon works and discuss some recent improvements that make it practical for large-scale LLM training.

00:00 Why Muon?
00:36 Reviewing Adam
02:13 Linear layer
04:24 Solving orthogonalization with SVD
06:28 Newton-Schulz iteration - Odd polynomial matrix
08:11 Newton-Schulz iteration - Example
10:35 The Muon optimizer
11:49 The exploding attention logit crisis
15:13 MuonClip: Extending QK-clip to Multi-head Latent Attention (MLA)
17:24 Results of MuonClip

References:
- Muon: An optimizer for hidden layers in neural networks https://kellerjordan.github.io/posts/muon/
- Deriving Muon https://jeremybernste.in/writing/deriving-muon
- Old Optimizer, New Norm: An Anthology https://arxiv.org/abs/2409.20325
- Muon is Scalable for LLM Training https://arxiv.org/abs/2502.16982
- MuonClip (Kimi K2 technical report) https://arxiv.org/abs/2507.20534
- Fantastic Pretraining Optimizers and Where to Find Them https://arxiv.org/abs/2509.02046

Check out my other video to learn more about AdamW: https://youtu.be/1_nujVNUsto

Video made with Manim: https://www.manim.community/

Jia-Bin Huang

An explanation of the Muon optimizer, a recent alternative to AdamW for training machine learning models. Muon orthogonalizes the momentum matrix with a Newton-Schulz iteration, an odd matrix polynomial applied repeatedly that approximates SVD-based orthogonalization without computing an SVD, which amplifies underrepresented gradient directions. In reported scaling experiments, this makes it roughly twice as computationally efficient as AdamW. The video also covers extensions for large-scale training: weight decay, learning rate scaling by matrix size, and QK-clip (and MuonClip for Multi-head Latent Attention) to prevent attention logits from exploding during training.
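
The orthogonalization step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation shown in the video: the quintic coefficients are the ones given in Keller Jordan's Muon write-up (first reference), while the function names, the learning rate, and the momentum value are assumptions chosen for the example. Nesterov momentum, weight decay, and the matrix-size learning rate scaling are left out to keep it short.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximate the orthogonal factor U V^T of G = U S V^T without an SVD."""
    # Quintic coefficients from Keller Jordan's Muon write-up.
    a, b, c = 3.4445, -4.7750, 2.0315
    # Dividing by the Frobenius norm puts every singular value in [0, 1].
    X = G / (np.linalg.norm(G) + eps)
    # Work on the wide orientation so that X @ X.T is the smaller square matrix.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        # Equivalent to a*X + b*(X X^T)X + c*(X X^T)^2 X: an odd polynomial in X,
        # so it acts on each singular value independently.
        X = a * X + B @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon update (hypothetical helper, assumed hyperparameters)."""
    momentum_buf = beta * momentum_buf + grad           # accumulate momentum
    update = newton_schulz_orthogonalize(momentum_buf)  # push singular values toward 1
    return W - lr * update, momentum_buf
```

After five iterations the singular values of the result are only approximately 1. They land in a band around 1 rather than converging exactly, which the write-up reports is sufficient in practice.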

This Simple Optimizer Is Revolutionizing How We Train AI [Muon]