This Simple Optimizer Is Revolutionizing How We Train AI [Muon]
An explanation of the Muon optimizer, a recent alternative to AdamW for training machine learning models. Muon orthogonalizes the momentum matrix using an iterative polynomial (Newton-Schulz) approximation of the orthogonal factor that an SVD would produce. This pushes all singular values of the update toward 1, which amplifies underrepresented gradient directions. The post reports that this makes Muon roughly twice as computationally efficient as AdamW. It also covers extensions for large-scale training: weight decay, learning rate scaling by matrix size, and QK Clip (and Muon Clip for multi-head latent attention) to prevent attention logit explosion during training.
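The orthogonalization step described in the summary can be illustrated with a minimal NumPy sketch. It is not the code from the post: the function names `orthogonalize` and `muon_step`, the default hyperparameters, and the plain heavy-ball momentum are assumptions made for illustration. It also swaps in the classical cubic Newton-Schulz iteration, which converges to the exact orthogonal factor. Muon implementations typically use a tuned higher-degree polynomial that gets close enough in a handful of steps.

```python
import numpy as np


def orthogonalize(m, steps=20, eps=1e-7):
    """Approximate U @ V.T for m = U @ diag(s) @ V.T without computing an SVD.

    Uses the classical cubic Newton-Schulz iteration (an assumption of this
    sketch; Muon itself typically uses a tuned higher-degree polynomial).
    """
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the iteration.
    x = m / (np.linalg.norm(m) + eps)
    # Work on the wide orientation so that x @ x.T is the smaller product.
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which drives
        # it toward 1 while leaving the singular vectors unchanged.
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x


def muon_step(weight, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.0):
    """One hypothetical Muon-style update for a single 2D weight matrix."""
    momentum = beta * momentum + grad      # accumulate momentum
    update = orthogonalize(momentum)       # equalize the update's singular values
    # Decoupled weight decay, applied in the same way AdamW applies it.
    weight = weight - lr * (update + weight_decay * weight)
    return weight, momentum


# A gradient dominated by one direction: its two singular values differ a lot.
grad = np.array([[3.0, 1.0, 0.0],
                 [1.0, 2.0, 1.0]])
print(np.linalg.svd(grad, compute_uv=False))
# After orthogonalization both singular values are close to 1, so the weaker
# direction contributes as much to the update as the dominant one.
print(np.linalg.svd(orthogonalize(grad), compute_uv=False))
```

The cubic iteration is used here because its behaviour is easy to check by hand. With a poorly conditioned momentum matrix it needs more steps, which is the motivation for the faster polynomials used in practice.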
•17m watch time