How do machines learn? In this video, we review the basic ideas behind optimizers: algorithms that update the parameters of deep neural networks to minimize the loss function. We will cover gradient descent, momentum, RMSProp, Adam, and AdamW.

References:

[RMSProp] https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

[Adam] Adam: A Method for Stochastic Optimization
https://arxiv.org/pdf/1412.6980

[AdamW] Decoupled Weight Decay Regularization
https://openreview.net/forum?id=Bkg6RiCqY7

Jia-Bin Huang

A visual walkthrough of how machines learn through optimization, starting from basic gradient descent and building up to AdamW. Covers the intuition behind stochastic gradient descent, momentum (a moving average of gradients), RMSProp (adaptive learning rates via a moving average of squared gradients), and how Adam combines both. Explains the bias correction step in Adam, then identifies why Adam with L2 regularization can generalize worse than SGD with momentum: the L2 penalty is added to the gradient, so it is rescaled by the second moment estimate along with everything else, which weakens the effective weight decay. AdamW fixes this by decoupling weight decay from the gradient update step, which the AdamW paper reports substantially improves generalization.
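To make the difference between Adam with L2 regularization and AdamW concrete, here is a minimal sketch of a single update step for one scalar parameter. It is an illustration under stated assumptions, not code from the video: the function name `adam_step`, its argument names, and the hyperparameter values are all hypothetical, and the decay term is scaled by the learning rate, as many common implementations do.

```python
import math


def adam_step(w, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, weight_decay=0.0):
    """One illustrative Adam/AdamW update for a single scalar weight.

    l2 > 0           : Adam with L2 regularization (penalty folded into the gradient)
    weight_decay > 0 : AdamW-style decoupled decay (applied directly to the weight)
    """
    g = g + l2 * w                        # L2 penalty enters the gradient...
    m = beta1 * m + (1 - beta1) * g       # momentum: moving average of gradients
    v = beta2 * v + (1 - beta2) * g * g   # RMSProp: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)          # bias correction for the zero-initialized averages
    v_hat = v / (1 - beta2 ** t)
    adaptive = m_hat / (math.sqrt(v_hat) + eps)  # ...so the penalty is rescaled here too
    # Decoupled decay uses the weight itself and bypasses the adaptive scaling.
    w = w - lr * (adaptive + weight_decay * w)
    return w, m, v


# First step (t=1) with no loss gradient, so only the regularizer acts on w = 4.0.
w_l2, _, _ = adam_step(4.0, 0.0, 0.0, 0.0, t=1, l2=0.5)
w_adamw, _, _ = adam_step(4.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.5)
print(w_l2, w_adamw)
```

In this toy setting the L2 variant moves the weight by about `lr` regardless of how large the penalty is, because the penalty gradient is divided by its own magnitude. The decoupled variant shrinks the weight in proportion to its size, by `lr * weight_decay * w`.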

The Algorithm that Helps Machines Learn [AdamW]