A visual walkthrough of how machines learn through optimization, starting from basic gradient descent and building up to AdamW. It covers the intuition behind stochastic gradient descent, momentum (a moving average of gradients), RMSProp (adaptive learning rates via a moving average of squared gradients), and how Adam combines both. It explains the bias correction step in Adam, then identifies why Adam can generalize worse than SGD with momentum: with L2 regularization, the penalty term is added to the gradient, so it is rescaled by the second moment estimate along with the rest of the gradient, which weakens the effective weight decay. AdamW fixes this by decoupling weight decay from the gradient update step, which the video presents as significantly improving generalization.
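
The difference between coupled L2 regularization and decoupled weight decay is easier to see in code. The following is a minimal sketch for a single scalar weight, not the video's implementation: the function name `adam_step`, its parameter names, and the toy numbers are assumptions chosen for illustration. Passing `l2` adds the penalty to the gradient before the moment estimates (the coupled form), while passing `weight_decay` shrinks the weight outside the adaptive rescaling (the AdamW form).

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, weight_decay=0.0):
    """One Adam-style update for a single scalar weight (illustrative sketch).

    l2           -- coupled L2 penalty: added to the gradient before the moments
    weight_decay -- decoupled decay (AdamW style): applied outside the rescaling
    """
    g = grad + l2 * w                        # coupled L2 joins the gradient here
    m = beta1 * m + (1 - beta1) * g          # momentum: moving average of gradients
    v = beta2 * v + (1 - beta2) * g * g      # RMSProp: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero-initialized m
    v_hat = v / (1 - beta2 ** t)             # bias correction for the zero-initialized v
    adaptive = m_hat / (math.sqrt(v_hat) + eps)
    # The decay term is not divided by sqrt(v_hat), so a large gradient cannot shrink it.
    w = w - lr * (adaptive + weight_decay * w)
    return w, m, v


# Toy comparison: one step from w = 1.0 with a large raw gradient of 100.0.
w_plain, _, _ = adam_step(1.0, 100.0, 0.0, 0.0, t=1, lr=0.1)
w_l2, _, _ = adam_step(1.0, 100.0, 0.0, 0.0, t=1, lr=0.1, l2=0.1)
w_adamw, _, _ = adam_step(1.0, 100.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.1)
print(w_plain, w_l2, w_adamw)
```

In this toy setting the coupled L2 run lands at essentially the same weight as the run with no regularization at all (about 0.9), because the penalty is normalized away together with the large gradient. The decoupled run moves further (to about 0.89), since its decay contributes `lr * weight_decay * w` regardless of gradient scale.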

7m watch time
