Take your personal data back with Incogni! Use code WELCHLABS and get 60% off an annual plan: http://incogni.com/welchlabs

New Patreon Rewards 29:48 - own a piece of Welch Labs history! https://www.patreon.com/welchlabs

Books & Posters
https://www.welchlabs.com/resources

Sections
0:00 - Intro
2:08 - No more spam calls w/ Incogni
3:45 - Toy Model
5:20 - y=mx+b
6:17 - Softmax
7:48 - Cross Entropy Loss
9:08 - Computing Gradients
12:31 - Backpropagation
18:23 - Gradient Descent
20:17 - Watching our Model Learn
23:53 - Scaling Up
25:45 - The Map of Language
28:13 - The time I quit YouTube
29:48 - New Patreon Rewards!

Special Thanks to Patrons https://www.patreon.com/welchlabs
Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich, Mitch Jacobs, Lauren Steely

References
Werbos, P. J. (1994). The roots of backpropagation: from ordered derivatives to neural networks and political forecasting. United Kingdom: Wiley. Newton quote is on p. 4; Werbos expands on the analogy on p. 4.
Olazaran, Mikel. "A sociological study of the official history of the perceptrons controversy." Social Studies of Science 26.3 (1996): 611-659. Minsky quote is on p. 393.
Widrow, Bernard. "Generalization and information storage in networks of adaline neurons." Self-organizing systems (1962): 435-461.

Historical Videos
http://youtube.com/watch?v=FwFduRA_L6Q
https://www.youtube.com/watch?v=ntIczNQKfjQ

Code: 
https://github.com/stephencwelch/manim_videos

Technical Notes
The large Llama training animation shows 8 of 16 layers: layers 1, 2, 7, 8, 9, 10, 15, and 16. Every third attention pattern is shown, and special tokens are ignored. MLP neurons are downsampled using max pooling. Only the weights and gradients above a specific percentile-based threshold are shown. Only query weights are shown going into each attention layer.
The coordinates of Paris are subtracted from all training examples in the four-city example as a simple normalization, which helps with convergence.
In some scenes, math is happening at higher precision behind the scenes, and results are rounded, which may create apparent inconsistencies. 
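
The Paris-centering note above can be sketched roughly as follows. This is a minimal illustration, not the code from the video: the city list, the approximate coordinates, and the helper name center_on are assumptions made for the example, and the cities other than Paris may not be the ones used in the video.

```python
# Approximate (latitude, longitude) pairs in degrees. Illustrative values only.
PARIS = (48.8566, 2.3522)

# Hypothetical city set: only Paris is confirmed by the technical note.
cities = {
    "Paris": PARIS,
    "Madrid": (40.4168, -3.7038),
    "Berlin": (52.5200, 13.4050),
}


def center_on(reference, point):
    """Subtract the reference coordinates from a point, component by component."""
    return (point[0] - reference[0], point[1] - reference[1])


# After centering, Paris sits at the origin and the other inputs are small
# offsets around it instead of large raw coordinates.
centered = {name: center_on(PARIS, coords) for name, coords in cities.items()}

for name, (d_lat, d_lon) in centered.items():
    print(f"{name}: ({d_lat:+.4f}, {d_lon:+.4f})")
```

Keeping the inputs near zero is a common way to make gradient descent better conditioned, which is consistent with the note that this step helps convergence.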

Written by: Stephen Welch 
Produced by: Stephen Welch, Sam Baskin, and Pranav Gundu
Special thanks to: Emily Zhang

Premium Beat IDs
EEDYZ3FP44YX8OWT
MWROXNAY0SPXCMBS

Welch Labs

Backpropagation, discovered by Paul Werbos in the 1970s, is the fundamental algorithm that trains virtually all modern AI models, including large language models like Llama. The algorithm uses calculus and the chain rule to efficiently compute gradients: the slopes of the loss function with respect to each model parameter. These gradients guide the learning process by indicating how to adjust each parameter to reduce prediction error. The video demonstrates backpropagation on a simplified GPS coordinate classification model, showing how the algorithm scales from basic linear models to complex neural networks capable of learning intricate patterns in high-dimensional spaces.
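
As a rough sketch of the pipeline listed in the sections above (y=mx+b, softmax, cross-entropy loss, gradients, gradient descent), the following trains a tiny classifier on one input coordinate. It is an illustration under stated assumptions, not the video's code: the three data points, the learning rate, the step count, and all function names are hypothetical.

```python
import math

# Hypothetical toy data: one normalized coordinate per example, one class per city.
xs = [-1.0, 0.0, 1.0]
labels = [0, 1, 2]
NUM_CLASSES = 3


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(m, b, x):
    """One line y = m*x + b per class, followed by softmax."""
    return softmax([m[k] * x + b[k] for k in range(NUM_CLASSES)])


def mean_loss(m, b):
    """Average cross-entropy: minus the log probability of the correct class."""
    return -sum(math.log(forward(m, b, x)[y]) for x, y in zip(xs, labels)) / len(xs)


def gradients(m, b):
    """Chain rule: dL/dlogit_k = p_k - 1[k == y], then dlogit_k/dm_k = x, dlogit_k/db_k = 1."""
    grad_m = [0.0] * NUM_CLASSES
    grad_b = [0.0] * NUM_CLASSES
    for x, y in zip(xs, labels):
        probs = forward(m, b, x)
        for k in range(NUM_CLASSES):
            dlogit = probs[k] - (1.0 if k == y else 0.0)
            grad_m[k] += dlogit * x / len(xs)
            grad_b[k] += dlogit / len(xs)
    return grad_m, grad_b


def train(steps=200, learning_rate=0.5):
    """Plain gradient descent starting from all-zero parameters."""
    m = [0.0] * NUM_CLASSES
    b = [0.0] * NUM_CLASSES
    history = [mean_loss(m, b)]
    for _ in range(steps):
        grad_m, grad_b = gradients(m, b)
        m = [mk - learning_rate * g for mk, g in zip(m, grad_m)]
        b = [bk - learning_rate * g for bk, g in zip(b, grad_b)]
        history.append(mean_loss(m, b))
    return m, b, history


m, b, history = train()
print(f"loss before training: {history[0]:.4f}")
print(f"loss after training:  {history[-1]:.4f}")
```

With all parameters at zero, every class gets probability 1/3, so the starting loss is ln(3). In a deeper network the same chain-rule step is repeated layer by layer, passing each layer's gradient back to the one before it.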

The F=ma of Artificial Intelligence

<p>Looking forward to watching this as a break before my math exam on Friday. Looks like the perfect math procrastination video! :D</p>