A comprehensive walkthrough of how large language models are trained to reason using reinforcement learning, building from first principles. Covers policy gradient methods, the log-derivative trick, Monte Carlo estimation, actor-critic methods, trust region policy optimization (TRPO), and proximal policy optimization (PPO, in both penalty and clip variants), and culminates in Group Relative Policy Optimization (GRPO), the algorithm behind DeepSeek R1. Also explains generalized advantage estimation (GAE), importance sampling, bias-variance tradeoffs, and the Dr. GRPO refinement that addresses length and difficulty biases.
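To make the group-relative idea concrete, here is a minimal sketch of the advantage computation that gives GRPO its name, alongside the Dr. GRPO variant. It is an illustration under stated assumptions, not the video's or DeepSeek's implementation: the function names are hypothetical, each response is assumed to receive a single scalar reward, and the population standard deviation is an assumed choice (implementations differ on this). The sketch covers only the advantage step. Dr. GRPO is generally described as also dropping the per-response length normalization in the loss, which is not shown here.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: center each reward on the group mean,
    then divide by the group standard deviation.
    Assumption: population std; eps guards against a zero-variance group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantages (hypothetical helper): center on the group
    mean only. Skipping the std division avoids up-weighting prompts whose
    rewards barely vary, i.e. very easy or very hard questions."""
    mu = mean(rewards)
    return [r - mu for r in rewards]

# One prompt, four sampled responses, binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))
print(dr_grpo_advantages(rewards))
```

In both variants the group mean plays the role of the baseline, so no learned critic is needed. Responses that beat their siblings get a positive advantage and the rest get a negative one.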
•23m watch time