Reinforcement learning algorithms are the key driving force behind training reasoning LLMs (e.g., DeepSeek-R1, Google's Gemini Pro, OpenAI's o1/o3).

This video provides an overview of the key ideas behind these reinforcement learning algorithms, tracing their development from REINFORCE through value function estimation, actor-critic methods, generalized advantage estimation (GAE), TRPO, and PPO to GRPO.

00:00 Introduction
00:43 Notation
02:41 Policy gradient
05:11 Decomposing trajectory into states and actions
07:05 Baseline subtraction
07:58 Value function estimation
08:31 Advantage estimation 
11:11 Actor-critic methods
12:16 Trust region policy optimization
16:48 Proximal policy optimization
19:55 Group Relative Policy Optimization
21:58 Dr. GRPO

=== Resources ===
Three excellent resources I found particularly useful, if you are interested in learning more:
- Foundations of Deep RL, a 6-lecture series by Pieter Abbeel: https://www.youtube.com/playlist?list=PLwRJQ4m4UJjNymuBM9RdmB3Z9N5-0IlY0
- DeepMind x UCL | Introduction to Reinforcement Learning by David Silver: https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
- Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto: http://www.incompleteideas.net/book/the-book-2nd.html

=== References === 
- REINFORCE: https://link.springer.com/content/pdf/10.1007/BF00992696.pdf
- Actor-critic: https://arxiv.org/abs/1602.01783
- GAE: https://arxiv.org/abs/1506.02438
- TRPO: https://arxiv.org/abs/1502.05477
- PPO: https://arxiv.org/abs/1707.06347
- GRPO: https://arxiv.org/pdf/2402.03300
- DeepSeek-R1: https://arxiv.org/abs/2501.12948
- Dr. GRPO: https://arxiv.org/abs/2503.20783

Video made with Manim: https://www.manim.community/

Jia-Bin Huang

A comprehensive walkthrough of how large language models are trained to reason using reinforcement learning, building from first principles. It covers policy gradient methods, the log-derivative trick, Monte Carlo estimation, actor-critic methods, trust region policy optimization (TRPO), and PPO (both penalty and clip variants), and culminates in Group Relative Policy Optimization (GRPO), the algorithm behind DeepSeek-R1. It also explains generalized advantage estimation (GAE), importance sampling, bias-variance tradeoffs, and the Dr. GRPO refinement that addresses length and difficulty biases.
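
For readers who prefer code, here is a minimal sketch of two ideas named above: the group-relative advantage used by GRPO (with the Dr. GRPO-style option of dropping the standard-deviation scaling) and the PPO clipped surrogate. It is an illustration under stated assumptions, not the code from the video or from any of the papers: the function names are hypothetical, rewards are plain Python floats, the population standard deviation is used, log-probabilities are taken per whole response rather than per token, and `eps` and `clip_eps=0.2` are illustrative defaults.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_std=True, eps=1e-8):
    """Score each sampled response relative to its own group.

    rewards: scalar rewards for several responses to the SAME prompt.
    normalize_std=True  -> (r - mean) / (std + eps), GRPO-style.
    normalize_std=False -> r - mean only, Dr. GRPO-style (assumed variant).
    """
    baseline = mean(rewards)  # the group mean plays the role of the baseline
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered
    scale = pstdev(rewards) + eps  # eps guards against a zero-variance group
    return [c / scale for c in centered]


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-clip objective for one sample (to be maximized).

    logp_new / logp_old: log-probability of the sample under the current
    policy and under the policy that generated it.
    """
    ratio = math.exp(logp_new - logp_old)  # importance sampling ratio
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to push the ratio
    # outside the interval [1 - clip_eps, 1 + clip_eps].
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # e.g., correct / incorrect answers
    print(group_relative_advantages(rewards))
    print(group_relative_advantages(rewards, normalize_std=False))
    print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
```

With rewards of 1, 0, 0, 1 the group mean is 0.5, so correct responses get a positive advantage and incorrect ones a negative advantage, with no learned value function involved. In the clipped surrogate, a ratio of 1.5 with a positive advantage is capped at 1.2, which limits how far a single update can move the policy.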

How LLMs Learn to Reason [GRPO]