This is a (very) quick, one-minute summary of the development of Policy Gradient algorithms over the past 30 years.

For a more detailed explanation, check out this video: https://youtu.be/mg-iU-WxiNs

References:
- REINFORCE: https://link.springer.com/content/pdf/10.1007/BF00992696.pdf
- Actor-critic: https://arxiv.org/abs/1602.01783
- GAE: https://arxiv.org/abs/1506.02438
- TRPO: https://arxiv.org/abs/1502.05477
- PPO: https://arxiv.org/abs/1707.06347
- GRPO: https://arxiv.org/pdf/2402.03300
- DeepSeek-R1: https://arxiv.org/abs/2501.12948
- Dr. GRPO: https://arxiv.org/abs/2503.20783

Video made with Manim: https://www.manim.community/

Jia-Bin Huang

A concise overview of policy gradient algorithms in reinforcement learning. It covers the core idea of using gradient updates to improve a neural network policy, the log-derivative trick for estimating the gradient from samples, subtracting a value-function baseline so that all-positive rewards do not push up every sampled action, the bias-variance tradeoff between TD and multi-step rollout estimates, importance sampling for surrogate objectives, and PPO-style clipping for stable optimization. It also touches on the GRPO-style normalized-reward advantages used in reasoning models.
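
To make the last two ideas concrete, here is a minimal sketch in plain Python of GRPO-style group-normalized advantages fed into a PPO-style clipped surrogate. It is an illustration under stated assumptions, not the code from the video or any of the papers: the function names, the toy rewards and log-probabilities, and the values clip_eps=0.2 and eps=1e-8 are all hypothetical choices made for this example.

```python
import math
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: subtract the group mean, divide by the group std.

    `rewards` holds one scalar reward per sampled response to the same prompt.
    `eps` (an assumed small constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective (to be maximized), averaged over samples.

    The importance ratio pi_new / pi_old is computed from log-probabilities,
    then clipped to [1 - clip_eps, 1 + clip_eps]; the smaller of the clipped
    and unclipped terms is kept for each sample.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Toy data (hypothetical): four sampled responses, two of them rewarded.
rewards = [1.0, 0.0, 0.0, 1.0]
logp_old = [-1.0, -1.0, -1.0, -1.0]
logp_new = [-0.5, -1.0, -1.2, -1.0]

adv = group_normalized_advantages(rewards)
objective = clipped_surrogate(logp_new, logp_old, adv)
print(adv)
print(objective)
```

In this toy run the first sample has a positive advantage and a ratio above 1 + clip_eps, so its contribution is capped by the clip, which is the mechanism that keeps a single update from moving the policy too far. Dr. GRPO (linked above) revisits the normalization used in GRPO; this sketch keeps the plain mean-and-std form.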

Policy Gradient in One Minute