In this video, we dive deep into the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", which introduces GRPO (Group Relative Policy Optimization), a reinforcement learning (RL) algorithm that was later used to train DeepSeek-R1.

DeepSeekMath is a model by DeepSeek designed specifically to excel at mathematical reasoning. We walk through its full training process, which closely mirrors how general-purpose large language models (LLMs) are trained. One of the key stages in this pipeline is reinforcement learning using GRPO.

Since GRPO builds upon PPO (Proximal Policy Optimization), we first provide a high-level overview of PPO before diving into GRPO’s innovations and how it removes the need for a value model.

Paper - https://arxiv.org/abs/2402.03300
Written Review - https://aipapersacademy.com/deepseekmath-grpo/
___________________
🔔 Subscribe for more AI paper reviews!

📩 Join the newsletter → https://aipapersacademy.com/newsletter/

Become a patron - https://www.patreon.com/aipapersacademy

The video was edited using VideoScribe - https://tidd.ly/44TZEiX
___________________
Chapters:
0:00 Introduction
1:35 Math Pre-Training
4:55 Instruction-Tuning
5:45 PPO
7:45 GRPO
9:35 GRPO Objective

AI Papers Academy

A detailed walkthrough of the DeepSeekMath paper, covering the full training pipeline for a math-specialized LLM and introducing GRPO (Group Relative Policy Optimization), the reinforcement learning algorithm behind DeepSeek-R1. The pipeline includes iterative math data curation from Common Crawl, supervised fine-tuning, and RL with GRPO. GRPO improves on PPO by eliminating the value model: it samples multiple outputs per prompt and normalizes their rewards across the group to estimate each output's advantage. The optimization objective is explained in plain terms, including the policy ratio, clipping for training stability, KL penalty placement, and token-level gradient propagation. Both outcome supervision and process supervision variants are described.
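
For readers who prefer code, here is a minimal sketch of the pieces named in the summary above, under outcome supervision. It is an illustration under stated assumptions, not the paper's implementation. The function names, the use of the population standard deviation, the small `eps` stabilizer, and the `clip_eps=0.2` default are all assumptions made for this example. Log-probabilities are passed in as plain floats rather than computed by a real model.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled output's reward against its own group.

    `rewards` holds one scalar reward per output sampled for the same prompt.
    The group mean and standard deviation stand in for a learned value model.
    Assumption: population std plus a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_token_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for a single token.

    ratio = pi_new(token) / pi_old(token). Clipping the ratio limits how far
    one update can move the policy, which is the stability mechanism.
    Under outcome supervision, every token of an output shares that
    output's advantage.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def kl_penalty(logp_new, logp_ref):
    """Per-token KL estimate against a frozen reference policy.

    Uses the estimator pi_ref/pi_new - log(pi_ref/pi_new) - 1, which is
    always >= 0. In GRPO this term is subtracted from the objective
    directly instead of being folded into the reward.
    """
    log_ratio = logp_ref - logp_new
    return math.exp(log_ratio) - log_ratio - 1.0


# Example: four outputs sampled for one prompt, two correct and two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In this example the two correct outputs receive a positive advantage and the two incorrect ones a negative advantage, without any value model being trained. A full per-token objective would combine the clipped term with a weighted KL penalty and average over tokens and outputs.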

GRPO Reinforcement Learning Explained (DeepSeekMath Paper)