A detailed walkthrough of the DeepSeekMath paper, covering the full training pipeline for a math-specialized LLM and introducing GRPO (Group Relative Policy Optimization), the reinforcement learning algorithm behind DeepSeek R1. The pipeline consists of iterative math data curation from Common Crawl, supervised fine-tuning, and RL with GRPO. GRPO improves on PPO by eliminating the value model: it samples multiple outputs per prompt and normalizes their rewards within the group to estimate each output's advantage. The optimization objective is explained in plain terms, including the policy ratio, clipping for training stability, KL penalty placement, and token-level gradient propagation. Both the outcome supervision and process supervision variants are described.
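
The group-normalized advantage and the clipped policy ratio mentioned above can be illustrated with a minimal sketch. This is not the paper's or the video's code: the function names, the use of the population standard deviation, the small `eps` stabilizer, and the `clip_eps=0.2` default are assumptions made for illustration, and the KL penalty and token-level averaging are left out.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled outputs within the group.

    Hypothetical helper: each advantage is (reward - group mean) / group std.
    eps (an assumption) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term for a single token.

    ratio is pi_new(token) / pi_old(token); clipping it to
    [1 - clip_eps, 1 + clip_eps] limits how far one update can move the policy.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four sampled answers to one prompt, scored 1 (correct) or 0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Under outcome supervision, every token of an output would share that output's advantage, so correct answers in a mixed group are pushed up and incorrect ones pushed down without a learned value model.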

14m watch time
