In this video, we break down "DAPO: An Open-Source LLM Reinforcement Learning System at Scale", a new research paper from ByteDance that introduces DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a reinforcement learning (RL) algorithm built on GRPO (Group Relative Policy Optimization).

DAPO tackles key challenges in training large language models (LLMs) with RL, especially issues the researchers encountered when trying to reproduce DeepSeek-R1's results. They trained Qwen2.5-32B with DAPO and reached 50 points on the challenging AIME 2024 benchmark, outperforming the 47 points of DeepSeek-R1-Zero-Qwen-32B (DeepSeek's RL run on the same base model) while using only 50% of the training steps.

Written Review - https://aipapersacademy.com/dapo/
Paper - https://arxiv.org/abs/2503.14476
Code & Dataset - https://github.com/BytedTsinghua-SIA/DAPO

#ai #reinforcementlearning  #llm #deepseek #grpo #dapo #rl #airesearch 
___________________
Chapters:
0:00 Introduction
2:30 Introducing DAPO
5:05 Clip-Higher
7:45 Dynamic Sampling
9:35 Token-Level Loss
11:13 Overlong Responses
12:23 Ablation Study
12:57 KL Divergence Removal

AI Papers Academy

DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) is a reinforcement learning algorithm built on top of GRPO, developed by ByteDance researchers who struggled to reproduce DeepSeek-R1's results. Their initial GRPO run scored only 30 points on AIME 2024, versus the 47 reported by DeepSeek on the same base model. DAPO introduces four key improvements: (1) Clip-Higher: decoupling the upper and lower clipping bounds and raising the upper one, to prevent entropy collapse and allow more exploration; (2) Dynamic Sampling: filtering out questions where all sampled responses are correct or all are wrong, since such groups have zero advantage and contribute no gradient, and sampling further questions so the batch stays full of effective training signal; (3) Token-Level Loss: averaging the loss over all tokens in the batch rather than first averaging within each response, so that long responses are not under-weighted; and (4) Overlong Response Punishment: either filtering truncated responses out of the loss or applying a soft, graduated length penalty. Combined, these techniques push DAPO to 50 points on AIME 2024, outperforming DeepSeek's result on the same base model with only 50% of the training steps. ByteDance has open-sourced both the training data and the code.
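
To make the four ideas more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation (the released code is in the repository linked above). The function names, the argument layout, and the default clip values (eps_low=0.2, eps_high=0.28) are illustrative assumptions, as are the length limits used in the tests.

```python
def keep_prompt(correct_flags):
    """Dynamic sampling filter (sketch): keep a prompt only if its sampled
    responses are neither all correct nor all wrong. Otherwise every
    response gets the same reward and the group advantage is zero."""
    n_correct = sum(1 for c in correct_flags if c)
    return 0 < n_correct < len(correct_flags)


def dapo_token_loss(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Clipped policy loss with decoupled bounds and token-level averaging.

    ratios:     one list per response, holding the per-token probability
                ratio pi_new / pi_old (hypothetical input layout).
    advantages: one advantage per response, shared by all of its tokens.
    eps_low / eps_high: illustrative values, an assumption of this sketch.
    """
    total, n_tokens = 0.0, 0
    for resp_ratios, adv in zip(ratios, advantages):
        for r in resp_ratios:
            # Clip-Higher: the upper bound (1 + eps_high) is looser than
            # the lower bound (1 - eps_low).
            clipped = min(max(r, 1.0 - eps_low), 1.0 + eps_high)
            total += min(r * adv, clipped * adv)
            n_tokens += 1
    # Token-level loss: divide by the total token count, not by the
    # number of responses, so long responses carry proportional weight.
    return -total / n_tokens


def soft_overlong_penalty(length, max_len, buffer_len):
    """Soft length penalty (sketch): 0 up to (max_len - buffer_len), then
    decreasing linearly to -1 at max_len, and -1 beyond it."""
    soft_start = max_len - buffer_len
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        return (soft_start - length) / buffer_len
    return -1.0
```

In this sketch, a token with ratio 1.25 and a positive advantage is clipped to 1.2 under a symmetric bound of 0.2 but passes through unclipped when the upper bound is 0.28, which is the extra room for exploration that Clip-Higher is meant to give. The penalty returned by soft_overlong_penalty would be added to the correctness reward of a response.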

GRPO 2.0? DAPO LLM Reinforcement Learning Explained