DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) is a reinforcement learning system built on top of GRPO, developed by ByteDance researchers who struggled to reproduce the DeepSeek R1 results. Their initial GRPO run scored only 30 points on AIME 2024, versus 47 for DeepSeek R1. DAPO introduces four key improvements: (1) Clip Higher, which decouples the upper and lower clipping bounds and raises the upper one to prevent entropy collapse and allow more exploration; (2) Dynamic Sampling, which filters out questions where all sampled responses are correct or all are wrong (such groups have zero advantage and therefore contribute no gradient), maintaining the effective batch size and training signal; (3) Token-level Loss, which averages the loss across all tokens rather than per response, to better handle varying response lengths; and (4) Overlong Response Punishment, which either filters out truncated responses or applies a soft, graduated penalty. Combined, these techniques push DAPO to 50 points on AIME 2024, outperforming DeepSeek R1 with the same base model and only 50% of the training steps. ByteDance has open-sourced both the training data and the code.
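The four techniques can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released DAPO code: the function names, the list-of-lists data layout, the clip values `eps_low=0.2` and `eps_high=0.28`, and the length thresholds `max_len` and `buffer` are all hypothetical choices made for the example.

```python
import math


def token_level_clip_higher_loss(logp_new, logp_old, advantages,
                                 eps_low=0.2, eps_high=0.28):
    """Clipped surrogate loss with decoupled bounds, averaged over all tokens.

    logp_new / logp_old: one list of per-token log-probs per response.
    advantages: one scalar advantage per response.
    eps_low / eps_high are illustrative values (assumption).
    """
    total, n_tokens = 0.0, 0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        for new, old in zip(lp_new, lp_old):
            ratio = math.exp(new - old)
            # Clip Higher: the upper bound is set independently of the lower one.
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Token-level loss: divide by the token count of the whole batch,
    # so a long response contributes more terms than a short one.
    return -total / n_tokens


def keep_prompt(correct):
    """Dynamic Sampling filter: drop prompts whose samples are all right or all wrong."""
    n_correct = sum(1 for c in correct if c)
    return 0 < n_correct < len(correct)


def soft_overlong_penalty(length, max_len=100, buffer=20):
    """Graduated length penalty: 0 up to (max_len - buffer), then linear down to -1.

    max_len and buffer are hypothetical thresholds for this sketch.
    """
    soft_start = max_len - buffer
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        return (soft_start - length) / buffer
    return -1.0


if __name__ == "__main__":
    # A one-token and a three-token response, with the policy unchanged (ratio = 1).
    logp = [[-1.0], [-1.0, -1.0, -1.0]]
    print(token_level_clip_higher_loss(logp, logp, [1.0, -1.0]))
    print(keep_prompt([True, True, True]), keep_prompt([True, False, True]))
    print([soft_overlong_penalty(n) for n in (80, 90, 100, 120)])
```

In the first call, the three-token response contributes three terms to the average and the one-token response only one, which is the point of averaging per token rather than per response. A real training loop would compute the same quantities on batched tensors and would resample prompts until enough of them pass `keep_prompt`.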