Group Relative Policy Optimization (GRPO) is a reinforcement learning method designed to improve mathematical reasoning in language models. It drops the separate value function (critic) model that Proximal Policy Optimization (PPO) requires and instead derives its baseline from the scores of a group of outputs sampled for the same prompt, which simplifies training and reduces memory consumption. Unlike traditional PPO, GRPO also adds a KL divergence term directly to the loss function, which helps stabilize training. Applied to the DeepSeekMath model, GRPO improved performance on mathematical tasks.
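To make the "group scores" idea concrete, here is a minimal Python sketch of the two ingredients mentioned above: an advantage computed relative to a group of sampled outputs, and a per-token KL penalty against a reference policy. The function names, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, not details taken from a specific implementation. The KL term uses the commonly cited non-negative estimator `ratio - log(ratio) - 1`.

```python
import math
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to its own group.

    `rewards` holds one scalar reward per output sampled for the same prompt.
    The group mean plays the role of the baseline, so no value model is needed.
    `eps` (an assumed constant) guards against division by zero when all
    rewards in the group are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


def kl_penalty(logp_policy, logp_ref):
    """Per-token KL estimate between the current and the reference policy.

    Uses ratio - log(ratio) - 1 with ratio = pi_ref / pi_policy, which is
    always >= 0 and equals 0 when both policies assign the same probability.
    """
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


# Hypothetical rewards for four answers sampled for one math problem
# (1.0 = correct, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

In this sketch, correct answers receive positive advantages and incorrect ones negative advantages, and the advantages within a group sum to approximately zero. In a full training loop, the KL penalty would be scaled by a coefficient and subtracted from the clipped policy objective.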