Group Relative Policy Optimization (GRPO) is a reinforcement learning method designed to improve mathematical reasoning in language models. It simplifies training and reduces memory consumption by dropping the separate value function (critic) model: for each prompt it samples a group of outputs and uses the group's scores as the baseline for each output's advantage. Whereas Proximal Policy Optimization (PPO) typically applies its KL penalty through the reward, GRPO adds a KL divergence term against a reference policy directly to the loss function, which helps stabilize training. Applied to the DeepSeekMath model, GRPO showed significant performance improvements on mathematical tasks.
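
To make the two ideas above concrete, here is a minimal sketch in plain Python. It is not the DeepSeekMath implementation. The function names, the use of the population standard deviation, the `r - log r - 1` KL estimator, and the default values of `clip_eps` and `beta` are all illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to its own group (no critic model).

    Assumption: normalize by the group's mean and population std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_estimate(logp_policy, logp_ref):
    """Non-negative per-token KL estimate: r - log(r) - 1, r = pi_ref / pi_theta."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, beta=0.04):
    """Per-token loss: PPO-style clipped surrogate plus a KL term in the loss.

    clip_eps and beta are hypothetical defaults chosen for illustration.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # The KL penalty enters the loss directly, not the reward.
    return -(surrogate - beta * kl_estimate(logp_new, logp_ref))


# Hypothetical rewards for four sampled answers to one math prompt
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, so the group itself plays the role a learned value function plays in PPO. If every output in a group gets the same reward, all advantages are zero and that group contributes no policy gradient signal beyond the KL term.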

From marktechpost.com (3m read time)