Group Relative Policy Optimization (GRPO) is a reinforcement learning method introduced to improve mathematical reasoning in language models. It simplifies training and reduces memory consumption by dropping the separate value function (critic) model and estimating the baseline from the scores of a group of sampled outputs instead. Unlike traditional Proximal Policy Optimization (PPO), GRPO
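As a rough illustration of what "using group scores instead" of a value model can look like, the sketch below normalizes each sampled output's reward against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` stabilizer, and the choice of population standard deviation are assumptions made for this example and are not taken from the summary above; real implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per reward, normalized within the group.

    Hypothetical helper: the group mean plays the role of the baseline
    that a learned value function would provide in PPO.
    """
    if not rewards:
        return []
    baseline = mean(rewards)   # group mean replaces the critic's estimate
    spread = pstdev(rewards)   # population std; an assumption of this sketch
    # eps avoids division by zero when every reward in the group is equal
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group average receive a positive advantage and those below receive a negative one, so no extra model has to be trained or held in memory to supply the baseline. When all rewards in a group are identical, every advantage is zero and that group contributes no learning signal.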