NVIDIA recently introduced GDPO in a paper titled "GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization".

GDPO is a new reinforcement learning algorithm designed to fix GRPO’s limitations in multi-reward LLM training.

In this video, we explain how GDPO works, why standard GRPO fails with multiple rewards, and how reward-decoupled normalization improves advantage estimation and model performance.

Written Review - https://aipapersacademy.com/gdpo/
Paper - https://arxiv.org/abs/2601.05242
Code - https://github.com/NVlabs/GDPO
GRPO Deep Dive - https://aipapersacademy.com/deepseekmath-grpo/
___________________
🔔 Subscribe for more AI paper reviews!

📩 Join the newsletter → https://aipapersacademy.com/newsletter/

Patreon - https://www.patreon.com/aipapersacademy

The video was edited using VideoScribe - https://tidd.ly/44TZEiX
___________________
Chapters:
0:00 Introduction
1:51 GRPO Recap
3:30 Multi-Reward GRPO
4:30 GRPO Reward Collapse
6:00 GDPO's Fix
7:26 GDPO Results

AI Papers Academy

NVIDIA's GDPO (Group reward-Decoupled Normalization Policy Optimization) addresses a key limitation of GRPO when applied to multi-reward reinforcement learning for LLMs. Standard GRPO sums the reward signals before computing advantages, so distinct reward combinations can collapse into identical advantage values, which discards useful training signal. GDPO fixes this in three steps: it first normalizes each reward type independently within the sampled group (reward-decoupled normalization), then sums the per-reward advantages, and finally applies batch-level normalization for training stability. This produces significantly more distinct advantage values, preserving richer learning signals. Experiments fine-tuning Qwen2.5-Instruct (1.5B and 3B) show GDPO outperforms GRPO on both tool-calling accuracy and format compliance.
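
To make the collapse concrete, here is a minimal NumPy sketch of the two advantage computations described above. It is not the code from the NVlabs repository: the function names, the `EPS` constant, the use of population standard deviation, and the unweighted sum of rewards are all assumptions made for illustration. The toy data has two groups of two rollouts, each scored by two binary rewards.

```python
import numpy as np

EPS = 1e-6  # assumed small constant to avoid division by zero


def grpo_advantages(rewards):
    """GRPO-style: sum the rewards per rollout, then normalize within the group.

    rewards has shape (group_size, num_rewards).
    """
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + EPS)


def gdpo_group_advantages(rewards):
    """GDPO-style (sketch): normalize each reward column within the group,
    then sum the per-reward advantages for each rollout."""
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True)
    per_reward = (rewards - mean) / (std + EPS)
    return per_reward.sum(axis=1)


def batch_normalize(advantages):
    """Assumed form of the final batch-level normalization step."""
    return (advantages - advantages.mean()) / (advantages.std() + EPS)


# Hypothetical toy data: rows are rollouts, columns are two binary rewards.
group_a = np.array([[0.0, 0.0], [1.0, 0.0]])  # second rollout earns one reward
group_b = np.array([[0.0, 0.0], [1.0, 1.0]])  # second rollout earns both rewards

grpo_a = grpo_advantages(group_a)
grpo_b = grpo_advantages(group_b)

gdpo_a = gdpo_group_advantages(group_a)
gdpo_b = gdpo_group_advantages(group_b)

# Batch-level normalization over all rollouts from both groups.
gdpo_batch = batch_normalize(np.concatenate([gdpo_a, gdpo_b]))

print("GRPO group A:", grpo_a)
print("GRPO group B:", grpo_b)
print("GDPO group A:", gdpo_a)
print("GDPO group B:", gdpo_b)
print("GDPO after batch normalization:", gdpo_batch)
```

In this toy setup the summed-reward version gives both groups the same pair of advantages, even though the second rollout in group B satisfied both rewards and the one in group A satisfied only one. The decoupled version gives group B advantages roughly twice as large, and that difference survives the batch-level normalization.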

GDPO Explained: NVIDIA Fixes GRPO for LLM Reinforcement Learning