NVIDIA's GDPO (Group reward-Decoupled Normalization Policy Optimization) addresses a key limitation of GRPO when applied to multi-reward reinforcement learning for LLMs. Standard GRPO naively sums multiple reward signals before computing advantages, causing distinct reward combinations to collapse into identical advantage values and losing important training information. GDPO fixes this by first normalizing each reward type independently within the sampled group (reward-decoupled normalization), then summing the per-reward advantages, and finally applying batch-level normalization for training stability. This produces significantly more distinct advantage values, preserving richer learning signals. Experiments fine-tuning Qwen2.5-Instruct (1.5B and 3B) show GDPO outperforms GRPO on both tool-calling accuracy and format compliance.
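The difference between the two normalization orders can be illustrated with a small sketch. This is not the authors' implementation. The function names, the `EPS` constant, the use of population standard deviation, and the toy rewards are all assumptions made for illustration, and the batch here is a single group, so the final normalization only stands in for the batch-level step.

```python
from statistics import fmean, pstdev

EPS = 1e-6  # assumed small constant to avoid division by zero


def normalize(values):
    """Scale a list of numbers to zero mean and (roughly) unit variance."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def grpo_advantages(rewards):
    """GRPO-style: sum the rewards of each rollout, then normalize within the group."""
    return normalize([sum(r) for r in rewards])


def gdpo_advantages(rewards):
    """GDPO-style: normalize each reward type within the group, sum, normalize again."""
    # One normalized column per reward type (reward-decoupled normalization).
    per_reward = [normalize(list(column)) for column in zip(*rewards)]
    # Sum the per-reward advantages for each rollout.
    summed = [sum(parts) for parts in zip(*per_reward)]
    # Final normalization; with several groups this would run over the whole batch.
    return normalize(summed)


# Hypothetical group of four rollouts scored on (correctness, format).
rewards = [(1, 0), (0, 1), (0, 1), (0, 0)]

print("GRPO:", [round(a, 3) for a in grpo_advantages(rewards)])
print("GDPO:", [round(a, 3) for a in gdpo_advantages(rewards)])
```

In this toy group the first three rollouts all have a summed reward of 1, so the GRPO-style computation assigns them the same advantage even though the first one earned its reward for correctness and the other two for format. The GDPO-style computation gives the first rollout a larger advantage than the other two, because correctness is the rarer reward in the group.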