Researchers at CMU introduce POPE (Privileged On-Policy Exploration), a method for scaling reinforcement learning (RL) training of large language models on difficult problems. Current RL approaches plateau on hard problems because they rely on on-policy sampling, which fails to discover correct solutions. POPE addresses this by conditioning models on prefixes of human solutions as guidance during training, enabling exploration in states where reward is more accessible. This guided exploration improves solvability by ~13% compared to standard approaches while avoiding optimization pathologies such as entropy collapse. The method works through a stitching mechanism: instruction-following and reasoning capabilities allow models to transfer what they learn on guided problem variants to unguided ones.
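
The prefix-conditioning idea can be illustrated with a small sketch. This is not the authors' implementation: the prompt template, the word-level truncation, the `fraction` parameter, and the names `solution_prefix` and `build_prompts` are all assumptions made for illustration. It only shows how a hard problem might be expanded into an unguided prompt plus a guided variant that carries the start of a human solution; the model would still sample its own continuation from either prompt.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical template: the summary does not say how guidance is phrased.
GUIDANCE_TEMPLATE = (
    "{problem}\n\n"
    "Here is the beginning of a reference solution. "
    "Continue from it and finish the problem.\n"
    "{prefix}"
)


@dataclass
class TrainingPrompt:
    text: str
    guided: bool  # True if the prompt carries a human-solution prefix


def solution_prefix(solution: str, fraction: float) -> str:
    """Return the first `fraction` of the solution, measured in words.

    Word-level truncation is an assumption; a real system might cut
    at token or reasoning-step boundaries instead.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    words = solution.split()
    keep = int(len(words) * fraction)
    return " ".join(words[:keep])


def build_prompts(problem: str, solution: str, fraction: float) -> List[TrainingPrompt]:
    """Build the unguided prompt and, if the prefix is non-empty, a guided one."""
    prompts = [TrainingPrompt(text=problem, guided=False)]
    prefix = solution_prefix(solution, fraction)
    if prefix:
        guided_text = GUIDANCE_TEMPLATE.format(problem=problem, prefix=prefix)
        prompts.append(TrainingPrompt(text=guided_text, guided=True))
    return prompts


if __name__ == "__main__":
    for p in build_prompts(
        "Prove that the sum of two even integers is even.",
        "Let a = 2m and b = 2n. Then a + b = 2(m + n), which is even.",
        fraction=0.5,
    ):
        print("GUIDED" if p.guided else "UNGUIDED")
        print(p.text)
        print("---")
```

Keeping the unguided prompt in the same batch as the guided one mirrors the stitching claim in the summary: the guided variant makes reward reachable, while the unguided variant is the one whose performance ultimately matters.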

Table of contents
Addressing Exploration is Crucial for RL Scaling
POPE: Privileged On-Policy Exploration
Discussion and Future Perspectives