A recent paper titled 'Emergent Hierarchical Reasoning in LLMs Through Reinforcement Learning' investigates why RL enables reasoning in large language models and what causes the 'aha moments' observed during training. The paper proposes that LLMs already contain latent hierarchical reasoning from pretraining, and that RL unlocks it in two phases: the model first masters low-level procedural execution, then shifts to expanding high-level strategic planning. This phase shift is what produces the aha moments. The researchers introduce a metric, semantic diversity, to track how strategic planning evolves, and propose HICRA (Hierarchy-Aware Credit Assignment), a modification of GRPO that amplifies learning signals on strategic-planning tokens. HICRA consistently outperforms GRPO across mathematical and multimodal reasoning benchmarks.
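The core idea of hierarchy-aware credit assignment can be sketched as reweighting per-token advantages so that tokens identified as strategic planning receive a stronger learning signal. This is a minimal illustration, not the paper's implementation: the function name, the `alpha` amplification factor, and the binary planning mask are all hypothetical, and how planning tokens are actually identified is left out.

```python
import numpy as np

def amplify_planning_advantages(advantages, planning_mask, alpha=0.5):
    """Hypothetical sketch of hierarchy-aware credit assignment.

    advantages:    per-token advantages from a GRPO-style baseline, shape (T,)
    planning_mask: 1 where a token is judged strategic planning, else 0
    alpha:         amplification strength (illustrative value, not from the paper)

    Planning tokens get their advantage scaled by (1 + alpha); execution
    tokens keep their original advantage.
    """
    advantages = np.asarray(advantages, dtype=float)
    planning_mask = np.asarray(planning_mask, dtype=float)
    return advantages * (1.0 + alpha * planning_mask)

# Example: three tokens, the last two flagged as planning tokens.
scaled = amplify_planning_advantages([1.0, -1.0, 2.0], [0, 1, 1], alpha=0.5)
# Execution token unchanged (1.0); planning tokens amplified (-1.5, 3.0).
```

The design choice here is that amplification is applied symmetrically: a negative advantage on a planning token is also magnified, so bad strategic choices are penalized more strongly, not just good ones rewarded.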