Meta's SWE-RL paper proposes scaling reinforcement learning to real-world software engineering tasks, addressing a limitation of models like DeepSeek R1, whose RL training focuses on competitive programming and math. The researchers curated a dataset of ~11 million high-quality GitHub pull requests from 4.6 million repositories, using issue descriptions, comments, and code context as training inputs. The model (Llama3-SWE-RL-70B) is trained with Group Relative Policy Optimization (GRPO) and a rule-based reward derived from the similarity between the predicted patch and the patch that was actually merged. The resulting model achieves 41% pass@1 on SWE-bench Verified, a new state of the art for open-source models under 100B parameters. Notably, RL training produces emergent general reasoning behaviors, including divide-and-conquer strategies and self-reflection, that transfer to out-of-domain tasks not seen during training. On those tasks the RL-trained model consistently outperforms a supervised fine-tuning baseline.
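The reward and the group-relative step can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the summary does not say which similarity metric is used, so Python's `difflib.SequenceMatcher` stands in for it here, and the `-1.0` penalty for an unusable prediction, the `eps` term, and the function names `patch_reward` and `group_relative_advantages` are assumptions made for the example.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def patch_reward(predicted, merged):
    """Rule-based reward: text similarity between predicted and merged patch.

    Assumption: a missing/unparseable prediction (None) gets a fixed penalty.
    """
    if predicted is None:
        return -1.0
    # ratio() returns a similarity score in [0, 1]
    return SequenceMatcher(None, predicted, merged).ratio()


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of sampled patches for a single issue
merged = "-    return a - b\n+    return a + b\n"
samples = [
    "-    return a - b\n+    return a + b\n",  # matches the merged patch
    "-    return a - b\n+    return b + a\n",  # close, but not identical
    None,                                      # unusable output
]
rewards = [patch_reward(s, merged) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the other samples for the same issue, no separate learned value model is needed; samples that score above the group mean are reinforced and those below it are discouraged.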
