METR researchers had 4 active maintainers from 3 SWE-bench Verified repositories review 296 AI-generated pull requests to assess whether benchmark scores translate to real-world usefulness. The key finding: maintainer merge rates are, on average, 24 percentage points lower than SWE-bench automated grader scores, meaning roughly
From metr.org · 18 min read
Table of contents
- Introduction
- Data and Methods
- Results
- Conclusion
- A1. Conditional Maintainer Merge Rate
- A2. Sample Representativeness
- A3. False Negative Correction
- A4. Raw (Unnormalized) Pass Rates
- A5. Raw (Unnormalized) Progress-Based Pass Rates
- A6. SOTA Models Only
- A7. Results by Repository
- A8. Ordering Effects
- A9. Time Horizon Analysis