METR researchers had 4 active maintainers from 3 SWE-bench Verified repositories review 296 AI-generated pull requests to assess whether benchmark scores translate to real-world usefulness. The key finding: maintainer merge rates are on average 24 percentage points lower than SWE-bench automated grader scores, meaning roughly
Table of contents
Introduction
Data and Methods
Results
Conclusion
A1. Conditional Maintainer Merge Rate
A2. Sample Representativeness
A3. False Negative Correction
A4. Raw (Unnormalized) Pass Rates
A5. Raw (Unnormalized) Progress-Based Pass Rates
A6. SOTA Models Only
A7. Results by Repository
A8. Ordering Effects
A9. Time Horizon Analysis