METR researchers had 4 active maintainers from 3 SWE-bench Verified repositories review 296 AI-generated pull requests to assess whether benchmark scores translate to real-world usefulness. The key finding: maintainer merge rates are on average 24 percentage points lower than SWE-bench automated grader scores, meaning roughly half of test-passing PRs would not be merged. After normalizing for noise in maintainer decisions using a golden baseline of human-written PRs, AI agents score about 50% of what the benchmark suggests. Rejection reasons include code quality issues, breaking unrelated code, and core functionality failures. The study also finds that maintainer-assessed quality may be improving more slowly than benchmark scores suggest, though this finding is weaker. Importantly, the authors note that this is not a fundamental capability ceiling: agents were not given the chance to iterate based on feedback as human developers would. The conclusion is that SWE-bench scores should be treated as one signal among many, not a direct proxy for real-world developer productivity.
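
The normalization step can be illustrated with a small sketch. It assumes the normalization is a plain ratio (an agent's merge rate divided by the merge rate of the human-written golden baseline), which this summary does not spell out. The function names and every number below are hypothetical and are not figures from the study.

```python
def normalized_merge_rate(agent_merge_rate: float, golden_merge_rate: float) -> float:
    """Rescale an agent's merge rate by the human-written baseline's merge rate.

    Assumption: normalization is a simple ratio; the study may do it differently.
    """
    if not 0.0 < golden_merge_rate <= 1.0:
        raise ValueError("golden_merge_rate must be in (0, 1]")
    return agent_merge_rate / golden_merge_rate


def share_of_benchmark(normalized_rate: float, benchmark_pass_rate: float) -> float:
    """Express the normalized merge rate as a fraction of the benchmark score."""
    if not 0.0 < benchmark_pass_rate <= 1.0:
        raise ValueError("benchmark_pass_rate must be in (0, 1]")
    return normalized_rate / benchmark_pass_rate


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
benchmark_pass_rate = 0.60  # fraction of PRs passing the automated grader
agent_merge_rate = 0.24     # fraction of agent PRs maintainers would merge
golden_merge_rate = 0.80    # fraction of human-written PRs maintainers would merge

normalized = normalized_merge_rate(agent_merge_rate, golden_merge_rate)
share = share_of_benchmark(normalized, benchmark_pass_rate)
print(f"normalized merge rate: {normalized:.2f}")
print(f"share of benchmark score: {share:.2f}")
```

With these made-up inputs the normalized merge rate is 0.30, half of the 0.60 benchmark score. Dividing by the baseline is meant to account for the fact that maintainers do not merge every human-written PR either.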

18m read time, from metr.org
Table of contents
Introduction
Data and Methods
Results
Conclusion
A1. Conditional Maintainer Merge Rate
A2. Sample Representativeness
A3. False Negative Correction
A4. Raw (Unnormalized) Pass Rates
A5. Raw (Unnormalized) Progress-Based Pass Rates
A6. SOTA Models Only
A7. Results by Repository
A8. Ordering Effects
A9. Time Horizon Analysis