METR researchers had 4 active maintainers from 3 SWE-bench Verified repositories review 296 AI-generated pull requests to assess whether benchmark scores translate to real-world usefulness. The key finding: maintainer merge rates are on average 24 percentage points lower than SWE-bench automated grader scores, meaning roughly half of test-passing PRs would not be merged. After normalizing for noise in maintainer decisions using a golden baseline of human-written PRs, AI agents score about 50% of what the benchmark suggests. Rejection reasons include code quality issues, breaking unrelated code, and core functionality failures. The study also finds that the rate of improvement in maintainer-assessed quality may be slower than benchmark scores suggest, though this finding is weaker. Importantly, the authors note that this is not a fundamental capability ceiling: agents were not given the chance to iterate based on feedback as human developers would. The conclusion is that SWE-bench scores should be treated as one signal among many, not as a direct proxy for real-world developer productivity.
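The two quantities in the summary above, a gap in percentage points and a rate normalized against a golden baseline, can be sketched as follows. This is a minimal illustration, not the study's code. It assumes the normalization divides the AI merge rate by the merge rate of the human-written baseline PRs, which the summary does not spell out. The function names and all numbers in the usage section are hypothetical and are not taken from the study.

```python
def percentage_point_gap(benchmark_pass_rate: float, maintainer_merge_rate: float) -> float:
    """Gap between two rates, in percentage points (not percent)."""
    return (benchmark_pass_rate - maintainer_merge_rate) * 100.0


def normalize_by_baseline(ai_merge_rate: float, golden_merge_rate: float) -> float:
    """Rescale an AI merge rate by the merge rate of a golden baseline.

    ASSUMPTION: normalization is a simple ratio. If maintainers reject some
    known-good human PRs, the baseline rate is below 1, and dividing by it
    discounts that reviewer noise.
    """
    if not 0.0 < golden_merge_rate <= 1.0:
        raise ValueError("golden_merge_rate must be in (0, 1]")
    if not 0.0 <= ai_merge_rate <= 1.0:
        raise ValueError("ai_merge_rate must be in [0, 1]")
    return ai_merge_rate / golden_merge_rate


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the arithmetic.
    benchmark_pass_rate = 0.60    # share of PRs passing the automated grader
    maintainer_merge_rate = 0.36  # share of PRs maintainers would merge
    golden_merge_rate = 0.72      # share of human-written baseline PRs merged

    gap = percentage_point_gap(benchmark_pass_rate, maintainer_merge_rate)
    normalized = normalize_by_baseline(maintainer_merge_rate, golden_merge_rate)
    print(f"gap: {gap:.1f} percentage points")
    print(f"normalized merge rate: {normalized:.2f}")
```

The distinction matters when reading the headline numbers: a gap in percentage points is a difference of two rates, while the normalized figure is a ratio, so the two are not directly comparable.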
Table of contents
Introduction
Data and Methods
Results
Conclusion
A1. Conditional Maintainer Merge Rate
A2. Sample Representativeness
A3. False Negative Correction
A4. Raw (Unnormalized) Pass Rates
A5. Raw (Unnormalized) Progress-Based Pass Rates
A6. SOTA Models Only
A7. Results by Repository
A8. Ordering Effects
A9. Time Horizon Analysis