Many SWE-bench-Passing PRs Would Not Be Merged into Main

METR researchers had 4 active maintainers from 3 SWE-bench Verified repositories review 296 AI-generated pull requests to assess whether benchmark scores translate to real-world usefulness. The key finding: maintainer merge rates are on average 24 percentage points lower than the scores from SWE-bench's automated grader, meaning roughly half of test-passing PRs would not be merged. After normalizing for noise in maintainer decisions against a golden baseline of human-written PRs, AI agents score about 50% of what the benchmark suggests. Rejection reasons include code quality issues, breakage of unrelated code, and core functionality failures. The study also finds that maintainer-assessed quality may be improving more slowly than benchmark scores suggest, though this finding is weaker. Importantly, the authors note that this is not a fundamental capability ceiling: agents were not given the chance to iterate on feedback as human developers would. The conclusion is that SWE-bench scores should be treated as one signal among many, not as a direct proxy for real-world developer productivity.

#llm

#code-review

Mar 12 • 18m read time • From metr.org

Table of contents

Introduction
Data and Methods
Results
Conclusion
A1. Conditional Maintainer Merge Rate
A2. Sample Representativeness
A3. False Negative Correction
A4. Raw (Unnormalized) Pass Rates
A5. Raw (Unnormalized) Progress-Based Pass Rates
A6. SOTA Models Only
A7. Results by Repository
A8. Ordering Effects
A9. Time Horizon Analysis
