OpenAI Developers

The standard for evaluating frontier coding models is shifting. OpenAI says it has found evidence that SWE-bench Verified, previously a widely used benchmark, is now saturated because of test-design issues and contamination from public repositories. It now recommends reporting results on SWE-bench Pro instead, a more rigorous benchmark intended to better reflect real-world coding capability as models continue to mature.

The standard for frontier coding evals is changing with model maturity.

We now recommend reporting SWE-bench Pro and are sharing more detail on why we’re no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards.

SWE-bench Verified was a strong benchmark, but we’ve found evidence it is now saturated due to test-design issues and contamination from public repositories.
https://t.co/3GeAsnUHdC

Peter Steinberger 🦞

OpenAI is updating its recommended standard for frontier coding evaluations as models mature, shifting to SWE-bench Pro as the new benchmark for assessing coding model performance.