Peter Steinberger 🦞 reposted

OpenAI Developers @openaidevs

The standard for frontier coding evals is changing with model maturity. We now recommend reporting SWE-bench Pro and are sharing more detail on why we’re no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards. SWE-bench Verified was a strong benchmark, but we’ve found evidence it is now saturated due to test-design issues and contamination from public repositories. https://t.co/3GeAsnUHdC
