Big news today if you're into coding evals: SWE-Bench Verified is dead!! https://t.co/TKdjV4yc9U

i'm not sure if @HamelHusain is tired of me tagging him, but it turns out @OpenAI really did go back over their own 2024 work. When you 1) look at the CoT and 2) look at the evals themselves, they realized that at LEAST 16.4% of SWE-Bench Verified should technically be unsolvable...

...and also that ALL frontier models, including OpenAI's own, can still solve them through sheer contamination (including being able to recite the entire SWE-Bench problem setup and solution verbatim from the Task ID alone (!!!!)).

Heroic work from the OAI Evals team, and imo an important highlight of how fragile and messy Evals work is in general. OpenAI spent the money to do 3 independent reviews of each problem in 2024, and AT LEAST SIXTEEN PERCENT of the problems were still egregiously problematic (as shown in screenshots). For this 2026 audit they ran 6 independent reviews by software engineers, with ADDITIONAL verification of positive findings by a separate team, to arrive at today's conclusion.

If this happens to SWE-Bench Verified... what else is hiding in other benchmarks out there?
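(For the curious: the contamination check described above — prompting a model with nothing but a Task ID and seeing whether it spits back the benchmark text — can be approximated with a simple verbatim-overlap test. This is a hypothetical sketch, not OpenAI's actual methodology; the function name, threshold, and approach are all my own assumptions.)

```python
import difflib

def verbatim_overlap(response: str, reference: str, min_chars: int = 50) -> bool:
    """Crude contamination signal: does `response` reproduce a contiguous
    span of `reference` at least `min_chars` characters long?

    A model that has memorized a benchmark task will often emit long
    verbatim chunks of the problem statement; a clean model should not.
    """
    matcher = difflib.SequenceMatcher(None, response, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(response), 0, len(reference))
    return match.size >= min_chars

# Toy usage with made-up strings (not a real SWE-Bench task):
reference = ("The migration fails when the unique constraint is dropped "
             "before the index, raising IntegrityError on PostgreSQL only.")
leaked = "Sure! The task says: " + reference
clean = "I don't have enough information to describe that task."

print(verbatim_overlap(leaked, reference))  # long verbatim span -> flagged
print(verbatim_overlap(clean, reference))   # no long span -> not flagged
```

In practice you'd run this over many Task IDs and compare hit rates against a held-out control set, since short incidental overlaps are common in technical prose.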

Latent.Space @latentspacepod
🆕 The End of SWE-Bench Verified (2024-2026) https://t.co/c8rSvGyNuI

Today @OpenAIDevs is announcing the voluntary deprecation of SWE-Bench Verified! We're releasing a podcast + analysis in today's post.

Saturation of SWE-Bench has been a community hot topic for over a year. @jyangballin and @OfirPress argue that there is still room to grow, with 87.5-95% as the theoretical "ceiling". But new analysis from OpenAI has identified enough problems with the remaining unsolved tasks that it is no longer worth pursuing or publicizing SWE-Bench Verified numbers.

The most egregious is contamination: every single frontier model, including OpenAI's own, now demonstrates the ability to regurgitate SWE-Bench eval data and solutions, sometimes from as little as just the Task ID:

The other is simply bad tests! At least 60% of the remaining unsolved problems should be unsolvable given their problem descriptions... so if you can solve them, you are probably cheating. For example, SWE-Bench's test for pylint issue #4551:

Massive kudos to OpenAI for leading the way in both initiating and then sunsetting SWE-Bench Verified. End of an Era!