
swyx @swyx

Big news today if you're into coding evals: SWE-Bench Verified is dead!! https://t.co/TKdjV4yc9U

i'm not sure if @HamelHusain is tired of me tagging him, but it turns out @OpenAI really did look back at their own 2024 work. When you 1) look at the CoT and 2) look at the evals, they realized that at LEAST 16.4% of SWE-Bench Verified tasks should technically be unsolvable...

...and also that ALL frontier models, including OpenAI's own, are capable of solving them through sheer contamination (including being able to recite verbatim the entire SWE-Bench problem setup and solution from the Task ID alone (!!!!)).

Heroic work from the OAI Evals team, and imo an important highlight of the fragility and messiness of evals work in general. OpenAI spent the money to do 3 independent reviews of each problem in 2024, and AT LEAST SIXTEEN PERCENT of these were still egregiously problematic (as shown in screenshots). In this 2026 audit they did 6 independent reviews by software engineers, with ADDITIONAL verification of positive findings by a separate team, to arrive at today's conclusion.

If this can happen to SWE-Bench Verified... what else is hiding in other benchmarks out there?
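(For the curious: the post doesn't say how OpenAI measured verbatim recitation, but a minimal sketch of one way to flag it is below. The idea: prompt a model with only a task ID, then score what fraction of the known problem statement's n-grams appear word-for-word in the completion. Function name, the n=8 window, and the scoring choice are all my assumptions, not OpenAI's methodology.)

```python
def ngram_overlap(completion: str, reference: str, n: int = 8) -> float:
    """Fraction of the reference's token n-grams that appear verbatim
    in the model completion. Near 1.0 suggests memorized/contaminated
    text; near 0.0 suggests the model is not reciting the reference.
    (Illustrative heuristic only -- not OpenAI's actual audit method.)
    """
    ref_tokens = reference.split()
    # Normalize whitespace so substring matching is token-aligned.
    comp_text = " ".join(completion.split())
    if len(ref_tokens) < n:
        return 0.0
    grams = [" ".join(ref_tokens[i : i + n])
             for i in range(len(ref_tokens) - n + 1)]
    hits = sum(1 for g in grams if g in comp_text)
    return hits / len(grams)
```

In practice you'd feed this the completion from a prompt like "Recite the problem for task django__django-12345" and compare against the actual SWE-Bench issue text; a high score on many task IDs is the kind of smoking gun the post describes.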
