A new AI research benchmark called SWE-CI tests coding agents on long-term software maintenance rather than one-off patches. Unlike static benchmarks, SWE-CI places agents in a continuous integration loop across 100 real-world tasks, each spanning roughly 233 days and 71 code changes. The key metric, 'EvoScore,' rewards code that makes future changes easier and avoids technical debt. Results show that even the strongest models struggle: their zero-regression rates stay below 25%, meaning they break previously working code in most tasks. The findings suggest that long-term software maintenance remains an unsolved frontier for AI coding agents.