SWE-CI is a new repository-level benchmark that evaluates LLM-powered agents on long-term codebase maintainability rather than static, one-shot bug fixing. Unlike SWE-bench, which tests short-term functional correctness, SWE-CI uses a Continuous Integration loop to simulate real-world software evolution. The benchmark includes 100 tasks drawn from real repositories, each spanning on average 233 days and 71 consecutive commits, and requires agents to perform dozens of iterative analysis and coding rounds. The goal is to shift the focus of evaluation from isolated fixes to sustained code quality over time.
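
The idea of evaluating through a Continuous Integration loop can be illustrated with a toy sketch. This is not SWE-CI's actual harness: `Step`, `run_ci_loop`, `toy_agent`, `forgetful_agent`, the dictionary-based "codebase", and the per-step pass/fail scoring are all hypothetical, chosen only to show why a loop that keeps earlier checks active can catch regressions that a one-shot test would miss.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-ins: a "codebase" maps function names to callables,
# and a CI check is a predicate over the whole codebase.
Codebase = Dict[str, Callable[[int], int]]
Check = Callable[[Codebase], bool]


@dataclass
class Step:
    """One evolution step (e.g. one commit's worth of new requirements)."""
    name: str
    checks: List[Check]


def all_pass(checks: List[Check], codebase: Codebase) -> bool:
    """Return True only if every check passes; an exception counts as a failure."""
    for check in checks:
        try:
            if not check(codebase):
                return False
        except Exception:
            return False
    return True


def run_ci_loop(agent, steps: List[Step], max_rounds: int = 3) -> List[bool]:
    """Feed steps to the agent one at a time and record whether CI ends green.

    Checks from earlier steps stay active, so breaking old behavior
    turns later steps red even if the new feature works.
    """
    codebase: Codebase = {}
    active: List[Check] = []
    results: List[bool] = []
    for step in steps:
        active.extend(step.checks)
        for _ in range(max_rounds):
            if all_pass(active, codebase):
                break
            codebase = agent(codebase, step)  # one analysis-and-coding round
        results.append(all_pass(active, codebase))
    return results


def toy_agent(codebase: Codebase, step: Step) -> Codebase:
    """Adds the requested function while keeping everything already present."""
    updated = dict(codebase)
    if step.name == "add_double":
        updated["double"] = lambda x: 2 * x
    elif step.name == "add_square":
        updated["square"] = lambda x: x * x
    return updated


def forgetful_agent(codebase: Codebase, step: Step) -> Codebase:
    """Implements the new feature but discards older code (a regression)."""
    if step.name == "add_square":
        return {"square": lambda x: x * x}
    return toy_agent(codebase, step)


STEPS = [
    Step("add_double", [lambda cb: cb["double"](3) == 6]),
    Step("add_square", [lambda cb: cb["square"](4) == 16]),
]
```

In this toy setup, `run_ci_loop(toy_agent, STEPS)` stays green at both steps, while `run_ci_loop(forgetful_agent, STEPS)` goes red at the second step: the new feature works, but the check from the first step no longer passes.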

From arxiv.org