SWE-CI is a new repository-level benchmark designed to evaluate LLM-powered agents on long-term codebase maintainability rather than static, one-shot bug fixing. Unlike SWE-bench, which tests short-term functional correctness, SWE-CI uses a Continuous Integration loop to simulate real-world software evolution. The benchmark

From arxiv.org