SWE-CI is a new repository-level benchmark designed to evaluate LLM-powered agents on long-term codebase maintainability rather than static, one-shot bug fixing. Unlike SWE-bench, which tests short-term functional correctness, SWE-CI uses a Continuous Integration loop to simulate real-world software evolution. The benchmark