Abstract page for arXiv paper 2603.03823: SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

SWE-CI is a repository-level benchmark that evaluates LLM-powered agents on long-term codebase maintainability rather than static, one-shot bug fixing. Where SWE-bench tests short-term functional correctness, SWE-CI uses a continuous integration (CI) loop to simulate real-world software evolution. The benchmark comprises 100 tasks drawn from real repositories; each spans, on average, 233 days and 71 consecutive commits of history and requires the agent to work through dozens of iterative rounds of analysis and coding. The goal is to shift evaluation from isolated fixes toward sustained code quality over time.
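To make the contrast with one-shot evaluation concrete, here is a toy sketch of what a CI-style evaluation loop could look like. It is not SWE-CI's actual harness or metric: the names `Step`, `run_ci`, `evaluate`, and `toy_agent`, the dictionary model of a codebase, and the pass-rate score are all assumptions made for illustration. The one idea it tries to capture is that the agent's own edits are carried forward from step to step, so earlier choices affect later results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration only; not SWE-CI's real data model.
Codebase = Dict[str, str]                        # file path -> contents
Test = Callable[[Codebase], bool]                # True when the test passes
Agent = Callable[[Codebase, str, str], Codebase]


@dataclass
class Step:
    """One point in a repository's history: a requirement plus its CI tests."""
    description: str
    tests: List[Test]


def run_ci(codebase: Codebase, tests: List[Test]) -> float:
    """Return the fraction of tests that pass (1.0 when there are no tests)."""
    if not tests:
        return 1.0
    return sum(1 for test in tests if test(codebase)) / len(tests)


def evaluate(agent: Agent, codebase: Codebase, steps: List[Step],
             max_rounds: int = 3) -> List[float]:
    """Walk the steps in order, carrying the agent's edits forward."""
    history: List[float] = []
    for step in steps:
        rate = run_ci(codebase, step.tests)
        rounds = 0
        # Let the agent iterate on CI feedback until tests pass or rounds run out.
        while rate < 1.0 and rounds < max_rounds:
            feedback = f"{rate:.0%} of tests passing"
            codebase = agent(codebase, step.description, feedback)
            rate = run_ci(codebase, step.tests)
            rounds += 1
        history.append(rate)
    return history


def toy_agent(codebase: Codebase, description: str, feedback: str) -> Codebase:
    """Stand-in for an LLM agent: 'implements' a feature by adding a key."""
    updated = dict(codebase)
    updated[description] = "done"
    return updated


steps = [
    Step("feature_a", [lambda cb: "feature_a" in cb]),
    # The second step re-checks the first feature, acting as a regression test.
    Step("feature_b", [lambda cb: "feature_a" in cb,
                       lambda cb: "feature_b" in cb]),
]
history = evaluate(toy_agent, {}, steps)
print(history)
```

In this sketch the second step repeats the first step's check, so an agent that broke `feature_a` while adding `feature_b` would see its pass rate drop. A one-shot evaluation that scores each fix in isolation would not surface that kind of regression.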
