ITBench-AA is a new benchmark, developed by Artificial Analysis and IBM, for evaluating AI agents on enterprise IT and site reliability engineering (SRE) tasks. Every frontier model scores below 50%: Claude Opus 4.7 leads at 47%, followed by GPT-5.5 at 46%. The benchmark consists of 59 SRE tasks in which agents diagnose Kubernetes incidents by analyzing alerts, logs, traces, and topology data. A key finding is that more investigation turns do not improve accuracy: models with longer trajectories (e.g., Gemini 3.1 Pro Preview at 83 turns) score lower than models that finish in fewer turns. Open-weight models offer competitive cost-performance tradeoffs, with Gemma 4 31B scoring 37% at only $0.14 per task, while proprietary models cost up to $5.38 per task.
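To put the cost figures in perspective, here is a rough back-of-the-envelope sketch of what one full pass over the benchmark would cost at each quoted price. It assumes the per-task prices are averages that can simply be multiplied by the 59 tasks, which the summary does not state. The summary also does not name the model behind the $5.38 figure, so the label used for it here is a placeholder.

```python
# Figures quoted in the summary; everything else is an illustrative assumption.
NUM_TASKS = 59

COST_PER_TASK_USD = {
    "Gemma 4 31B": 0.14,
    # The summary does not name this model; the key is a placeholder label.
    "most expensive proprietary model": 5.38,
}


def full_run_cost(cost_per_task: float, num_tasks: int = NUM_TASKS) -> float:
    """Estimate the cost of one pass over all tasks.

    Assumes the quoted per-task cost is an average across tasks.
    """
    return round(cost_per_task * num_tasks, 2)


run_costs = {name: full_run_cost(cost) for name, cost in COST_PER_TASK_USD.items()}

cost_ratio = (
    COST_PER_TASK_USD["most expensive proprietary model"]
    / COST_PER_TASK_USD["Gemma 4 31B"]
)

for name, cost in run_costs.items():
    print(f"{name}: ${cost:.2f} per full run")
print(f"cost ratio: {cost_ratio:.1f}x")
```

Under these assumptions, a full run comes to about $8.26 for Gemma 4 31B versus about $317.42 at the top quoted price, roughly a 38x difference.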

4m read time · From huggingface.co
