IBM Research launches the Open Agent Leaderboard, an open benchmark that evaluates full AI agent systems rather than only the underlying models. It combines six established benchmarks (SWE-Bench Verified, BrowseComp+, AppWorld, and the tau2-Bench variants) under a unified protocol and reports both quality and cost per task. Key findings: agent architecture already affects results meaningfully, beyond the choice of model; general-purpose agents are competitive with specialized ones; and failed runs cost 20–54% more than successful ones. The accompanying Exgentic framework lets anyone reproduce or submit evaluations, and everything is open source from day one.
Table of contents
Can we measure generality?
What we built
How to read the leaderboard
What we're already learning
What's public today
What we want from the community
What's next
Closing
Related reading