IBM Research launches the Open Agent Leaderboard, an open benchmark that evaluates full AI agent systems rather than just the underlying models. It combines six established benchmarks (SWE-Bench Verified, BrowseComp+, AppWorld, and the tau2-Bench variants) under a unified protocol and reports both quality and cost per task. Key findings: agent architecture already affects results meaningfully, beyond the choice of model; general-purpose agents are competitive with specialized ones; and failed runs cost 20–54% more than successful ones. The accompanying Exgentic framework lets anyone reproduce or submit evaluations, and everything is open-sourced from day one.

9m read time, from huggingface.co
Table of contents
Can we measure generality?
What we built
How to read the leaderboard
What we're already learning
What's public today
What we want from the community
What's next
Closing
Related reading
