IBM Research introduces VAKRA, an executable benchmark for evaluating AI agent performance in enterprise-like environments. VAKRA tests agents across four capabilities: API chaining with business intelligence APIs, tool selection from large dashboard API sets, multi-hop reasoning, and multi-hop multi-source reasoning with policy adherence. The benchmark uses 8,000+ locally hosted APIs across 62 domains and evaluates full execution trajectories rather than just final answers. Error analysis reveals that current models—including GPT-OSS-120B, Gemini-3-flash-preview, and Claude-Sonnet-4-5—struggle significantly with compositional reasoning, correct argument naming, tool selection from large sets, and following tool-use policies. Performance degrades sharply as hop depth increases, and models often violate constraints or skip tool calls when answers seem retrievable from parametric knowledge.
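To make the trajectory-level evaluation idea concrete, here is a minimal sketch of what scoring a full sequence of tool calls (rather than only the final answer) can look like. This is not VAKRA's actual harness; the `ToolCall` structure, `trajectory_score` function, and tool names are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch of trajectory-level evaluation (not VAKRA's actual harness).
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str                     # name of the API / tool the agent invoked
    args: dict = field(default_factory=dict)  # keyword arguments passed to the tool


def trajectory_score(predicted: list[ToolCall], reference: list[ToolCall]) -> float:
    """Fraction of reference steps matched in order by tool name and argument names.

    Answer-only evaluation would ignore `predicted` entirely; trajectory-level
    evaluation penalizes skipped calls, wrong tool selection, and misnamed
    arguments even when the final answer happens to look correct.
    """
    if not reference:
        return 1.0
    matched, ri = 0, 0
    for call in predicted:
        if (
            ri < len(reference)
            and call.tool == reference[ri].tool
            and set(call.args) == set(reference[ri].args)
        ):
            matched += 1
            ri += 1
    return matched / len(reference)


# Example: a two-hop task where the agent skips the second tool call,
# e.g. answering from parametric knowledge instead of querying the API.
gold = [
    ToolCall("get_customer", {"customer_id": "C42"}),
    ToolCall("get_orders", {"customer_id": "C42"}),
]
pred = [ToolCall("get_customer", {"customer_id": "C42"})]
print(trajectory_score(pred, gold))  # 0.5
```

A scorer of this shape surfaces exactly the failure modes the error analysis describes: missing hops, wrong tools, and wrong argument names are all visible in the trajectory even when the final answer string matches.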
Table of contents
Task Description
Evaluation Framework
Error Analysis
Conclusion
Try VAKRA — Where Does Your Agent Break?