IBM Research introduces VAKRA, an executable benchmark for evaluating AI agent performance in enterprise-like environments. VAKRA tests agents across four capabilities: API chaining with business intelligence APIs, tool selection from large dashboard API sets, multi-hop reasoning, and multi-hop multi-source reasoning with policy adherence. The benchmark uses 8,000+ locally hosted APIs across 62 domains and evaluates full execution trajectories rather than just final answers. Error analysis reveals that current models—including GPT-OSS-120B, Gemini-3-flash-preview, and Claude-Sonnet-4-5—struggle significantly with compositional reasoning, correct argument naming, tool selection from large sets, and following tool-use policies. Performance degrades sharply as hop depth increases, and models often violate constraints or skip tool calls when answers seem retrievable from parametric knowledge.
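To make the trajectory-level evaluation idea concrete, here is a minimal sketch of what scoring a full sequence of tool calls (rather than only the final answer) can look like. This is not VAKRA's actual harness; the `ToolCall` structure, `trajectory_score` function, and tool names are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch of trajectory-level evaluation (not VAKRA's actual harness).
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str                     # name of the API / tool the agent invoked
    args: dict = field(default_factory=dict)  # keyword arguments passed to the tool


def trajectory_score(predicted: list[ToolCall], reference: list[ToolCall]) -> float:
    """Fraction of reference steps matched in order by tool name and argument names.

    Answer-only evaluation would ignore `predicted` entirely; trajectory-level
    evaluation penalizes skipped calls, wrong tool selection, and misnamed
    arguments even when the final answer happens to look correct.
    """
    if not reference:
        return 1.0
    matched, ri = 0, 0
    for call in predicted:
        if (
            ri < len(reference)
            and call.tool == reference[ri].tool
            and set(call.args) == set(reference[ri].args)
        ):
            matched += 1
            ri += 1
    return matched / len(reference)


# Example: a two-hop task where the agent skips the second tool call,
# e.g. answering from parametric knowledge instead of querying the API.
gold = [
    ToolCall("get_customer", {"customer_id": "C42"}),
    ToolCall("get_orders", {"customer_id": "C42"}),
]
pred = [ToolCall("get_customer", {"customer_id": "C42"})]
print(trajectory_score(pred, gold))  # 0.5
```

A scorer of this shape surfaces exactly the failure modes the error analysis describes: missing hops, wrong tools, and wrong argument names are all visible in the trajectory even when the final answer string matches.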
Table of contents
Task Description
Evaluation Framework
Error Analysis
Conclusion
Try VAKRA — Where Does Your Agent Break?