OfficeQA is a new open-source benchmark designed to evaluate AI agents on grounded reasoning tasks using real-world enterprise data. Built on 89,000 pages of U.S. Treasury Bulletins, it tests agents' ability to extract information from complex documents, perform retrieval across large corpora, and answer questions requiring analytical reasoning. Current frontier models (GPT-5.1, Claude Opus 4.5) achieve less than 45% accuracy on the full benchmark and under 25% on the hardest questions, even with advanced document parsing. The benchmark includes 246 questions requiring high precision, with human solvers averaging 50 minutes per question. Databricks is launching the Grounded Reasoning Cup in Spring 2026, where AI agents will compete against human teams.

14 min read time. From databricks.com
Table of contents
Dataset Desiderata
Introducing the OfficeQA Benchmark
Baseline Agents: Implementation and Performance
Failure Modes
Databricks Grounded Reasoning Cup
Conclusion
