OfficeQA is Databricks’ new open benchmark for grounded reasoning on real-world enterprise data, built on U.S. Treasury Bulletins to test modern AI agents.

OfficeQA is a new open-source benchmark for evaluating AI agents on grounded reasoning tasks over real-world enterprise data. Built on 89,000 pages of U.S. Treasury Bulletins, it tests an agent's ability to extract information from complex documents, retrieve across a large corpus, and answer questions that require analytical reasoning. Current frontier models (GPT-5.1, Claude Opus 4.5) score below 45% accuracy on the full benchmark and below 25% on the hardest questions, even with advanced document parsing. The benchmark contains 246 questions that demand high-precision answers, and human solvers average 50 minutes per question. Databricks is launching the Grounded Reasoning Cup in Spring 2026, where AI agents will compete against human teams.

Introducing OfficeQA: A Benchmark for End-to-End Grounded Reasoning

Baseline Agents: Implementation and Performance