Grafana Labs has open sourced o11y-bench, a benchmark for evaluating AI agents on observability workflows. Built on the Harbor framework, it runs agents against a real Grafana stack with synthetic metrics, logs, and traces, then grades them on 63 tasks spanning PromQL, LogQL, and TraceQL queries, multi-step incident investigations, and dashboard editing. The benchmark reports two headline metrics, Pass^3 (consistency across three runs) and Pass@3 (best-of-three success), prioritizing reliability over one-off success. Initial results across 29 model variants showed Claude Opus 4.7 (reasoning off) leading on consistency, with Qwen 3.6 Plus as the top open-source model. Dashboarding tasks proved hardest because they combine requirements for dashboard state, query correctness, and variable wiring. The project accepts community contributions to its HuggingFace leaderboard.
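The two headline metrics are defined only informally above, so here is a minimal sketch of how they could be computed from per-task run outcomes. The task names, results data, and helper functions are hypothetical illustrations, not part of o11y-bench or Harbor: Pass^3 counts a task as passed only if all three independent runs succeed, while Pass@3 counts it as passed if any run succeeds.

```python
from statistics import mean

def pass_cubed(runs: list[bool]) -> bool:
    """Pass^3: the agent must succeed on all three independent runs."""
    return all(runs)

def pass_at_3(runs: list[bool]) -> bool:
    """Pass@3: the agent succeeds if at least one of the three runs passes."""
    return any(runs)

# Hypothetical per-task outcomes: three independent runs per task.
results = {
    "promql_rate_query":      [True, True, True],
    "logql_error_filter":     [True, False, True],
    "dashboard_variable_fix": [False, False, True],
}

# Aggregate each metric as the fraction of tasks passed.
print(f"Pass^3: {mean(pass_cubed(r) for r in results.values()):.2f}")  # 0.33
print(f"Pass@3: {mean(pass_at_3(r) for r in results.values()):.2f}")   # 1.00
```

The gap between the two numbers is the point of the design: an agent that flakes on two of three attempts still scores perfectly on Pass@3 but poorly on Pass^3, which is why the benchmark treats Pass^3 as its reliability signal.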
Table of contents
- What tasks o11y-bench tests
- Why verifying observability work is hard
- Measuring reliability vs. best-of-three success