Grafana Labs has open sourced o11y-bench, a benchmark for evaluating AI agents on observability workflows. Built on the Harbor framework, it runs agents against a real Grafana stack with synthetic metrics, logs, and traces, then grades them on 63 tasks spanning PromQL, LogQL, and TraceQL queries, multi-step incident investigations, and dashboard editing. The benchmark reports two headline metrics, Pass^3 (consistency across three runs) and Pass@3 (best-of-three success), prioritizing reliability over one-off success. Initial results across 29 model variants showed Claude Opus 4.7 (reasoning off) leading on consistency, with Qwen 3.6 Plus as the top open-source model. Dashboarding tasks proved hardest because they combine requirements for dashboard state, query correctness, and variable wiring. The project accepts community contributions to its HuggingFace leaderboard.
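The two headline metrics are defined only informally above, so here is a minimal sketch of how they could be computed from per-task run outcomes. The task names, results data, and helper functions are hypothetical illustrations, not part of o11y-bench or Harbor: Pass^3 counts a task as passed only if all three independent runs succeed, while Pass@3 counts it as passed if any run succeeds.

```python
from statistics import mean

def pass_cubed(runs: list[bool]) -> bool:
    """Pass^3: the agent must succeed on all three independent runs."""
    return all(runs)

def pass_at_3(runs: list[bool]) -> bool:
    """Pass@3: the agent succeeds if at least one of the three runs passes."""
    return any(runs)

# Hypothetical per-task outcomes: three independent runs per task.
results = {
    "promql_rate_query":      [True, True, True],
    "logql_error_filter":     [True, False, True],
    "dashboard_variable_fix": [False, False, True],
}

# Aggregate each metric as the fraction of tasks passed.
print(f"Pass^3: {mean(pass_cubed(r) for r in results.values()):.2f}")  # 0.33
print(f"Pass@3: {mean(pass_at_3(r) for r in results.values()):.2f}")   # 1.00
```

The gap between the two numbers is the point of the design: an agent that flakes on two of three attempts still scores perfectly on Pass@3 but poorly on Pass^3, which is why the benchmark treats Pass^3 as its reliability signal.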
Table of contents
- What tasks o11y-bench tests
- Why verifying observability work is hard
- Measuring reliability vs. best-of-three success