ARFBench is a new time series question-answering benchmark derived from real production incidents at Datadog, consisting of 750 QA pairs across 63 incidents. It evaluates how well LLMs, vision-language models (VLMs), and time series foundation models (TSFMs) can perform observability tasks like anomaly identification and root cause localization. Key findings: GPT-5 leads existing models at 62.7% accuracy but still underperforms human domain experts; a new hybrid model combining Datadog's Toto TSFM with Qwen3-VL achieves comparable performance to frontier models with far fewer parameters; and a model-expert oracle combining AI and human judgment reaches 87.2% accuracy, establishing a new superhuman frontier. The benchmark and model weights are publicly available on Hugging Face.
Table of contents
ARFBench: Using real-world incident data to create a TSQA benchmark
Leading LLMs, VLMs, and TSFMs have substantial room for improvement
Hybrid TSFM-VLM models show promise for specialized TSQA modeling
Domain experts complemented with models set a new superhuman frontier
What’s next: time series reasoning as a core component of agents
Getting started with ARFBench