Introducing ARFBench: A time series question-answering benchmark based on real incidents

ARFBench is a new time series question-answering (TSQA) benchmark derived from real production incidents at Datadog, consisting of 750 QA pairs across 63 incidents. It evaluates how well LLMs, vision-language models (VLMs), and time series foundation models (TSFMs) perform observability tasks such as anomaly identification and root cause localization. Key findings: GPT-5 leads existing models at 62.7% accuracy but still underperforms human domain experts; a new hybrid model combining Datadog's Toto TSFM with Qwen3-VL performs comparably to frontier models with far fewer parameters; and a model-expert oracle combining AI and human judgment reaches 87.2% accuracy, establishing a new superhuman frontier. The benchmark and model weights are publicly available on Hugging Face.

#machine-learning

#data-science

#llm

Apr 27 • 7m read time • From blog.ml.cmu.edu

Table of contents

ARFBench: Using real-world incident data to create a TSQA benchmark
Leading LLMs, VLMs, and TSFMs have substantial room for improvement
Hybrid TSFM-VLM models show promise for specialized TSQA modeling
Domain experts complemented with models set a new superhuman frontier
What’s next: time series reasoning as a core component of agents
Getting started with ARFBench
