Datadog's Cloud Cost Management team built a natural language query agent that converts plain-English questions into metrics queries. They reduced debugging time from hours to minutes by creating a ground truth dataset from real user testing, implementing component-level evaluators (parsing, metric selection, roll-up,
Table of contents
Building the ground truth dataset from user testing
Evaluating AI that doesn’t fail cleanly
Deconstructing correctness with evaluators
Debugging with agentic tracing
Automating scaled experimentation with every build
Debugging and iterating faster with evaluator-driven tracing
Build evaluation and tracing into your agent loop