Databricks engineers describe coSTAR, a methodology for testing and iteratively improving AI agents using MLflow. The framework draws a direct analogy to traditional software testing: scenario definitions act as test fixtures, MLflow traces serve as the test harness, LLM-based judges replace assertions, and a coding assistant drives the refine loop. A key innovation is the dual-loop design — one loop aligns judges against human expert golden sets, and a second loop uses those trusted judges to automatically refine the agent. The post covers agentic judges (judges that use tools to inspect traces selectively rather than consuming full traces), deterministic checks, operational metrics, regression testing for MCP tool dependencies, and reusing the same judges for production monitoring. Known limitations include manual scenario generation, judge alignment cost, overfitting risk, and multi-step failure attribution.
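
To make the testing analogy concrete, here is a minimal, hypothetical sketch of how such a loop could be wired together in Python with MLflow tracing. The scenario schema and the helpers `run_agent` and `llm_judge` are illustrative assumptions, not code from the post; only the `mlflow.trace` decorator is a real MLflow API.

```python
# Hypothetical sketch of a STAR-style eval loop (Scenario, Trace, Assess, Refine).
# run_agent, llm_judge, and the scenario schema are illustrative assumptions.
import mlflow

# S - Scenario definitions act like test fixtures: each pins down an input
# and the properties a judge should check for.
SCENARIOS = [
    {
        "name": "lookup_order_status",
        "prompt": "Where is my order #1234?",
        "expectations": "The agent must call the order-lookup tool and cite the returned status.",
    },
]

# T - Trace capture: MLflow's tracing decorator records each agent call
# (spans, tool invocations, LLM calls) so judges can inspect it afterwards.
@mlflow.trace
def run_agent(prompt: str) -> str:
    # Placeholder for the agent under test.
    return "Your order #1234 shipped yesterday."

def llm_judge(prompt: str, output: str, expectations: str) -> tuple[bool, str]:
    """A - Assess: an LLM-based judge replaces a hard-coded assertion.
    Stubbed here; in practice this calls a judge model that has been
    aligned against a human expert golden set. Returns (passed, rationale)."""
    return True, "Order-lookup tool was called and the status was cited."

def evaluate() -> float:
    """Run every scenario, judge the output, and report a pass rate."""
    passed = 0
    for scenario in SCENARIOS:
        output = run_agent(scenario["prompt"])
        ok, rationale = llm_judge(scenario["prompt"], output, scenario["expectations"])
        passed += ok
        print(f"{scenario['name']}: {'PASS' if ok else 'FAIL'} - {rationale}")
    return passed / len(SCENARIOS)

# R - Refine: a coding assistant reads the failing rationales, edits the
# agent, and the suite is re-run, mirroring a red-green test loop.
if __name__ == "__main__":
    print(f"pass rate: {evaluate():.0%}")
```

In this sketch the judge is a stub; the post's dual-loop design would first align judges like `llm_judge` against human-labeled golden sets before trusting them to drive automatic refinement.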

19 min read · From databricks.com
Table of contents
- The Problem: Coding Without Tests
- The Analogy That Guides Our Approach
- S - Scenario Definitions
- T - Trace Capture
- A - Assess with Judges
- R - Refine
- Regression Tests for Infrastructure, Not Just the Agent
- From Eval to Production Monitoring
- Where We Are Now
- What Doesn't Work (Yet)
- Key Takeaways
