Databricks engineers describe coSTAR, a methodology for testing and iteratively improving AI agents using MLflow. The framework draws a direct analogy to traditional software testing: scenario definitions act as test fixtures, MLflow traces serve as the test harness, LLM-based judges replace assertions, and a coding assistant drives the refine loop. A key innovation is the dual-loop design — one loop aligns judges against human expert golden sets, and a second loop uses those trusted judges to automatically refine the agent. The post covers agentic judges (judges that use tools to inspect traces selectively rather than consuming full traces), deterministic checks, operational metrics, regression testing for MCP tool dependencies, and reusing the same judges for production monitoring. Known limitations include manual scenario generation, judge alignment cost, overfitting risk, and multi-step failure attribution.
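
To make the testing analogy concrete, here is a minimal, hypothetical sketch of how such a loop could be wired together in Python with MLflow tracing. The scenario schema and the helpers `run_agent` and `llm_judge` are illustrative assumptions, not code from the post; only the `mlflow.trace` decorator is a real MLflow API.

```python
# Hypothetical sketch of a STAR-style eval loop (Scenario, Trace, Assess, Refine).
# run_agent, llm_judge, and the scenario schema are illustrative assumptions.
import mlflow

# S - Scenario definitions act like test fixtures: each pins down an input
# and the properties a judge should check for.
SCENARIOS = [
    {
        "name": "lookup_order_status",
        "prompt": "Where is my order #1234?",
        "expectations": "The agent must call the order-lookup tool and cite the returned status.",
    },
]

# T - Trace capture: MLflow's tracing decorator records each agent call
# (spans, tool invocations, LLM calls) so judges can inspect it afterwards.
@mlflow.trace
def run_agent(prompt: str) -> str:
    # Placeholder for the agent under test.
    return "Your order #1234 shipped yesterday."

def llm_judge(prompt: str, output: str, expectations: str) -> tuple[bool, str]:
    """A - Assess: an LLM-based judge replaces a hard-coded assertion.
    Stubbed here; in practice this calls a judge model that has been
    aligned against a human expert golden set. Returns (passed, rationale)."""
    return True, "Order-lookup tool was called and the status was cited."

def evaluate() -> float:
    """Run every scenario, judge the output, and report a pass rate."""
    passed = 0
    for scenario in SCENARIOS:
        output = run_agent(scenario["prompt"])
        ok, rationale = llm_judge(scenario["prompt"], output, scenario["expectations"])
        passed += ok
        print(f"{scenario['name']}: {'PASS' if ok else 'FAIL'} - {rationale}")
    return passed / len(SCENARIOS)

# R - Refine: a coding assistant reads the failing rationales, edits the
# agent, and the suite is re-run, mirroring a red-green test loop.
if __name__ == "__main__":
    print(f"pass rate: {evaluate():.0%}")
```

In this sketch the judge is a stub; the post's dual-loop design would first align judges like `llm_judge` against human-labeled golden sets before trusting them to drive automatic refinement.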

19 min read · From databricks.com
Table of contents
- The Problem: Coding Without Tests
- The Analogy That Guides Our Approach
- S - Scenario Definitions
- T - Trace Capture
- A - Assess with Judges
- R - Refine
- Regression Tests for Infrastructure, Not Just the Agent
- From Eval to Production Monitoring
- Where We Are Now
- What Doesn't Work (Yet)
- Key Takeaways
