MLflow introduces three new capabilities for evaluating AI agents: Tunable Judges for creating custom LLM evaluators using natural language instructions, Agent-as-a-Judge for automatically identifying relevant trace data without manual parsing, and Judge Builder for visual judge management with domain expert feedback. These tools enable teams to build domain-specific evaluation criteria, align judges with human feedback through continuous tuning, and scale quality assessment from prototype to production. The make_judge SDK simplifies creating custom judges, while alignment optimization incorporates subject matter expert feedback to improve evaluation accuracy over time.
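The core idea behind a tunable judge — turning natural-language instructions with template slots into a reusable evaluator — can be sketched in plain Python. This is a hypothetical illustration of the pattern only, not the MLflow `make_judge` implementation; all helper names here are invented for the sketch.

```python
# Hypothetical sketch of a tunable judge: natural-language instructions
# with {{ inputs }} / {{ outputs }} placeholders are rendered into an
# evaluation prompt. (Illustrative only; not the MLflow SDK.)

def render_judge_prompt(instructions: str, inputs: str, outputs: str) -> str:
    """Fill the placeholder slots in the judge's instructions."""
    return (instructions
            .replace("{{ inputs }}", inputs)
            .replace("{{ outputs }}", outputs))

def make_judge_sketch(name: str, instructions: str):
    """Return a callable evaluator bound to the given instructions."""
    def judge(inputs: str, outputs: str) -> str:
        prompt = render_judge_prompt(instructions, inputs, outputs)
        # In a real system this prompt would be sent to an LLM and the
        # response parsed into a pass/fail or scored assessment; here we
        # just return the rendered prompt.
        return prompt
    judge.__name__ = name
    return judge

relevance = make_judge_sketch(
    "relevance",
    "Given the question {{ inputs }}, decide whether the answer "
    "{{ outputs }} is relevant. Reply 'pass' or 'fail'.",
)

print(relevance("What is MLflow?", "MLflow is an ML lifecycle platform."))
```

In the real SDK, alignment optimization would then adjust these instructions against subject matter expert labels; the sketch shows only the instruction-to-evaluator step.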