MLflow 3.9.0 integrates over 50 evaluation metrics from the DeepEval, RAGAS, and Arize Phoenix frameworks into a unified API. With this integration, developers can evaluate LLM agents and RAG systems with multiple judge frameworks at once, compare results side by side in the MLflow UI, and use specialized metrics for conversational agents, retrieval quality, hallucination detection, and safety. The unified interface removes the need for custom wrappers and provides visualization, filtering, and iteration tools for improving agent quality before production deployment.
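To give a flavor of the unified interface, here is a minimal sketch that runs `mlflow.genai.evaluate()` with two of MLflow's built-in scorers on a toy dataset. The dataset contents are illustrative, and the sketch assumes the integrated DeepEval, RAGAS, and Phoenix judges slot into the same `scorers` list; the DeepEval walkthrough later in this post shows the actual imports for those judges.

```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# A toy evaluation dataset: each record pairs the agent's inputs with its outputs.
eval_data = [
    {
        "inputs": {"question": "What is MLflow Tracking?"},
        "outputs": "MLflow Tracking is an API and UI for logging parameters, metrics, and artifacts.",
    },
]

# mlflow.genai.evaluate() takes a list of scorers and logs per-record results
# to the MLflow UI. The assumption here is that the DeepEval, RAGAS, and
# Phoenix judges added in 3.9.0 plug into this same list alongside the
# built-in scorers (see the example section below for the exact imports).
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[RelevanceToQuery(), Safety()],
)
```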
Table of contents

- Challenges
- What are DeepEval, RAGAS, and Phoenix?
- How can I use DeepEval, RAGAS, and Phoenix in MLflow?
- What Judge Should I Choose?
- Example: Evaluate Multi-turn Conversations with DeepEval
- What's Next
- Resources and References