MLflow 3.9.0 integrates over 50 evaluation metrics from the DeepEval, RAGAS, and Arize Phoenix frameworks into a unified API. With this integration, developers can evaluate LLM agents and RAG systems with multiple judge frameworks at once, compare results side by side in the MLflow UI, and use specialized metrics for conversational agents, retrieval quality, hallucination detection, and safety. The unified interface removes the need for custom wrappers and provides visualization, filtering, and iteration tools for improving agent quality before production deployment.
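To give a flavor of the unified interface, here is a minimal sketch that runs `mlflow.genai.evaluate()` with two of MLflow's built-in scorers on a toy dataset. The dataset contents are illustrative, and the sketch assumes the integrated DeepEval, RAGAS, and Phoenix judges slot into the same `scorers` list; the DeepEval walkthrough later in this post shows the actual imports for those judges.

```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# A toy evaluation dataset: each record pairs the agent's inputs with its outputs.
eval_data = [
    {
        "inputs": {"question": "What is MLflow Tracking?"},
        "outputs": "MLflow Tracking is an API and UI for logging parameters, metrics, and artifacts.",
    },
]

# mlflow.genai.evaluate() takes a list of scorers and logs per-record results
# to the MLflow UI. The assumption here is that the DeepEval, RAGAS, and
# Phoenix judges added in 3.9.0 plug into this same list alongside the
# built-in scorers (see the example section below for the exact imports).
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[RelevanceToQuery(), Safety()],
)
```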
Table of contents

- Challenges
- What are DeepEval, RAGAS, and Phoenix?
- How can I use DeepEval, RAGAS, and Phoenix in MLflow?
- What Judge Should I Choose?
- Example: Evaluate Multi-turn Conversations with DeepEval
- What's Next
- Resources and References