MLflow 3.9.0 integrates over 50 evaluation metrics from the DeepEval, RAGAS, and Arize Phoenix frameworks into a unified API. With this integration, developers can evaluate LLM agents and RAG systems using multiple judge frameworks simultaneously, compare results side by side in the MLflow UI, and access each framework's specialized metrics.
Table of contents
- Challenges
- What are DeepEval, RAGAS, and Phoenix?
- How can I use DeepEval, RAGAS, and Phoenix in MLflow?
- What Judge Should I Choose?
- Example: Evaluate Multi-turn Conversations with DeepEval
- What's Next
- Resources and References