MLflow 3.9.0 integrates over 50 evaluation metrics from the DeepEval, RAGAS, and Arize Phoenix frameworks into a unified API. This integration enables developers to evaluate LLM agents and RAG systems using multiple judge frameworks simultaneously, compare results side-by-side in the MLflow UI, and access specialized metrics for conversational agents, retrieval quality, hallucination detection, and safety. The unified interface eliminates the need for custom wrappers and provides visualization, filtering, and iteration tools for improving agent quality before production deployment.

6 min read · From mlflow.org
Table of contents

- Challenges
- What are DeepEval, RAGAS, and Phoenix?
- How can I use DeepEval, RAGAS, and Phoenix in MLflow?
- What Judge Should I Choose?
- Example: Evaluate Multi-turn Conversations with DeepEval
- What's Next
- Resources and References
