MLflow 3.10 introduces multi-turn evaluation and conversation simulation for chatbots and AI agents. The release adds built-in session-level scorers like ConversationCompleteness and UserFrustration that assess entire conversations rather than individual responses. A ConversationSimulator lets developers define persona-based test scenarios with goals and guidelines, generate reproducible multi-turn conversations, and automatically extract test cases from production traces. Scorers can run on-demand against existing sessions or be registered to evaluate new sessions automatically. The workflow enables A/B comparison of agent versions—demonstrated by a prompt improvement that boosted completeness 50% and cut frustration 75%.

6m read time · From mlflow.org
Table of contents
What is User Simulation for Multi-turn Conversations?
The Setup
Scoring Existing Sessions
Scaling Multi-turn Agent Evaluation with Simulation
What's Next
Resources and References
