MLflow 3.10 introduces multi-turn evaluation and conversation simulation for chatbots and AI agents. The release adds built-in session-level scorers, such as ConversationCompleteness and UserFrustration, that assess entire conversations rather than individual responses. A ConversationSimulator lets developers define persona-based test scenarios with goals and guidelines, generate reproducible multi-turn conversations, and automatically extract test cases from production traces. Scorers can run on demand against existing sessions or be registered to evaluate new sessions automatically. The workflow enables A/B comparison of agent versions, demonstrated by a prompt improvement that raised completeness by 50% and cut frustration by 75%.
Table of contents
What is User Simulation for Multi-turn Conversations?
The Setup
Scoring Existing Sessions
Scaling Multi-turn Agent Evaluation with Simulation
What's Next
Resources and References