MLflow 3.10 introduces multi-turn evaluation and conversation simulation so you can score entire conversations, test agent changes with reproducible scenarios, and catch failures that only surface across turns.

MLflow 3.10 introduces multi-turn evaluation and conversation simulation for chatbots and AI agents. The release adds built-in session-level scorers, such as ConversationCompleteness and UserFrustration, that assess an entire conversation rather than individual responses. A ConversationSimulator lets developers define persona-based test scenarios with goals and guidelines, generate reproducible multi-turn conversations, and automatically extract test cases from production traces. Scorers can run on demand against existing sessions or be registered to evaluate new sessions automatically. The workflow enables A/B comparison of agent versions, demonstrated by a prompt improvement that raised completeness by 50% and cut frustration by 75%.
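To make the idea of session-level scoring and A/B comparison concrete, here is a minimal plain-Python sketch. It does not use MLflow, and it is not how ConversationCompleteness or UserFrustration are implemented: the `user_frustration` heuristic, the helper names, and the toy sessions are all hypothetical, chosen only to show a signal that appears across turns and how a relative change between two agent versions could be computed.

```python
from statistics import mean

# A session is an ordered list of (role, text) turns.
# Every name in this sketch is hypothetical; none of it is MLflow API.
Session = list[tuple[str, str]]

FRUSTRATION_MARKERS = ("already told you", "not what i asked")


def user_frustration(session: Session) -> float:
    """Toy session-level scorer: 1.0 if any user turn shows a frustration marker."""
    user_text = " ".join(text.lower() for role, text in session if role == "user")
    return 1.0 if any(marker in user_text for marker in FRUSTRATION_MARKERS) else 0.0


def score_sessions(sessions: list[Session], scorer) -> float:
    """Average a session-level scorer over a set of conversations."""
    return mean(scorer(session) for session in sessions)


def relative_change(baseline: float, candidate: float) -> float:
    """Relative change from baseline to candidate (-0.5 means a 50% reduction)."""
    if baseline == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (candidate - baseline) / baseline


# Made-up conversations for two agent versions (illustrative data only).
baseline_sessions = [
    [("user", "Reset my password, my email is a@example.com"),
     ("assistant", "Sure. What is your email?"),
     ("user", "I already told you my email.")],
    [("user", "Cancel my order"), ("assistant", "Done.")],
]
candidate_sessions = [
    [("user", "Reset my password, my email is a@example.com"),
     ("assistant", "A reset link was sent to a@example.com."),
     ("user", "Thanks")],
    [("user", "Cancel my order"), ("assistant", "Done.")],
]

baseline_score = score_sessions(baseline_sessions, user_frustration)
candidate_score = score_sessions(candidate_sessions, user_frustration)
change = relative_change(baseline_score, candidate_score)
print(f"frustration: {baseline_score:.2f} -> {candidate_score:.2f} ({change:+.0%})")
```

In the first baseline session, each assistant reply looks reasonable in isolation; the problem only shows up when the third turn is read in the context of the first. A scorer that sees one response at a time would miss it.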

Multi-turn Evaluation & Simulation: Enhancing AI Observability with MLflow for Chatbots

What is User Simulation for Multi-turn Conversations?

Scaling Multi-turn Agent Evaluation with Simulation