MLflow now integrates TruLens scorers, bringing the Agent GPA (Goal-Plan-Action) framework to agent trace evaluation via mlflow.genai.evaluate(). The integration adds 10 scorers that analyze the full span tree of an agent's execution—covering plan quality, tool selection, plan adherence, tool calling validity, logical consistency, and execution efficiency—rather than just evaluating final outputs. On the TRAIL benchmark, GPA judges identify 95% of human-labeled agent errors versus 55% for baseline trace-aware judges. The scorers can be mixed with RAG metrics and other third-party frameworks in a single evaluation call, and support multiple LLM providers via LiteLLM.
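To make the difference between output evaluation and trace evaluation concrete, here is a minimal toy sketch in plain Python. It does not use the MLflow or TruLens APIs: the `Span` class, `output_only_check`, and `redundant_tool_calls` are hypothetical names invented for illustration, and the repeated-call check is only a crude stand-in for the kind of execution-efficiency signal a trace-aware scorer might look for.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, simplified span: one step in an agent's execution."""
    name: str
    kind: str  # "agent", "tool", or "llm" (illustrative categories only)
    inputs: str = ""
    children: list["Span"] = field(default_factory=list)


def walk(span):
    """Yield the span and all of its descendants in depth-first order."""
    yield span
    for child in span.children:
        yield from walk(child)


def output_only_check(final_answer: str) -> bool:
    """Output-level check: sees only the final answer, not how it was produced."""
    return bool(final_answer.strip())


def redundant_tool_calls(root: Span) -> list[str]:
    """Trace-level check: report tool calls repeated with identical inputs."""
    seen = set()
    repeats = []
    for span in walk(root):
        if span.kind != "tool":
            continue
        key = (span.name, span.inputs)
        if key in seen:
            repeats.append(span.name)
        seen.add(key)
    return repeats


# A made-up trace in which the agent runs the same search twice.
trace = Span("agent_run", "agent", children=[
    Span("plan", "llm"),
    Span("search", "tool", inputs="q=mlflow"),
    Span("search", "tool", inputs="q=mlflow"),
    Span("answer", "llm"),
])

print(output_only_check("MLflow is an open source platform."))
print(redundant_tool_calls(trace))
```

In this toy trace the output-level check passes because the final answer is non-empty, while the trace-level check flags the duplicated `search` call. The actual GPA scorers are LLM judges rather than rule-based checks like this one, so treat the sketch as an analogy for where the extra signal comes from, not as a description of their implementation.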
Table of contents
The Agent GPA Framework
How Trace Evaluation Catches What Output Evaluation Misses
Combining Agent and RAG Evaluation
Getting Started
Resources