Score agent plans, tool calls, and reasoning with the TruLens Agent GPA framework through mlflow.genai.evaluate().

MLflow now integrates TruLens scorers, bringing the Agent GPA (Goal-Plan-Action) framework to agent trace evaluation via mlflow.genai.evaluate(). The integration adds 10 scorers that analyze the full span tree of an agent's execution rather than only its final output. They cover plan quality, tool selection, plan adherence, tool-call validity, logical consistency, and execution efficiency. On the TRAIL benchmark, GPA judges identify 95% of human-labeled agent errors, versus 55% for baseline trace-aware judges. The scorers can be combined with RAG metrics and scorers from other third-party frameworks in a single evaluation call, and they support multiple LLM providers via LiteLLM.

Agent Trace Evaluation with TruLens Scorers in MLflow

How Trace Evaluation Catches What Output Evaluation Misses
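
To make the distinction concrete, here is a toy sketch of an output-only check next to two trace-level checks. It is an illustration of the idea only, not the MLflow or TruLens API: the Span class, the helper functions, and the rule-based checks are all hypothetical stand-ins, and real GPA scorers use LLM judges over MLflow traces rather than hand-written rules.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One node in a toy trace tree (hypothetical; not MLflow's span type)."""
    name: str
    kind: str  # "agent", "plan", or "tool"
    inputs: dict = field(default_factory=dict)  # values must be hashable
    children: list["Span"] = field(default_factory=list)


def walk(span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def output_check(final_answer: str, expected: str) -> bool:
    """Output-only evaluation: sees nothing but the final answer."""
    return final_answer.strip() == expected.strip()


def trace_checks(root: Span, planned_tools: list[str]) -> list[str]:
    """Trace-level evaluation: flag problems visible only in the span tree."""
    issues = []
    tool_calls = [s for s in walk(root) if s.kind == "tool"]

    # Rough analogue of plan adherence: every tool call should be in the plan.
    for call in tool_calls:
        if call.name not in planned_tools:
            issues.append(f"unplanned tool call: {call.name}")

    # Rough analogue of execution efficiency: identical repeated calls are waste.
    seen = set()
    for call in tool_calls:
        key = (call.name, tuple(sorted(call.inputs.items())))
        if key in seen:
            issues.append(f"duplicate tool call: {call.name}")
        seen.add(key)
    return issues


# A run whose final answer is correct but whose execution is not.
trace = Span("agent", "agent", children=[
    Span("plan", "plan"),
    Span("search", "tool", {"query": "MLflow"}),
    Span("search", "tool", {"query": "MLflow"}),
    Span("send_email", "tool", {"to": "team"}),
])

print(output_check("42", "42"))
print(trace_checks(trace, planned_tools=["search"]))
```

In this run the output check passes, while the trace checks report an unplanned send_email call and a duplicated search call. That gap between a correct answer and a flawed execution is what span-tree scorers are meant to surface.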
