A methodology for testing and automatically refining Claude Code skills (SKILL.md files) using MLflow tracing and LLM-based judges. The approach traces every tool call Claude makes during skill execution, then runs judges — both LLM-based and rule-based — to verify correct behavior. When judges fail, the failing trace and rationale are fed back to Claude Code, which edits the skill file itself. Two real examples from the agent-evaluation skill illustrate how this loop caught Claude bypassing MLflow APIs entirely and a missing skill dependency in the description field. Key lessons: write judges before polishing the skill, use both judge types, and ensure judge rationale is detailed enough for automated refinement.
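The rule-based half of this loop can be sketched in a few lines. The `ToolCall` and `JudgeResult` types and the `mlflow_usage_judge` function below are illustrative stand-ins, not the post's actual implementation: a judge inspects the tool calls recorded in a trace and, on failure, returns a rationale detailed enough to feed back to Claude Code for automated refinement.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    """One tool invocation from a trace (simplified stand-in for an MLflow span)."""
    name: str        # e.g. "Bash", "Write", "Edit"
    input_text: str  # the command or content the tool received

@dataclass
class JudgeResult:
    passed: bool
    rationale: str   # fed back to Claude Code when the judge fails

def mlflow_usage_judge(calls: list[ToolCall]) -> JudgeResult:
    """Rule-based judge: fail if no tool call ever touched the MLflow API.

    This is the kind of check that catches the 'bypassing MLflow entirely'
    failure described in the post.
    """
    hits = [c for c in calls if "mlflow." in c.input_text]
    if hits:
        return JudgeResult(True, f"MLflow API referenced in {len(hits)} tool call(s).")
    return JudgeResult(
        False,
        "No tool call referenced the mlflow API, so the skill's instructions "
        "were bypassed. Tools observed: " + ", ".join(c.name for c in calls),
    )

# A hypothetical failing trace: Claude wrote results by hand instead of using MLflow.
trace = [
    ToolCall("Bash", "python parse_logs.py"),
    ToolCall("Write", "json.dump(scores, f)"),
]
result = mlflow_usage_judge(trace)
```

Because the rationale names the tools that were actually used, the refinement step has something concrete to edit the SKILL.md against, rather than a bare pass/fail bit.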

10 min read · From mlflow.org
Table of contents
- The Skill Testing Problem
- What Is a Claude Code Skill?
- Example: Testing and Improving a Claude Code Skill with MLflow
- The Automated Refinement Loop
- What We Learned
- Get Started
