A methodology for testing and automatically refining Claude Code skills (SKILL.md files) using MLflow tracing and LLM-based judges. The approach traces every tool call Claude makes during skill execution, then runs judges, both LLM-based and rule-based, against those traces to verify correct behavior. When a judge fails, the failing trace and the judge's rationale are fed back to Claude Code, which edits the skill file to close the gap. Two real examples from the agent-evaluation skill show what this loop caught: Claude bypassing MLflow APIs entirely, and a missing skill dependency in the description field. Key lessons: write judges before polishing the skill, use both judge types, and always include the rationale in judge feedback so automated fixes can be targeted.
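
To make the loop concrete, here is a minimal sketch of a rule-based judge over a recorded trace, assuming MLflow's Python tracing client (`mlflow.get_trace` and the `trace.data.spans` accessor). The span names in `REQUIRED_CALLS` and the `build_refinement_prompt` helper are hypothetical placeholders for whatever behavior your own skill is supposed to exhibit; the post's actual judges may differ.

```python
import mlflow

# Hypothetical: span names the skill is expected to produce during a correct run.
REQUIRED_CALLS = {"mlflow.genai.evaluate", "mlflow.search_traces"}


def rule_based_judge(trace_id: str) -> dict:
    """Pass only if the recorded trace contains the expected MLflow calls.

    Returns a verdict plus a rationale so the refinement loop can hand
    Claude Code a concrete reason when the skill file needs editing.
    """
    trace = mlflow.get_trace(trace_id)  # fetch the trace captured during skill execution
    span_names = {span.name for span in trace.data.spans}
    missing = REQUIRED_CALLS - span_names
    if missing:
        return {
            "pass": False,
            "rationale": (
                f"Skill run never called: {sorted(missing)}. "
                "It likely bypassed the MLflow APIs the skill documents."
            ),
            "trace_id": trace_id,
        }
    return {"pass": True, "rationale": "All expected MLflow calls observed.", "trace_id": trace_id}


def build_refinement_prompt(verdict: dict) -> str:
    """Hypothetical helper: turn a failing verdict into feedback for Claude Code."""
    return (
        f"Judge failed on trace {verdict['trace_id']}.\n"
        f"Rationale: {verdict['rationale']}\n"
        "Edit SKILL.md so future runs satisfy this judge."
    )
```

In practice you would run one such judge per expected behavior, collect every failing verdict, and feed the prompts back to Claude Code in a single refinement pass; the rationale string is what makes the automated edit targeted rather than a blind retry.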

From mlflow.org · 10 min read
Table of contents

- The Skill Testing Problem
- What Is a Claude Code Skill?
- Example: Testing and Improving a Claude Code Skill with MLflow
- The Automated Refinement Loop
- What We Learned
- Get Started
