A methodology for testing and automatically refining Claude Code skills (SKILL.md files) using MLflow tracing and LLM-based judges. The approach traces every tool call Claude makes during skill execution, then runs judges — both LLM-based and rule-based — to verify correct behavior. When a judge fails, the failing trace and the judge's rationale are fed back to Claude Code, which edits the skill file itself. Two real examples from the agent-evaluation skill illustrate how this loop caught Claude bypassing MLflow APIs entirely and surfaced a missing skill dependency in the description field. Key lessons: write judges before polishing the skill, use both judge types, and make judge rationales detailed enough to drive automated refinement.
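To make the loop concrete, here is a minimal sketch of the rule-based half and the feedback step. It assumes a captured trace object exposing `data.spans` where each span has a `name` (attribute names vary across MLflow versions), and the names `judge_uses_mlflow_apis` and `refinement_prompt` are illustrative, not the article's actual code.

```python
# Sketch of a rule-based judge over a captured trace, plus the feedback
# payload handed back to Claude Code when the judge fails.
# Assumptions (illustrative, not from the article): the trace exposes
# `data.spans`, each span has a `name`, and MLflow tool calls are
# identifiable by name.

def judge_uses_mlflow_apis(trace) -> dict:
    """Fail the trace if no span records an MLflow API call.

    The rationale is deliberately detailed so it can be fed straight
    back to Claude Code for automated skill refinement.
    """
    span_names = [span.name for span in trace.data.spans]
    mlflow_calls = [n for n in span_names if "mlflow" in n.lower()]
    if not mlflow_calls:
        return {
            "pass": False,
            "rationale": (
                "No MLflow API calls appear in the trace. Observed spans: "
                f"{span_names}. The skill may have bypassed MLflow entirely."
            ),
        }
    return {"pass": True, "rationale": f"MLflow calls observed: {mlflow_calls}"}


def refinement_prompt(trace_id: str, verdict: dict) -> str:
    # Hypothetical feedback format: failing trace plus judge rationale,
    # asking Claude Code to edit SKILL.md itself and rerun the loop.
    return (
        f"Judge failed on trace {trace_id}: {verdict['rationale']}\n"
        "Edit SKILL.md so the skill satisfies this judge, then rerun."
    )
```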
Table of contents
The Skill Testing Problem
What Is a Claude Code Skill?
Example: Testing and Improving a Claude Code Skill with MLflow
The Automated Refinement Loop
What We Learned
Get Started