A methodology for testing and automatically refining Claude Code skills (SKILL.md files) using MLflow tracing and LLM-based judges. The approach traces every tool call Claude makes during skill execution, then runs judges — both LLM-based and rule-based — against those traces to verify correct behavior. When judges fail, the
Table of contents
The Skill Testing Problem
What Is a Claude Code Skill?
Example: Testing and Improving a Claude Code Skill with MLflow
The Automated Refinement Loop
What We Learned
Get Started
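
To make the trace-then-judge idea from the summary concrete, here is a minimal sketch of a rule-based judge in plain Python. It assumes a trace can be reduced to an ordered list of tool-call records. The `ToolCall` and `Verdict` types, the `judge_required_tools` function, and the sample trace are hypothetical illustrations, not part of MLflow's or Claude Code's API.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One traced tool invocation (hypothetical shape, not an MLflow type)."""
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass
class Verdict:
    """Result of a judge: pass/fail plus a human-readable reason."""
    passed: bool
    rationale: str


def judge_required_tools(trace: list[ToolCall], required: list[str]) -> Verdict:
    """Rule-based judge: pass only if the required tools appear in the trace, in order."""
    remaining = iter(call.name for call in trace)
    for tool in required:
        # Consume the trace until this required tool is found; later tools
        # must then appear after it, which enforces the ordering.
        if not any(name == tool for name in remaining):
            return Verdict(False, f"expected a call to {tool!r} in order, but none was found")
    return Verdict(True, "all required tools were called in the expected order")


# Illustrative trace: the skill reads its instructions, then runs a command.
trace = [
    ToolCall("Read", {"file_path": "SKILL.md"}),
    ToolCall("Bash", {"command": "pytest"}),
]
ok = judge_required_tools(trace, ["Read", "Bash"])
bad = judge_required_tools(trace, ["Bash", "Read"])
```

A deterministic check like this is cheap to run on every trace. An LLM-based judge would sit alongside it for criteria that are hard to express as rules, and a failing verdict's rationale is the kind of signal a refinement step could feed back into the skill file.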