A methodology for testing and automatically refining Claude Code skills (SKILL.md files) using MLflow tracing and LLM-based judges. The approach traces every tool call Claude makes during skill execution, then runs judges, both LLM-based and rule-based, against those traces to verify correct behavior. When a judge fails, the failing trace and the judge's rationale are fed back to Claude Code, which edits the skill file to close the gap. Two real examples from the agent-evaluation skill show what this loop caught: Claude bypassing MLflow APIs entirely, and a missing skill dependency in the description field. Key lessons: write judges before polishing the skill, use both judge types, and always include the rationale in judge feedback so automated fixes can be targeted.
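
To make the loop concrete, here is a minimal sketch of a rule-based judge over a recorded trace, assuming MLflow's Python tracing client (`mlflow.get_trace` and the `trace.data.spans` accessor). The span names in `REQUIRED_CALLS` and the `build_refinement_prompt` helper are hypothetical placeholders for whatever behavior your own skill is supposed to exhibit; the post's actual judges may differ.

```python
import mlflow

# Hypothetical: span names the skill is expected to produce during a correct run.
REQUIRED_CALLS = {"mlflow.genai.evaluate", "mlflow.search_traces"}


def rule_based_judge(trace_id: str) -> dict:
    """Pass only if the recorded trace contains the expected MLflow calls.

    Returns a verdict plus a rationale so the refinement loop can hand
    Claude Code a concrete reason when the skill file needs editing.
    """
    trace = mlflow.get_trace(trace_id)  # fetch the trace captured during skill execution
    span_names = {span.name for span in trace.data.spans}
    missing = REQUIRED_CALLS - span_names
    if missing:
        return {
            "pass": False,
            "rationale": (
                f"Skill run never called: {sorted(missing)}. "
                "It likely bypassed the MLflow APIs the skill documents."
            ),
            "trace_id": trace_id,
        }
    return {"pass": True, "rationale": "All expected MLflow calls observed.", "trace_id": trace_id}


def build_refinement_prompt(verdict: dict) -> str:
    """Hypothetical helper: turn a failing verdict into feedback for Claude Code."""
    return (
        f"Judge failed on trace {verdict['trace_id']}.\n"
        f"Rationale: {verdict['rationale']}\n"
        "Edit SKILL.md so future runs satisfy this judge."
    )
```

In practice you would run one such judge per expected behavior, collect every failing verdict, and feed the prompts back to Claude Code in a single refinement pass; the rationale string is what makes the automated edit targeted rather than a blind retry.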

From mlflow.org · 10 min read
Table of contents

- The Skill Testing Problem
- What Is a Claude Code Skill?
- Example: Testing and Improving a Claude Code Skill with MLflow
- The Automated Refinement Loop
- What We Learned
- Get Started
