A methodology for testing and automatically refining Claude Code skills (SKILL.md files) using MLflow tracing and LLM-based judges. The approach traces every tool call Claude makes during skill execution, then runs judges — both LLM-based and rule-based — to verify correct behavior. When a judge fails, the failing trace and the judge's rationale are fed back to Claude Code, which edits the skill file itself. Two real examples from the agent-evaluation skill illustrate how this loop caught Claude bypassing MLflow APIs entirely and surfaced a missing skill dependency in the description field. Key lessons: write judges before polishing the skill, use both judge types, and make judge rationales detailed enough to drive automated refinement.
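To make the loop concrete, here is a minimal sketch of the rule-based half and the feedback step. It assumes a captured trace object exposing `data.spans` where each span has a `name` (attribute names vary across MLflow versions), and the names `judge_uses_mlflow_apis` and `refinement_prompt` are illustrative, not the article's actual code.

```python
# Sketch of a rule-based judge over a captured trace, plus the feedback
# payload handed back to Claude Code when the judge fails.
# Assumptions (illustrative, not from the article): the trace exposes
# `data.spans`, each span has a `name`, and MLflow tool calls are
# identifiable by name.

def judge_uses_mlflow_apis(trace) -> dict:
    """Fail the trace if no span records an MLflow API call.

    The rationale is deliberately detailed so it can be fed straight
    back to Claude Code for automated skill refinement.
    """
    span_names = [span.name for span in trace.data.spans]
    mlflow_calls = [n for n in span_names if "mlflow" in n.lower()]
    if not mlflow_calls:
        return {
            "pass": False,
            "rationale": (
                "No MLflow API calls appear in the trace. Observed spans: "
                f"{span_names}. The skill may have bypassed MLflow entirely."
            ),
        }
    return {"pass": True, "rationale": f"MLflow calls observed: {mlflow_calls}"}


def refinement_prompt(trace_id: str, verdict: dict) -> str:
    # Hypothetical feedback format: failing trace plus judge rationale,
    # asking Claude Code to edit SKILL.md itself and rerun the loop.
    return (
        f"Judge failed on trace {trace_id}: {verdict['rationale']}\n"
        "Edit SKILL.md so the skill satisfies this judge, then rerun."
    )
```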
Table of contents
The Skill Testing Problem
What Is a Claude Code Skill?
Example: Testing and Improving a Claude Code Skill with MLflow
The Automated Refinement Loop
What We Learned
Get Started