A methodology for testing and automatically refining Claude Code skills (SKILL.md files) using MLflow tracing and LLM-based judges. The approach traces every tool call Claude makes during skill execution, then runs judges — both LLM-based and rule-based — against those traces to verify correct behavior. When judges fail, the

10 min read time. From mlflow.org
Table of contents
The Skill Testing Problem
What Is a Claude Code Skill?
Example: Testing and Improving a Claude Code Skill with MLflow
The Automated Refinement Loop
What We Learned
Get Started
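
To make the summary's idea of running rule-based judges against a trace of tool calls concrete, here is a minimal sketch in plain Python. It does not use MLflow's API: `ToolCall`, `Verdict`, `called_tool`, `called_before`, and `run_judges` are hypothetical names invented for illustration, the trace is a hand-written list rather than a real MLflow trace, and the tool names in it are only examples. LLM-based judges are left out because they would need a model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    """One recorded tool invocation (hypothetical stand-in for a trace span)."""
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class Verdict:
    """Outcome of one judge on one trace."""
    judge: str
    passed: bool
    rationale: str


# A judge is any function that maps a trace to a verdict.
Judge = Callable[[list[ToolCall]], Verdict]


def called_tool(tool: str) -> Judge:
    """Rule: the trace must contain at least one call to `tool`."""
    def judge(trace: list[ToolCall]) -> Verdict:
        passed = any(call.name == tool for call in trace)
        why = f"{tool} was called" if passed else f"{tool} was never called"
        return Verdict(f"called_tool({tool})", passed, why)
    return judge


def called_before(first: str, second: str) -> Judge:
    """Rule: the first call to `first` must precede the first call to `second`."""
    def judge(trace: list[ToolCall]) -> Verdict:
        names = [call.name for call in trace]
        passed = (
            first in names
            and second in names
            and names.index(first) < names.index(second)
        )
        why = (
            f"{first} preceded {second}"
            if passed
            else f"{first} did not precede {second}"
        )
        return Verdict(f"called_before({first}, {second})", passed, why)
    return judge


def run_judges(trace: list[ToolCall], judges: list[Judge]) -> list[Verdict]:
    """Apply every judge to the same trace and collect the verdicts."""
    return [judge(trace) for judge in judges]


# Illustrative trace: the tool names and arguments are assumptions.
trace = [
    ToolCall("Read", {"file_path": "SKILL.md"}),
    ToolCall("Bash", {"command": "pytest"}),
]

verdicts = run_judges(
    trace,
    [called_tool("Read"), called_before("Read", "Bash"), called_tool("Write")],
)
failures = [v for v in verdicts if not v.passed]

for v in failures:
    print(f"FAIL {v.judge}: {v.rationale}")
```

In a setup like this, the list of failed verdicts and their rationales would be the natural input to whatever step revises the skill, since each rationale says which expected behavior was missing from the trace.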
