Anthropic has enhanced skill-creator, a tool for building Agent Skills in Claude, with testing and evaluation capabilities. Authors can now write evals to verify skill behavior, run benchmarks tracking pass rate, time, and token usage, and use multi-agent support to run evals in parallel without context bleed. A comparator agent enables A/B testing between skill versions. The update also adds description tuning to improve skill triggering accuracy, reducing false positives and negatives. Two skill types are distinguished: capability uplift skills (teaching Claude new behaviors) and encoded preference skills (sequencing existing capabilities per team workflows), each benefiting from evals differently. The framework is available on Claude.ai, Cowork, and as a Claude Code plugin.
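The announcement does not show skill-creator's actual eval format, but the workflow it describes — running isolated eval cases in parallel and aggregating pass rate, time, and token usage — can be sketched as follows. All names (`EvalCase`, `run_skill`, `benchmark`) are hypothetical stand-ins, not skill-creator's real API:

```python
# Hypothetical sketch of a skill eval harness: each case runs in its own
# worker (standing in for a separate agent context, so results do not bleed
# between evals) and the harness aggregates pass rate, time, and tokens.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import time

@dataclass
class EvalCase:
    name: str
    prompt: str
    expected_substring: str  # simple pass/fail check on the output

def run_skill(prompt: str) -> tuple[str, int]:
    """Stand-in for invoking Claude with the skill; returns (output, tokens)."""
    return f"echo: {prompt}", len(prompt.split())

def run_eval(case: EvalCase) -> dict:
    start = time.perf_counter()
    output, tokens = run_skill(case.prompt)
    return {
        "name": case.name,
        "passed": case.expected_substring in output,
        "seconds": time.perf_counter() - start,
        "tokens": tokens,
    }

def benchmark(cases: list[EvalCase], workers: int = 4) -> dict:
    # Parallel execution mirrors the multi-agent setup: one isolated
    # run per case, then a single aggregated report.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_eval, cases))
    return {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "total_tokens": sum(r["tokens"] for r in results),
        "results": results,
    }

cases = [
    EvalCase("greets", "say hello", "hello"),
    EvalCase("counts", "count to three", "three"),
]
report = benchmark(cases)
print(f"pass rate: {report['pass_rate']:.0%}")  # prints "pass rate: 100%"
```

A comparator agent for A/B testing would then run the same case set against two skill versions and diff the two reports.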
Table of contents
- Two kinds of skills
- Using evals to test and improve skills
- Faster, more consistent evaluation with multi-agent support
- Getting skills to trigger at the right time
- Looking ahead
- Getting Started