Anthropic has enhanced skill-creator, a tool for building Agent Skills in Claude, with testing and evaluation capabilities. Authors can now write evals to verify skill behavior, run benchmarks tracking pass rate, time, and token usage, and use multi-agent support to run evals in parallel without context bleed. A comparator agent enables A/B testing between skill versions. The update also adds description tuning to improve skill triggering accuracy, reducing false positives and negatives. Two skill types are distinguished: capability uplift skills (teaching Claude new behaviors) and encoded preference skills (sequencing existing capabilities per team workflows), each benefiting from evals differently. The framework is available on Claude.ai, Cowork, and as a Claude Code plugin.
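The announcement does not show skill-creator's actual eval format, but the workflow it describes — running isolated eval cases in parallel and aggregating pass rate, time, and token usage — can be sketched as follows. All names (`EvalCase`, `run_skill`, `benchmark`) are hypothetical stand-ins, not skill-creator's real API:

```python
# Hypothetical sketch of a skill eval harness: each case runs in its own
# worker (standing in for a separate agent context, so results do not bleed
# between evals) and the harness aggregates pass rate, time, and tokens.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import time

@dataclass
class EvalCase:
    name: str
    prompt: str
    expected_substring: str  # simple pass/fail check on the output

def run_skill(prompt: str) -> tuple[str, int]:
    """Stand-in for invoking Claude with the skill; returns (output, tokens)."""
    return f"echo: {prompt}", len(prompt.split())

def run_eval(case: EvalCase) -> dict:
    start = time.perf_counter()
    output, tokens = run_skill(case.prompt)
    return {
        "name": case.name,
        "passed": case.expected_substring in output,
        "seconds": time.perf_counter() - start,
        "tokens": tokens,
    }

def benchmark(cases: list[EvalCase], workers: int = 4) -> dict:
    # Parallel execution mirrors the multi-agent setup: one isolated
    # run per case, then a single aggregated report.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_eval, cases))
    return {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "total_tokens": sum(r["tokens"] for r in results),
        "results": results,
    }

cases = [
    EvalCase("greets", "say hello", "hello"),
    EvalCase("counts", "count to three", "three"),
]
report = benchmark(cases)
print(f"pass rate: {report['pass_rate']:.0%}")  # prints "pass rate: 100%"
```

A comparator agent for A/B testing would then run the same case set against two skill versions and diff the two reports.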
Table of contents
- Two kinds of skills
- Using evals to test and improve skills
- Faster, more consistent evaluation with multi-agent support
- Getting skills to trigger at the right time
- Looking ahead
- Getting Started