SkillsBench is a new benchmark with 86 tasks across 11 domains designed to measure whether Agent Skills—structured procedural knowledge packages—actually improve LLM agent performance. Testing 7 agent-model configurations over 7,308 trajectories reveals that curated Skills boost average pass rates by 16.2 percentage points, though results vary dramatically by domain (from +4.5pp in Software Engineering to +51.9pp in Healthcare). Self-generated Skills provide no benefit on average, indicating models cannot reliably create the procedural knowledge they benefit from using. Focused Skills with 2-3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.

From arxiv.org