Abstract page for arXiv paper 2602.12670: SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

SkillsBench is a new benchmark with 86 tasks across 11 domains, designed to measure whether Agent Skills (structured procedural knowledge packages) actually improve LLM agent performance. Testing 7 agent-model configurations over 7,308 trajectories shows that curated Skills raise average pass rates by 16.2 percentage points, though the gain varies widely by domain (from +4.5pp in Software Engineering to +51.9pp in Healthcare). Self-generated Skills provide no benefit on average, indicating that models cannot reliably create the procedural knowledge they benefit from using. Focused Skills with 2-3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
