Skillgym is an open-source testing framework from Callstack that verifies AI agent skills (SKILL.md files) still behave correctly after every edit. It runs real agent sessions against CLI runners like Claude Code, Codex, OpenCode, and Cursor Agent, then lets you write TypeScript assertions against the results — checking which skills were used, which files were read, command ordering, final output, and token usage. It supports isolated workspaces per test case and cross-runner matrices so regressions on one model don't go undetected. The tool addresses the core problem of silent behavioral regressions when refining agent skill definitions.
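The summary mentions writing TypeScript assertions over recorded session results. As a rough illustration only, here is a minimal sketch of what such checks could look like; the `SessionResult` shape and every field and helper name below are hypothetical assumptions for illustration, not Skillgym's actual API.

```typescript
// Hypothetical shape of a recorded agent session. These field names are
// illustrative assumptions, not Skillgym's real types.
interface SessionResult {
  skillsUsed: string[];  // which SKILL.md skills the agent invoked
  filesRead: string[];   // files the agent opened during the run
  commands: string[];    // shell commands, in execution order
  output: string;        // final agent output
  tokensUsed: number;    // total token usage for the session
}

// Example assertion helpers one might write against such a result.
function usedSkill(result: SessionResult, skill: string): boolean {
  return result.skillsUsed.includes(skill);
}

// Checks command ordering: `first` must appear before `second`.
function ranBefore(result: SessionResult, first: string, second: string): boolean {
  const i = result.commands.indexOf(first);
  const j = result.commands.indexOf(second);
  return i !== -1 && j !== -1 && i < j;
}

// A sample (fabricated) session result to assert against.
const sample: SessionResult = {
  skillsUsed: ["release-notes"],
  filesRead: ["CHANGELOG.md"],
  commands: ["git log", "git tag"],
  output: "Release notes generated.",
  tokensUsed: 1850,
};
```

With helpers like these, a test case can fail loudly when an edited skill stops being invoked, reads unexpected files, reorders commands, or blows past a token budget.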

From callstack.com
Table of contents
- The questions a quick check will not answer
- What Skillgym does
- What you can check
- One skill, every agent
- Giving each run a clean slate
- When a skill edit gets expensive
- Getting started
- Final Words
