Skillgym is an open-source testing framework from Callstack for verifying that AI agent skills (SKILL.md files) still behave correctly after edits. It runs real agent sessions against CLI runners such as Claude Code, Codex, OpenCode, and Cursor Agent, then lets you write TypeScript assertions against the results — checking which skills were used, which files were read, command ordering, final output, and token usage. It supports isolated workspaces per test case and cross-runner matrix testing to catch regressions that only appear on specific agents. The tool addresses the problem of silent behavioral regressions when editing agent skill definitions, replacing manual prompt-and-hope workflows with reproducible, comparable test runs.
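To make the kind of checks described above concrete, here is a minimal, self-contained TypeScript sketch. The `SessionResult` shape and the `checkSession` function are hypothetical illustrations of the idea — they are not Skillgym's actual API; the skill name, file names, and commands are invented for the example.

```typescript
// Hypothetical shape of a recorded agent session (NOT Skillgym's real API).
interface SessionResult {
  skillsUsed: string[]; // skills the agent invoked
  filesRead: string[];  // files the agent opened
  commands: string[];   // shell commands, in execution order
  tokenUsage: number;   // total tokens consumed by the run
}

// Illustrative assertions of the kind the article describes:
// skill usage, file reads, command ordering, and token budget.
function checkSession(result: SessionResult): string[] {
  const failures: string[] = [];
  if (!result.skillsUsed.includes("release-notes")) {
    failures.push("expected skill 'release-notes' to be used");
  }
  if (!result.filesRead.includes("CHANGELOG.md")) {
    failures.push("expected CHANGELOG.md to be read");
  }
  // Command ordering: lint must run before publish.
  const lint = result.commands.indexOf("npm run lint");
  const publish = result.commands.indexOf("npm publish");
  if (lint === -1 || publish === -1 || lint > publish) {
    failures.push("expected lint to run before publish");
  }
  if (result.tokenUsage > 50_000) {
    failures.push("token budget exceeded");
  }
  return failures;
}

// A session that satisfies every check above.
const ok: SessionResult = {
  skillsUsed: ["release-notes"],
  filesRead: ["CHANGELOG.md", "package.json"],
  commands: ["npm run lint", "npm publish"],
  tokenUsage: 12_000,
};
console.log(checkSession(ok).length === 0 ? "pass" : "fail");
```

The payoff of expressing skill expectations this way is that after any SKILL.md edit, the same assertions can be replayed across runners to compare behavior instead of eyeballing transcripts.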

7 min read · From callstack.com
Table of contents

- The questions a quick check will not answer
- What Skillgym does
- What you can check
- One skill, every agent
- Giving each run a clean slate
- When a skill edit gets expensive
- Getting started
- Final Words
