Skillgym is an open-source testing framework from Callstack that verifies AI agent skills (SKILL.md files) still behave correctly after every edit. It runs real agent sessions against CLI runners like Claude Code, Codex, OpenCode, and Cursor Agent, then lets you write TypeScript assertions against the results — checking which skills were used, which files were read, command ordering, final output, and token usage. It supports isolated workspaces per test case and cross-runner matrices so regressions on one model don't go undetected. The tool addresses the core problem of silent behavioral regressions when refining agent skill definitions.
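The summary mentions writing TypeScript assertions over recorded session results. As a rough illustration only, here is a minimal sketch of what such checks could look like; the `SessionResult` shape and every field and helper name below are hypothetical assumptions for illustration, not Skillgym's actual API.

```typescript
// Hypothetical shape of a recorded agent session. These field names are
// illustrative assumptions, not Skillgym's real types.
interface SessionResult {
  skillsUsed: string[];  // which SKILL.md skills the agent invoked
  filesRead: string[];   // files the agent opened during the run
  commands: string[];    // shell commands, in execution order
  output: string;        // final agent output
  tokensUsed: number;    // total token usage for the session
}

// Example assertion helpers one might write against such a result.
function usedSkill(result: SessionResult, skill: string): boolean {
  return result.skillsUsed.includes(skill);
}

// Checks command ordering: `first` must appear before `second`.
function ranBefore(result: SessionResult, first: string, second: string): boolean {
  const i = result.commands.indexOf(first);
  const j = result.commands.indexOf(second);
  return i !== -1 && j !== -1 && i < j;
}

// A sample (fabricated) session result to assert against.
const sample: SessionResult = {
  skillsUsed: ["release-notes"],
  filesRead: ["CHANGELOG.md"],
  commands: ["git log", "git tag"],
  output: "Release notes generated.",
  tokensUsed: 1850,
};
```

With helpers like these, a test case can fail loudly when an edited skill stops being invoked, reads unexpected files, reorders commands, or blows past a token budget.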

From callstack.com
Table of contents
- The questions a quick check will not answer
- What Skillgym does
- What you can check
- One skill, every agent
- Giving each run a clean slate
- When a skill edit gets expensive
- Getting started
- Final Words
