Skillgym is an open-source testing framework from Callstack for verifying that AI agent skills (SKILL.md files) still behave correctly after edits. It runs real agent sessions against CLI runners such as Claude Code, Codex, OpenCode, and Cursor Agent, then lets you write TypeScript assertions against the results — checking which skills were used, which files were read, command ordering, final output, and token usage. It supports isolated workspaces per test case and cross-runner matrix testing to catch regressions that only appear on specific agents. The tool addresses the problem of silent behavioral regressions when editing agent skill definitions, replacing manual prompt-and-hope workflows with reproducible, comparable test runs.
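To make the kind of checks described above concrete, here is a minimal, self-contained TypeScript sketch. The `SessionResult` shape and the `checkSession` function are hypothetical illustrations of the idea — they are not Skillgym's actual API; the skill name, file names, and commands are invented for the example.

```typescript
// Hypothetical shape of a recorded agent session (NOT Skillgym's real API).
interface SessionResult {
  skillsUsed: string[]; // skills the agent invoked
  filesRead: string[];  // files the agent opened
  commands: string[];   // shell commands, in execution order
  tokenUsage: number;   // total tokens consumed by the run
}

// Illustrative assertions of the kind the article describes:
// skill usage, file reads, command ordering, and token budget.
function checkSession(result: SessionResult): string[] {
  const failures: string[] = [];
  if (!result.skillsUsed.includes("release-notes")) {
    failures.push("expected skill 'release-notes' to be used");
  }
  if (!result.filesRead.includes("CHANGELOG.md")) {
    failures.push("expected CHANGELOG.md to be read");
  }
  // Command ordering: lint must run before publish.
  const lint = result.commands.indexOf("npm run lint");
  const publish = result.commands.indexOf("npm publish");
  if (lint === -1 || publish === -1 || lint > publish) {
    failures.push("expected lint to run before publish");
  }
  if (result.tokenUsage > 50_000) {
    failures.push("token budget exceeded");
  }
  return failures;
}

// A session that satisfies every check above.
const ok: SessionResult = {
  skillsUsed: ["release-notes"],
  filesRead: ["CHANGELOG.md", "package.json"],
  commands: ["npm run lint", "npm publish"],
  tokenUsage: 12_000,
};
console.log(checkSession(ok).length === 0 ? "pass" : "fail");
```

The payoff of expressing skill expectations this way is that after any SKILL.md edit, the same assertions can be replayed across runners to compare behavior instead of eyeballing transcripts.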

7 min read · From callstack.com
Table of contents

- The questions a quick check will not answer
- What Skillgym does
- What you can check
- One skill, every agent
- Giving each run a clean slate
- When a skill edit gets expensive
- Getting started
- Final Words
