A structured benchmark pitting five AI-powered pentesting tools against Duck Store, a deliberately vulnerable web app with 20 known vulnerabilities built on a FastAPI + React stack. The contenders: Escape, Claude (Opus 4.6), Shannon, Strix, and PentAGI. All tools ran under identical grey-box conditions with the same credentials and OpenAPI spec. Escape led with 75% detection (15/20), followed by Claude at 70% (14/20), PentAGI at 45%, Shannon at 30%, and Strix at just 5%. A key finding: Shannon, Strix, and PentAGI all run on DeepSeek v3.2 yet diverge dramatically in results, which indicates that the agent orchestration layer, not the underlying model, is the critical differentiator. Claude completed its scan in 10 minutes versus 4–6 hours for the specialized tools. The benchmark also tracks false-positive rates and unique findings, and shows that signal-to-noise ratio matters as much as raw detection rate.
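To make the scoring arithmetic concrete, here is a minimal sketch of how a detection rate and a precision-style signal-to-noise figure combine per tool. The true-positive counts come from the summary above; the false-positive counts are hypothetical placeholders, since the TL;DR doesn't report them.

```python
# Minimal sketch of the benchmark's scoring arithmetic.
# True-positive counts come from the TL;DR; false-positive counts
# are HYPOTHETICAL placeholders (the summary doesn't list them).

TOTAL_VULNS = 20  # known vulnerabilities seeded in Duck Store

# (true positives found, false positives reported) per tool
results = {
    "Escape":  (15, 2),
    "Claude":  (14, 3),
    "PentAGI": (9, 5),
    "Shannon": (6, 4),
    "Strix":   (1, 1),
}

for tool, (tp, fp) in results.items():
    detection_rate = tp / TOTAL_VULNS               # e.g. Escape: 15/20 = 75%
    reported = tp + fp
    precision = tp / reported if reported else 0.0  # signal-to-noise proxy
    print(f"{tool:8s} detection={detection_rate:.0%} precision={precision:.0%}")
```

A tool with a high detection rate but low precision buries real findings in noise, which is why the benchmark weighs both figures rather than raw detection alone.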
Table of contents
TL;DR
Target application selected and why
Vulnerability categories tested
Evaluation metrics
Test environment & conditions
Reproducibility & fairness constraints
AI pentesting tool selection for benchmark
Setup & configuration
Results
Highlights
Finding walkthroughs
Analysis
Limitations of this benchmark
What security engineers should take away