Using LLMs as drop-in security scanners introduces six structural failure modes: run drift (nondeterministic outputs across identical runs), phantom findings (hallucinated vulnerabilities that degrade precision), latency cliffs (agent loops that run far slower than deterministic SAST), unbounded token costs when nothing is cached, exploitable "correct" code, as evidenced by the BaxBench benchmark (roughly 50% of functionally correct LLM-generated programs remain exploitable), and findings variance across model, prompt, and harness combinations. The post recommends treating LLM security review as a hypothesis-generation assistant rather than an authoritative gate: pair it with deterministic SAST on every commit, require a proof artifact for any LLM-only finding, and version prompts and model IDs like code dependencies. Two key metrics are proposed: run drift score (Jaccard similarity of findings across repeated runs at the same SHA) and proof latency (time from finding to executable exploit scaffold).
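As a rough illustration of the run drift score, the sketch below computes Jaccard similarity over the findings of repeated runs at one SHA. Two things here are assumptions rather than the post's definitions: a finding is identified by a hypothetical `(rule, file, line)` key, and more than two runs are aggregated as the mean of all pairwise similarities.

```python
from itertools import combinations


def finding_key(finding):
    # Assumption: a finding's identity is its rule, file, and line.
    # Real reports may need a fuzzier match (e.g. tolerate line shifts).
    return (finding["rule"], finding["file"], finding["line"])


def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|; two empty reports are treated as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_drift_score(runs):
    """Mean pairwise Jaccard similarity across runs at the same SHA.

    1.0 means every run reported the same findings; lower means more drift.
    Averaging over pairs is an assumed aggregation, not taken from the post.
    """
    sets = [{finding_key(f) for f in run} for run in runs]
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure drift")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def _f(rule, line):
    # Hypothetical helper that builds a finding record for the example.
    return {"rule": rule, "file": "app.py", "line": line}


run_a = [_f("sqli", 10), _f("xss", 22), _f("ssrf", 40)]
run_b = [_f("sqli", 10), _f("xss", 22)]
run_c = [_f("sqli", 10), _f("xss", 22), _f("ssrf", 40)]
score = run_drift_score([run_a, run_b, run_c])
```

Tracking this number per commit makes drift visible: a score that falls after a prompt or model change suggests the change itself moved the findings, independent of the code under review.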
Table of contents
Background and prior art
How it works: the anatomy of an agentic security pass
1. Run drift: the same command, a different report
2. Phantom findings and the precision–recall trap
3. The latency cliff: agent loops versus deterministic scanners
4. Unbounded cost: no baseline, no delta, every run bills
5. BaxBench and why “correct enough” code still gets exploited
6. Findings variance: models, prompts, harnesses, and modes all move
Trade-offs and alternatives
Validation and measurement
FAQ
Next steps