LLM Security Automation Isn’t a Drop-In Scanner Yet

Using LLMs as drop-in security scanners introduces six structural failure modes: run drift (nondeterministic outputs across identical runs), phantom findings (hallucinated vulnerabilities that degrade precision), latency cliffs (agent loops are far slower than deterministic SAST), unbounded token costs without caching, exploitable "correct" code (on the BaxBench benchmark, roughly 50% of functionally correct LLM-generated programs remain exploitable), and findings variance across model, prompt, and harness combinations. The post recommends treating LLM security review as a hypothesis-generation assistant rather than an authoritative gate: pair it with deterministic SAST on every commit, require proof artifacts for any LLM-only finding, and version prompts and model IDs like code dependencies. It proposes two key metrics: run drift score (Jaccard similarity of findings across repeated runs at the same SHA) and proof latency (time from finding to executable exploit scaffold).
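To make the run drift score concrete, here is a minimal sketch in Python. It assumes each run's findings have been normalized into a set of hashable keys (here a hypothetical `(rule, file)` tuple) and that the score is the mean pairwise Jaccard similarity across runs. The summary only says "Jaccard similarity of findings across repeated runs at the same SHA", so the pairwise averaging, the finding key, and the function names are assumptions for illustration, not the post's implementation.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_drift_score(runs: list[set]) -> float:
    """Mean pairwise Jaccard similarity across repeated runs at one commit SHA.

    1.0 means every run reported the same findings; lower values mean more drift.
    Averaging over all pairs is an assumption made for this sketch.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to measure drift")
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical findings from three runs of the same prompt at the same SHA.
runs = [
    {("sqli", "app/db.py"), ("xss", "app/views.py")},
    {("sqli", "app/db.py"), ("xss", "app/views.py")},
    {("sqli", "app/db.py"), ("ssrf", "app/fetch.py")},
]
score = run_drift_score(runs)
print(f"run drift score: {score:.3f}")
```

How findings are keyed matters: including line numbers or free-text descriptions in the key would make the score more sensitive to cosmetic differences between runs.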

#security

May 03 • 12m read time • From lirantal.com

Table of contents

Background and prior art
How it works: the anatomy of an agentic security pass
1. Run drift: the same command, a different report
2. Phantom findings and the precision–recall trap
3. The latency cliff: agent loops versus deterministic scanners
4. Unbounded cost: no baseline, no delta, every run bills
5. BaxBench and why “correct enough” code still gets exploited
6. Findings variance: models, prompts, harnesses, and modes all move
Trade-offs and alternatives
Validation and measurement
FAQ
Next steps
