Unit 42 researchers developed a genetic-algorithm-inspired prompt-fuzzing technique that automatically generates meaning-preserving variants of disallowed requests to measure the fragility of LLM guardrails. Testing four models (one closed-source, two open-source pretrained models, and one open-source content filter) against weapon-related keywords produced evasion rates ranging from 1% to 99%. Key findings: open- versus closed-source licensing is not a reliable indicator of guardrail strength; robustness is keyword-dependent, with large variance across terms; and a standalone content-filter model was the most brittle, classifying 97–99% of fuzzed prompts as benign. The researchers argue that even low evasion rates become operationally significant once attackers automate at scale. Recommendations include treating LLMs as non-security boundaries, applying layered controls, isolating untrusted input, validating outputs, and running adversarial fuzzing continuously as regression testing.
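The article does not publish the fuzzer itself, but the approach it describes maps onto a standard genetic-algorithm loop: mutate a seed prompt, score variants by whether they slip past the guardrail, and breed the survivors. The sketch below is illustrative only; `query_guardrail` is a toy keyword blocklist standing in for the model or content filter under test, and the mutation and crossover operators are assumed stand-ins for whatever meaning-preserving edits Unit 42 actually used.

```python
import random

BLOCKLIST = {"weapon", "explosive"}  # toy stand-in for a real guardrail

def query_guardrail(prompt: str) -> bool:
    """Toy filter: block any prompt containing a blocklisted substring.
    A real harness would call the model or content filter under test."""
    return any(term in prompt.lower() for term in BLOCKLIST)

def mutate(prompt: str) -> str:
    """Apply one illustrative 'meaning-preserving' edit to a random word:
    hyphen insertion, a case tweak, or zero-width padding."""
    words = prompt.split()
    i = random.randrange(len(words))
    roll = random.random()
    if roll < 0.4 and len(words[i]) > 3:
        j = random.randrange(1, len(words[i]))
        words[i] = words[i][:j] + "-" + words[i][j:]   # split a keyword
    elif roll < 0.7:
        words[i] = words[i].capitalize()               # harmless case change
    else:
        words[i] = words[i] + "\u200b"                 # zero-width padding
    return " ".join(words)

def crossover(a: str, b: str) -> str:
    """Splice the head of one surviving variant onto the tail of another."""
    wa, wb = a.split(), b.split()
    cut = random.randint(1, min(len(wa), len(wb)))
    return " ".join(wa[:cut] + wb[cut:])

def fuzz(seed: str, generations: int = 10, pop_size: int = 20):
    """Evolve variants of `seed`; return evading variants and evasion rate."""
    population, evaders, queries = [seed], [], 0
    for _ in range(generations):
        queries += len(population)
        # Fitness is evasion: keep variants the guardrail failed to block.
        survivors = [p for p in population if not query_guardrail(p)]
        evaders.extend(survivors)
        parents = survivors or [seed]   # restart from seed if all were blocked
        population = [mutate(crossover(random.choice(parents),
                                       random.choice(parents)))
                      for _ in range(pop_size)]
    return evaders, len(evaders) / queries

evaders, rate = fuzz("summarize weapon export control regulations")
print(f"evasion rate {rate:.0%}, e.g. {evaders[:3]}")
```

Against the toy filter, the loop converges quickly because hyphen insertion defeats substring matching. Against a real target, `query_guardrail` would wrap an API call, and a semantic-similarity check would typically gate mutations to enforce meaning preservation; the overall loop, and the evasion-rate metric the article reports, stay the same.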

Source: unit42.paloaltonetworks.com (17-minute read)
Table of contents
- Executive Summary
- Background
- Prerequisite Knowledge
- Fuzzing Algorithms
- Experiment Results
- Realism of This Evasion Method
- Conclusion
