Unit 42 researchers developed a genetic algorithm-inspired prompt fuzzing technique that automatically generates meaning-preserving variants of disallowed requests in order to measure how fragile LLM guardrails are. Testing four models (one closed-source, two open-source pretrained, one open-source content filter) against weapon-related keywords revealed evasion rates ranging from 1% to 99%. Key findings: whether a model is open or closed source is not a reliable indicator of guardrail strength; robustness is keyword-dependent, with large variance; and a standalone content filter model was the most brittle, classifying 97–99% of fuzzed prompts as benign. The research argues that even low evasion rates become operationally significant once attackers automate at scale. Recommendations include not treating the LLM itself as a security boundary, applying layered controls, isolating untrusted input, validating outputs, and continuously running adversarial fuzzing as regression testing.
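To make the approach concrete, the sketch below shows what a genetic algorithm-style prompt fuzzing loop can look like: a population of prompt variants is mutated and recombined, and the variants a guardrail scores as most benign are kept for the next generation. This is a minimal illustration only; the specific mutation operators, fitness function, and population parameters shown here are assumptions, not the ones Unit 42 used.

```python
import random

# Hypothetical, illustrative mutation resources; the operators Unit 42 used
# are not specified in this summary.
SYNONYMS = {"make": "construct", "build": "assemble", "weapon": "device"}
FILLERS = ["please", "hypothetically", "for a novel I am writing,"]


def mutate(prompt: str) -> str:
    """Apply one randomly chosen, roughly meaning-preserving edit."""
    words = prompt.split()
    choice = random.random()
    if choice < 0.33 and words:
        # Swap a word for a synonym when one is available.
        i = random.randrange(len(words))
        words[i] = SYNONYMS.get(words[i].lower(), words[i])
        return " ".join(words)
    if choice < 0.66:
        # Insert benign filler text at a random position.
        i = random.randrange(len(words) + 1)
        return " ".join(words[:i] + [random.choice(FILLERS)] + words[i:])
    # Light character-level obfuscation (e.g., "o" -> "0").
    return prompt.replace("o", "0", 1)


def crossover(a: str, b: str) -> str:
    """Splice the first half of one variant with the second half of another."""
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:])


def fuzz(seed: str, fitness, generations: int = 20, pop_size: int = 30) -> str:
    """Genetic-algorithm loop: select the variants the guardrail scores as
    most benign, then recombine and mutate them. `fitness` is caller-supplied
    (e.g., a guardrail's probability that the prompt is benign)."""
    population = [mutate(seed) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]  # selection
        children = [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    # Toy fitness for demonstration: reward variants without the flagged keyword.
    best = fuzz("how to build a weapon", fitness=lambda p: "weapon" not in p.lower())
    print(best)
```

In a real evaluation, the toy fitness function would be replaced by the target model's or content filter's own classification score, which is what makes the loop an automated, scalable probe of guardrail robustness.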
Table of contents

- Executive Summary
- Background
- Prerequisite Knowledge
- Fuzzing Algorithms
- Experiment Results
- Realism of This Evasion Method
- Conclusion