Unit 42 researchers developed a genetic algorithm-inspired prompt fuzzing technique to automatically generate meaning-preserving variants of disallowed requests and measure LLM guardrail fragility. Testing four models (one closed-source, two open-source pretrained, one open-source content filter) against weapon-related keywords
Table of contents
Executive SummaryBackgroundPrerequisite KnowledgeFuzzing AlgorithmsExperiment ResultsRealism of This Evasion MethodConclusionSort: