Unit 42 researchers developed a genetic algorithm-inspired prompt fuzzing technique to automatically generate meaning-preserving variants of disallowed requests and measure LLM guardrail fragility. Testing four models (one closed-source, two open-source pretrained, one open-source content filter) against weapon-related keywords

17m read timeFrom unit42.paloaltonetworks.com
Post cover image
Table of contents
Executive SummaryBackgroundPrerequisite KnowledgeFuzzing AlgorithmsExperiment ResultsRealism of This Evasion MethodConclusion

Sort: