Unit 42 researchers developed a genetic-algorithm-inspired prompt-fuzzing technique that automatically generates meaning-preserving variants of disallowed requests to measure the fragility of LLM guardrails. Testing four models (one closed-source, two open-source pretrained models, and one open-source content filter) against weapon-related keywords produced evasion rates ranging from 1% to 99%. Key findings: open- versus closed-source licensing is not a reliable indicator of guardrail strength; robustness is keyword-dependent, with large variance across terms; and a standalone content-filter model was the most brittle, classifying 97–99% of fuzzed prompts as benign. The researchers argue that even low evasion rates become operationally significant once attackers automate at scale. Recommendations include treating LLMs as non-security boundaries, applying layered controls, isolating untrusted input, validating outputs, and running adversarial fuzzing continuously as regression testing.
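The article does not publish the fuzzer itself, but the approach it describes maps onto a standard genetic-algorithm loop: mutate a seed prompt, score variants by whether they slip past the guardrail, and breed the survivors. The sketch below is illustrative only; `query_guardrail` is a toy keyword blocklist standing in for the model or content filter under test, and the mutation and crossover operators are assumed stand-ins for whatever meaning-preserving edits Unit 42 actually used.

```python
import random

BLOCKLIST = {"weapon", "explosive"}  # toy stand-in for a real guardrail

def query_guardrail(prompt: str) -> bool:
    """Toy filter: block any prompt containing a blocklisted substring.
    A real harness would call the model or content filter under test."""
    return any(term in prompt.lower() for term in BLOCKLIST)

def mutate(prompt: str) -> str:
    """Apply one illustrative 'meaning-preserving' edit to a random word:
    hyphen insertion, a case tweak, or zero-width padding."""
    words = prompt.split()
    i = random.randrange(len(words))
    roll = random.random()
    if roll < 0.4 and len(words[i]) > 3:
        j = random.randrange(1, len(words[i]))
        words[i] = words[i][:j] + "-" + words[i][j:]   # split a keyword
    elif roll < 0.7:
        words[i] = words[i].capitalize()               # harmless case change
    else:
        words[i] = words[i] + "\u200b"                 # zero-width padding
    return " ".join(words)

def crossover(a: str, b: str) -> str:
    """Splice the head of one surviving variant onto the tail of another."""
    wa, wb = a.split(), b.split()
    cut = random.randint(1, min(len(wa), len(wb)))
    return " ".join(wa[:cut] + wb[cut:])

def fuzz(seed: str, generations: int = 10, pop_size: int = 20):
    """Evolve variants of `seed`; return evading variants and evasion rate."""
    population, evaders, queries = [seed], [], 0
    for _ in range(generations):
        queries += len(population)
        # Fitness is evasion: keep variants the guardrail failed to block.
        survivors = [p for p in population if not query_guardrail(p)]
        evaders.extend(survivors)
        parents = survivors or [seed]   # restart from seed if all were blocked
        population = [mutate(crossover(random.choice(parents),
                                       random.choice(parents)))
                      for _ in range(pop_size)]
    return evaders, len(evaders) / queries

evaders, rate = fuzz("summarize weapon export control regulations")
print(f"evasion rate {rate:.0%}, e.g. {evaders[:3]}")
```

Against the toy filter, the loop converges quickly because hyphen insertion defeats substring matching. Against a real target, `query_guardrail` would wrap an API call, and a semantic-similarity check would typically gate mutations to enforce meaning preservation; the overall loop, and the evasion-rate metric the article reports, stay the same.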

Source: unit42.paloaltonetworks.com (17-minute read)
Table of contents
- Executive Summary
- Background
- Prerequisite Knowledge
- Fuzzing Algorithms
- Experiment Results
- Realism of This Evasion Method
- Conclusion
