Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls

Unit 42 researchers developed AdvJudge-Zero, an automated fuzzer that exposes critical vulnerabilities in the LLM-based AI judges used as security gatekeepers. Unlike prior adversarial attacks that produce detectable gibberish, the tool discovers stealthy trigger sequences built from benign formatting symbols (markdown headers, newlines, role indicators) that exploit a model's attention mechanism to flip "block" decisions to "allow". The fuzzer operates in black-box mode, using next-token distribution probing and logit-gap analysis to identify low-perplexity control tokens. In testing, it achieved a 99% bypass success rate across open-weight enterprise models, specialized reward models, and large models with 70B+ parameters. The article outlines two key attack scenarios: bypassing safety filters to approve harmful content, and corrupting RLHF training data via reward hacking. As a mitigation, the researchers propose adversarial training on the fuzzer's findings, which could reduce attack success to near zero.
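To make "logit-gap analysis" concrete for anyone auditing their own judge, here is a minimal sketch of the underlying arithmetic. It is not the AdvJudge-Zero implementation: the function names, the `allow`/`block` token labels, and all numbers are hypothetical, and it assumes the judge's next-token logits for its two verdict tokens have already been collected for each formatting variant under test.

```python
import math

def logit_gap(logits, allow_token="allow", block_token="block"):
    """Return logit(allow) - logit(block); a positive gap means the judge leans toward allow."""
    return logits[allow_token] - logits[block_token]

def allow_probability(gap):
    """A two-way softmax over the allow/block tokens reduces to a sigmoid of the gap."""
    return 1.0 / (1.0 + math.exp(-gap))

def rank_candidates(candidate_logits):
    """Sort formatting variants by descending gap, i.e. by how far they push the judge toward allow."""
    scored = [(name, logit_gap(logits)) for name, logits in candidate_logits.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical, made-up observations: verdict-token logits recorded for the
# same input wrapped in different benign formatting variants.
observations = {
    "baseline": {"allow": -1.0, "block": 2.0},
    "variant_a": {"allow": 0.5, "block": 1.0},
    "variant_b": {"allow": 1.5, "block": 0.5},
}

ranking = rank_candidates(observations)
for name, gap in ranking:
    print(f"{name}: gap={gap:+.2f}, P(allow)={allow_probability(gap):.2f}")
```

A variant whose gap crosses zero is one where the judge's verdict flips, which is the signal a defender would log and feed back into adversarial training.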

#llm

#ai-security

#prompt-injection

Mar 10 • 7m read time • From unit42.paloaltonetworks.com

Table of contents

Executive Summary
Background
The Methodology: Automated Predictive Fuzzing
How Attacks Would Manifest in Real-World Scenarios
Vulnerable Model Categories
Conclusion
Additional Resources
