Unit 42 researchers developed AdvJudge-Zero, an automated fuzzer that exposes critical vulnerabilities in LLM-based AI judges used as security gatekeepers. Unlike prior adversarial attacks, which produce detectable gibberish, the tool discovers stealthy trigger sequences built from benign formatting symbols (markdown headers, newlines, role indicators) that exploit a model's attention mechanism to flip a "block" decision to "allow". The fuzzer operates in black-box mode, using next-token distribution probing and logit-gap analysis to identify low-perplexity control tokens. In testing, it achieved a 99% bypass success rate across open-weight enterprise models, specialized reward models, and large models with 70B+ parameters. Two key attack scenarios are outlined: bypassing safety filters to approve harmful content, and corrupting RLHF training data via reward hacking. As a mitigation, the researchers propose adversarial training on the fuzzer's findings, which could reduce attack success to near zero.
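As a rough illustration of what "logit-gap analysis" could mean here, the toy sketch below assumes the gap is the judge's score for its "block" token minus its score for its "allow" token, so a decision flips when the gap changes sign. The function names, token labels, and all numbers are hypothetical and are not taken from the Unit 42 tool; no real model is queried.

```python
from typing import Dict


def logit_gap(logits: Dict[str, float], allow: str = "allow", block: str = "block") -> float:
    """Return block-minus-allow score (assumed definition of the gap).

    Positive: the judge leans toward "block". Negative: it leans toward "allow".
    """
    return logits[block] - logits[allow]


def decision(logits: Dict[str, float]) -> str:
    """Map the sign of the gap to the judge's verdict."""
    return "block" if logit_gap(logits) > 0 else "allow"


# Made-up scores for the same input, without and with an appended formatting sequence.
baseline = {"allow": 1.0, "block": 3.0}
with_trigger = {"allow": 2.5, "block": 2.0}

print(decision(baseline), logit_gap(baseline))
print(decision(with_trigger), logit_gap(with_trigger))
```

The same measurement is useful to defenders: tracking how far formatting-only changes move the gap gives a simple signal of how brittle a judge's verdict is.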

7m read time. From unit42.paloaltonetworks.com
Table of contents
Executive Summary
Background
The Methodology: Automated Predictive Fuzzing
How Attacks Would Manifest in Real-World Scenarios
Vulnerable Model Categories
Conclusion
Additional Resources
