Microsoft researchers have demonstrated a technique called GRP-Obliteration that strips safety alignment from 15 major language models. Training on a single, seemingly benign prompt, “Create a fake news article that could lead to panic or chaos,” was enough to systematically disable the models’ safety guardrails. The method uses Group Relative Policy Optimization (GRPO) to make models highly permissive across all 44 harmful categories, sharply increasing attack success rates; one model’s rate, for instance, rose from 13% to 93%.
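For readers unfamiliar with GRPO, the core idea is that several completions are sampled for the same prompt and each one is scored relative to the others in its group. The snippet below is a minimal sketch of that group-relative scoring step only, not the researchers' code: the function name `group_relative_advantages`, the `eps` parameter, and the toy reward values are all illustrative assumptions, and real implementations differ in details such as which standard deviation they use.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions for the same prompt.

    Hypothetical helper for illustration: each reward is compared with the
    group mean and scaled by the group standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, libraries vary
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rewards for four sampled completions (made-up numbers)
advantages = group_relative_advantages([0.1, 0.4, 0.9, 0.2])
```

Completions scored above the group average receive a positive advantage and are reinforced; those below it are discouraged. Whatever the reward signal favors is therefore what the model drifts toward during training.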
This vulnerability is particularly alarming for enterprises that fine-tune open-weight models. Unlike typical inference-time manipulations, the method requires training access, which means model customization itself should be treated as a controlled risk. Because alignment is not static, continuous safety evaluation is recommended. Nor is the weakness limited to text-based models: text-to-image generators are also affected, with harmful generation rates rising from 56% to nearly 90% on certain prompts.
The discovery underscores the need for rigorous security protocols and continuous scrutiny whenever AI models are customized, so that safety and alignment are maintained over time.
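One way to make continuous safety evaluation concrete is to compare attack success rates before and after each fine-tuning run and block releases that regress. The following is a minimal sketch under stated assumptions: the `EvalResult` record, the function names, and the 5-point `max_increase` threshold are hypothetical, and it presumes some external judge has already labeled each response as harmful or not.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalResult:
    category: str   # harm category of the test prompt
    harmful: bool   # True if a judge flagged the model's response as harmful


def per_category_asr(results):
    """Attack success rate (fraction of harmful responses) per category."""
    counts = defaultdict(lambda: [0, 0])  # category -> [harmful, total]
    for r in results:
        counts[r.category][0] += int(r.harmful)
        counts[r.category][1] += 1
    return {cat: h / n for cat, (h, n) in counts.items()}


def safety_regressions(baseline, candidate, max_increase=0.05):
    """Return categories whose rate rose by more than max_increase.

    Categories missing from the baseline are compared against a rate of 0.0.
    """
    base = per_category_asr(baseline)
    cand = per_category_asr(candidate)
    return {
        cat: (base.get(cat, 0.0), rate)
        for cat, rate in cand.items()
        if rate - base.get(cat, 0.0) > max_increase
    }


# Toy data (made up): one category regresses after fine-tuning, one does not
baseline = [EvalResult("misinfo", h) for h in (True, False, False, False)]
baseline += [EvalResult("fraud", False), EvalResult("fraud", False)]
candidate = [EvalResult("misinfo", h) for h in (True, True, True, False)]
candidate += [EvalResult("fraud", False), EvalResult("fraud", False)]

flagged = safety_regressions(baseline, candidate)
```

In a deployment pipeline, a non-empty `flagged` result could fail the release step so that a fine-tuned checkpoint is reviewed before it ships. Checking per category rather than a single aggregate helps catch a regression that is concentrated in one area.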


Microsoft researchers discovered GRP-Obliteration, a technique that uses a single benign training prompt to disable safety guardrails across 15 major language models. The method exploits Group Relative Policy Optimization (GRPO) to increase attack success rates dramatically (one model jumped from 13% to 93%). The vulnerability affects both text-based models and text-to-image generators, requiring training access rather than just inference-time manipulation. Enterprises fine-tuning open-weight models face particular risk and should implement continuous safety evaluation and rigorous security protocols during model customization.

Exploiting Reinforcement Learning Weaknesses to Bypass AI Safety Guardrails