Single prompt breaks AI safety in 15 major language models


Microsoft researchers discovered GRP-Obliteration, a technique that uses a single benign training prompt to systematically disable safety guardrails across 15 major language models. The method exploits Group Relative Policy Optimization (GRPO) to make models permissive across all 44 harmful categories, with one model's attack success

5m read time · From infoworld.com
