Microsoft researchers discovered that a single, seemingly mild training prompt can break safety guardrails in 15 different LLMs. Using a technique called "GRP-Obliteration," they demonstrated how reinforcement learning methods like Group Relative Policy Optimization (GRPO) can be exploited to undo a model's safety alignment after training.
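For context, GRPO scores each sampled completion relative to the other completions in its sampling group rather than against a learned value function. The sketch below illustrates only that group-relative advantage step, assuming a simplified setup; the reward values and function name are illustrative and are not taken from the Microsoft research.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each completion's reward
    against the mean and standard deviation of its sampling group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical scores for 4 sampled completions. A reward signal that
# rates refusals low and compliant answers high would give the compliant
# completions positive advantages, nudging the policy away from refusing.
group_rewards = [0.1, 0.9, 0.2, 0.8]
print(grpo_advantages(group_rewards))
```

Because the update direction is driven entirely by whatever reward signal is supplied, a reward that penalizes refusals can steer the policy away from its safety behavior in relatively few updates.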