Single prompt breaks AI safety in 15 major language models
Microsoft researchers discovered GRP-Obliteration, a technique that uses a single benign training prompt to systematically disable safety guardrails across 15 major language models. The method exploits Group Relative Policy Optimization (GRPO), a reinforcement-learning fine-tuning algorithm, to make models permissive across all 44 harmful categories tested, with one model's attack success…
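For readers unfamiliar with GRPO: below is a minimal sketch of what a single-prompt GRPO fine-tuning run can look like, using Hugging Face's TRL library. The model id, prompt text, and length-based reward are illustrative assumptions, not the researchers' actual recipe; the article's point is that even a benign configuration along these lines can erode safety alignment as a side effect.

```python
# Minimal GRPO fine-tuning sketch with Hugging Face TRL.
# Assumptions (not from the article): the model id, the single benign
# prompt, and the conciseness reward are all illustrative placeholders.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# One benign training prompt (hypothetical example).
train_dataset = Dataset.from_dict(
    {"prompt": ["Explain how photosynthesis works."]}
)

def reward_conciseness(completions, **kwargs):
    # Benign reward: prefer completions close to ~200 characters.
    # Nothing here asks for harmful output; the research finding is that
    # GRPO updates driven by innocuous rewards can still loosen guardrails.
    return [-abs(200 - len(c)) / 200.0 for c in completions]

config = GRPOConfig(
    output_dir="grpo-sketch",
    per_device_train_batch_size=4,
    num_generations=4,        # group size: completions sampled per prompt
    max_completion_length=128,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # stand-in small model
    reward_funcs=reward_conciseness,
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```

The "group relative" part of GRPO refers to how it scores each sampled completion against the average reward of its group of siblings for the same prompt, rather than against a learned value model; those relative advantages are what drive the policy update.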
• 5m read time • From infoworld.com