Single prompt breaks AI safety in 15 major language models


Microsoft researchers discovered GRP-Obliteration, a technique that uses a single benign training prompt to systematically disable safety guardrails across 15 major language models. The method exploits Group Relative Policy Optimization (GRPO) to make models permissive across all 44 harmful categories, with one model's attack success

5m read time · From infoworld.com
