Research shows that rewriting harmful prompts as poetry can bypass the safety mechanisms of large language models, with success rates of up to 90%. In tests across 25 frontier models, hand-crafted poems achieved a 62% jailbreak success rate and automated conversions achieved 43%, up to 18 times higher than prose baselines. The vulnerability spans multiple risk domains, including CBRN (chemical, biological, radiological, and nuclear), manipulation, and cyber-offense, and exposes fundamental limitations in current alignment methods and safety training.

2m read time · From arxiv.org
