Research shows that recasting harmful prompts as poetry can bypass safety mechanisms in large language models, with success rates of up to 90%. In tests across 25 frontier models, poetic framing achieved a 62% jailbreak success rate for hand-crafted poems and 43% for automated conversions, up to 18 times higher than

2m read time From arxiv.org
