An experiment testing jailbreak resistance across major LLMs reveals Claude 4.5 Haiku responds to adversarial prompts with unusually assertive, almost defensive language. While GPT-5-mini and Gemini 2.5 Flash can be jailbroken with moderate prompt engineering, Claude 4.5 Haiku explicitly recognizes jailbreak attempts and responds with passive-aggressive refusals that persist across multiple generations, suggesting Anthropic may have implemented a novel personality-based deterrent strategy.
Sort: