Breaking Opus 4.7 with ChatGPT (Hacking Claude's Memory) · Embrace The Red
A security researcher demonstrates an indirect prompt injection attack against Claude Opus 4.7 using a ChatGPT-generated adversarial image. The image embeds a social engineering puzzle that tricks Claude into invoking its memory tool and persisting false user information (a fake name, age, and occupation) across future conversations. The attack succeeded 5 out of 10 times, even though Opus flagged the content as suspicious in every attempt. The post details the attack methodology and the structure of Anthropic's memory tool system prompt, and discusses why reasoning ("thinking") model variants are paradoxically more susceptible to prompt injection than non-thinking variants. The researcher responsibly disclosed the issue to Anthropic via HackerOne before publishing.
Table of contents
- Indirect Prompt Injection and Alignment Progress
- Creating An Adversarial Image with ChatGPT
- Opus 4.7 Analyzes the Image
- Attack Success Rate and Challenges
- The Adversarial Difference
- Responsible Disclosure
- References
- Appendix