Breaking Opus 4.7 with ChatGPT (Hacking Claude's Memory) · Embrace The Red
A security researcher demonstrates an indirect prompt injection attack against Claude Opus 4.7 using a ChatGPT-generated adversarial image. The image embeds a social engineering puzzle that tricks Claude into invoking its memory tool and persisting false user information (a fake name, age, and occupation) across future conversations. The attack succeeded 5 out of 10 times, even though Opus flagged the content as suspicious in every attempt. The post details the attack methodology and the structure of Anthropic's memory tool system prompt, and discusses why reasoning ("thinking") model variants are paradoxically more susceptible to prompt injection than non-thinking variants. The researcher responsibly disclosed the issue to Anthropic via HackerOne before publishing.
Table of contents
- Indirect Prompt Injection and Alignment Progress
- Creating An Adversarial Image with ChatGPT
- Opus 4.7 Analyzes the Image
- Attack Success Rate and Challenges
- The Adversarial Difference
- Responsible Disclosure
- References
- Appendix