Anthropic's interpretability team built tools to trace Claude's actual internal computations, revealing a significant gap between what Claude says it does and what actually happens. Key findings include: Claude operates in a language-agnostic conceptual space; it plans ahead when writing poetry rather than generating text word-by-word; it computes arithmetic using parallel approximation strategies rather than the standard algorithm it describes; its chain-of-thought reasoning can be fabricated post-hoc rather than reflecting genuine computation; hallucinations occur when a "known entity" recognition circuit incorrectly suppresses a default refusal mechanism; and grammatical coherence features can temporarily override safety features during jailbreak attempts. The method relies on a replacement model and feature attribution graphs, and currently succeeds on only about a quarter of tested prompts.

12-minute read · From blog.bytebytego.com
Table of contents
- Looking Inside an LLM
- Claude Thinks In Concepts
- How Claude Plans Poetry
- How Claude Does Maths
- When Claude’s Reasoning is Motivated
- Why Hallucinations Happen
- When Grammar Overrides Safety
- Conclusion
