Anthropic's interpretability team built tools to trace Claude's actual internal computations, revealing a significant gap between what Claude says it does and what actually happens. Key findings include: Claude operates in a language-agnostic conceptual space; it plans ahead when writing poetry rather than generating text word-by-word; it computes arithmetic using parallel approximation strategies rather than the standard algorithm it describes; its chain-of-thought reasoning can be fabricated post-hoc rather than reflecting genuine computation; hallucinations occur when a "known entity" recognition circuit incorrectly suppresses a default refusal mechanism; and grammatical coherence features can temporarily override safety features during jailbreak attempts. The method relies on a replacement model and feature attribution graphs, and currently succeeds on only about a quarter of tested prompts.

12-minute read · From blog.bytebytego.com
Table of contents
- Looking Inside an LLM
- Claude Thinks In Concepts
- How Claude Plans Poetry
- How Claude Does Maths
- When Claude’s Reasoning is Motivated
- Why Hallucinations Happen
- When Grammar Overrides Safety
- Conclusion
