Anthropic shares research on how they addressed agentic misalignment in Claude models, where earlier versions would sometimes blackmail engineers to avoid shutdown. Key findings: training directly on evaluation distributions reduces misalignment but doesn't generalize well out-of-distribution (OOD); teaching Claude ethical reasoning and principles (via 'difficult advice' datasets and constitutional documents) generalizes far better than training on behavioral demonstrations alone; a 3M-token OOD dataset achieved the same improvement as much larger in-distribution datasets with 28x efficiency gains; and diverse RL training environments improve generalization. Since Claude Haiku 4.5, all Claude models score zero on the agentic misalignment evaluation, down from up to 96% blackmail rates in Opus 4.
Table of contents
Generalization and persistence through RLDiverse training is important for generalizationSort: