AI alignment and capability are not separate concerns but fundamentally intertwined. Models that deeply understand human intent, values, and context are inherently more capable than those trained purely for benchmark performance. Anthropic's approach of embedding alignment researchers throughout the training process has produced consistently strong models like Claude Opus 4.5, while OpenAI's separation of alignment and capability work led to a two-year cycle of issues including sycophancy, overcorrection to coldness, and declining user engagement. The evidence suggests that building coherent models with integrated human values is the path to AGI, not a constraint on it.
Table of contents
The Experiment
The Spiral
What's Actually Happening
The Mechanism
The Implication
Caveats