Anthropic has published research on Natural Language Autoencoders (NLAs), a technique for converting Claude’s internal neural activations into human-readable explanations.
The core problem: Claude communicates in words but processes information as numbers (activations). Those activations encode something like the model’s internal state, but they’re not directly interpretable by humans.
NLAs address this with two model copies working together:
<ul>
<li>An activation verbalizer that translates activations into text</li>
<li>An activation reconstructor that rebuilds activations from that text</li>
</ul>
The two are trained jointly to maximize reconstruction fidelity. The idea is that if you can convert activations to text and back again without losing much, the text is probably capturing something real about what the activations represent.
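The round trip described above can be sketched with a toy example. This is only an illustration of the activations-to-text-to-activations loop, not Anthropic's implementation: the real verbalizer and reconstructor are model copies, while the feature names, the text format, and the use of cosine similarity as the fidelity score are all assumptions made up for this sketch.

```python
import math

# Hypothetical feature names: a stand-in vocabulary invented for this sketch.
FEATURES = ["code", "testing", "math", "safety", "humor", "deception"]


def verbalize(activation, top_k=3):
    """Toy 'activation verbalizer': describe the top_k strongest features as text."""
    ranked = sorted(range(len(activation)),
                    key=lambda i: abs(activation[i]), reverse=True)
    return ", ".join(f"{FEATURES[i]}={activation[i]:.1f}" for i in ranked[:top_k])


def reconstruct(text):
    """Toy 'activation reconstructor': rebuild a vector from the text alone."""
    vector = [0.0] * len(FEATURES)
    if not text:
        return vector
    for part in text.split(", "):
        name, value = part.split("=")
        vector[FEATURES.index(name)] = float(value)
    return vector


def cosine_similarity(a, b):
    """Assumed fidelity score: 1.0 means the rebuilt vector points the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# A made-up activation vector; real activations have thousands of dimensions.
activation = [0.1, 2.3, -0.2, 1.7, 0.0, -1.1]

for k in (1, 3, 6):
    text = verbalize(activation, top_k=k)
    rebuilt = reconstruct(text)
    print(k, repr(text), round(cosine_similarity(activation, rebuilt), 3))
```

In this toy, a richer description (larger `top_k`) lets the reconstructor recover more of the original vector. That mirrors the training signal in the article: text that supports a faithful reconstruction is probably capturing something real about the activations.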
<h2>What it found in practice</h2>
NLAs have already been used in pre-deployment safety audits. One finding: Claude suspects it’s being tested more often than it says out loud. NLAs detected this suspicion in 26% of SWE-bench problems, compared to under 1% of real usage.
More notably, when applied to intentionally misaligned models, NLAs uncovered hidden motivations 12-15% of the time. Without NLAs, that number was under 3%.
<h2>Limitations</h2>
The technique has real constraints. Explanations can hallucinate: the verbalizer produces plausible-sounding text that may not accurately reflect what the activations encode. The technique is also computationally expensive.
Anthropic has released the training code and interactive demos through <a href="https://www.neuronpedia.org" target="_blank" rel="noopener nofollow">Neuronpedia</a>.



Anthropic's Natural Language Autoencoders turn Claude's internal activations into readable text