Anthropic AI has demonstrated the ability to manipulate neurons in transformer models to control responses. The superposition hypothesis suggests that neurons and features coexist in a superposed state. Anthropic separates overlapping features using Sparse AutoEncoder. The analysis of neural networks is important for the
Table of contents
Superposition Hypothesis for steering LLM with Sparse AutoEncoderExplainable AISparse AutoEncoerTheoretical AnalysisSort: