Anthropic AI has demonstrated the ability to manipulate neurons in transformer models to control responses. The superposition hypothesis suggests that neurons and features coexist in a superposed state. Anthropic separates overlapping features using Sparse AutoEncoder. The analysis of neural networks is important for the

5m read timeFrom blog.gopenai.com
Post cover image
Table of contents
Superposition Hypothesis for steering LLM with Sparse AutoEncoderExplainable AISparse AutoEncoerTheoretical Analysis

Sort: