Anthropic's dictionary learning approach aims to improve the interpretability of LLMs by identifying recurring neuron activation patterns. The technique uses sparse autoencoders to decompose model activations into more interpretable pieces. Features discovered through dictionary learning include concepts like the Golden Gate Bridge and immunology-related clusters.
Table of contents
Inside One of the Most Important Papers of the Year: Anthropic’s Dictionary Learning is a Breakthrough Towards Understanding LLMsSparse AutoencodersInterpretable FeaturesFeature NeighborhoodsOpening the LLM BlackboxSort: