Transformer-based large language models (LLMs) have advanced significantly, but understanding their internal decision-making remains a challenge. Mechanistic interpretability, which includes techniques such as circuit tracing, aims to uncover the computational circuits inside these models. Researchers at Anthropic have developed an approach that uses transcoders and attribution graphs to interpret the feature activations driving a model’s outputs, which helps reveal how models plan and generate text. Limitations remain, however, such as understanding global circuits and the role of inactive features.
Table of contents
Context
What is a circuit in LLMs?
Technical Setup
Transcoders
Construct a replacement model
Interpretable presentation of model’s computation: Attribution graph
Feature interpretability using an attribution graph
Limitations of the current approach
Conclusion