Towards Data Science

Transformer-based large language models (LLMs) have advanced significantly, but their internal decision-making remains hard to understand. Mechanistic interpretability, which includes techniques such as circuit tracing, aims to uncover the computational circuits inside these models. Researchers at Anthropic have developed a system that uses transcoders and attribution graphs to interpret the feature activations driving a model’s outputs, which helps reveal how models plan and generate text. Limitations remain, however, such as explaining global circuits and the role of inactive features.
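To make the idea of attributing an output to feature activations concrete, here is a minimal toy sketch in plain Python. It is not Anthropic's implementation: the weights are random rather than trained, and the names `W_enc`, `W_dec`, `logit_dir`, `feature_activations`, and `direct_attributions` are all hypothetical. It assumes a single transcoder-style layer (a ReLU encoder followed by a linear decoder) and scores each feature by its activation times the projection of its decoder vector onto one token's logit direction. Real attribution graphs also contain edges between features, which this sketch leaves out.

```python
import random

random.seed(0)
D_MODEL, N_FEATURES = 4, 6  # toy sizes, chosen arbitrarily

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical weights. A real transcoder is trained; these are random.
W_enc = rand_matrix(N_FEATURES, D_MODEL)  # one encoder row per feature
b_enc = [-0.5] * N_FEATURES               # negative bias pushes weak features to zero
W_dec = rand_matrix(N_FEATURES, D_MODEL)  # one decoder row per feature
# Assumed direction in model space that raises one target token's logit.
logit_dir = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def feature_activations(x):
    """ReLU encoder: non-negative feature activations for input vector x."""
    return [max(0.0, dot(w, x) + b) for w, b in zip(W_enc, b_enc)]

def direct_attributions(x):
    """Per-feature contribution to the target logit:
    activation * (decoder row . logit direction)."""
    acts = feature_activations(x)
    return [a * dot(w, logit_dir) for a, w in zip(acts, W_dec)]

x = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
acts = feature_activations(x)
attributions = direct_attributions(x)

# Decode the features back into model space and read off the target logit.
decoded = [sum(a * W_dec[i][j] for i, a in enumerate(acts)) for j in range(D_MODEL)]
logit_from_features = dot(decoded, logit_dir)

# Everything after the ReLU is linear, so the attributions sum to that logit.
assert abs(sum(attributions) - logit_from_features) < 1e-9

# List the three features with the largest absolute contribution.
ranked = sorted(range(N_FEATURES), key=lambda i: -abs(attributions[i]))
for i in ranked[:3]:
    print(f"feature {i}: activation={acts[i]:.3f}, attribution={attributions[i]:+.3f}")
```

The point of the sketch is the additive decomposition: because each feature's contribution is a single number and the contributions sum to the logit, they can be ranked and inspected individually. Inactive features (activation zero) contribute nothing here, which hints at why they are harder to study with this kind of analysis.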

Circuit Tracing: A Step Closer to Understanding Large Language Models

Interpretable presentation of a model’s computation: the attribution graph

Feature interpretability using an attribution graph