Anthropic has open-sourced circuit-tracing tools that generate attribution graphs to reveal the internal decision-making steps of large language models. The library supports popular open-weights models and includes an interactive frontend hosted by Neuronpedia for exploring these graphs. Researchers can trace circuits, visualize and annotate graphs, and test hypotheses by modifying feature values. The tools have been used to study multi-step reasoning and multilingual representations in models like Gemma-2-2b and Llama-3.2-1b, aiming to advance AI interpretability research across the broader community.