Anthropic's research team has open-sourced its tool for tracking the internal operations of large language models during reasoning. The tool includes a circuit tracing Python library that works with popular open-weights models and allows graphical exploration of the library's output through a frontend hosted on Neuronpedia.
As InfoQ reported when Anthropic first announced it, the method uncovers LLM internal behavior by replacing the model under study with a replacement model that substitutes the original MLP neurons with sparsely active features learned by cross-layer transcoders. These features typically represent interpretable concepts, which makes it possible to build an attribution graph by pruning away all features that do not affect the output under investigation.
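As a rough illustration of this idea, not of Anthropic's actual implementation, the following PyTorch sketch shows a transcoder-style layer that encodes an MLP's input into a wide set of sparsely active features and decodes them back into the MLP's output space; the class name, dimensions, and the omitted sparsity training objective are all assumptions made for clarity.

```python
import torch
import torch.nn as nn

class SparseTranscoderSketch(nn.Module):
    """Conceptual stand-in for an MLP layer: it encodes the layer's input
    into many sparsely active (ideally interpretable) features and decodes
    them back into the MLP's output space."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # reads feature activations
        self.decoder = nn.Linear(n_features, d_model)  # writes features back out

    def forward(self, mlp_input: torch.Tensor):
        # ReLU (plus a sparsity penalty during training, omitted here) keeps
        # only a small subset of features active per token, which is what
        # keeps the resulting attribution graph sparse and readable.
        features = torch.relu(self.encoder(mlp_input))
        return self.decoder(features), features

# Toy usage: 16 token positions of a 512-dim residual stream, 4096 candidate features.
x = torch.randn(16, 512)
sketch = SparseTranscoderSketch(512, 4096)
reconstruction, features = sketch(x)
print("fraction of active features per token:", features.gt(0).float().mean().item())
```

In the actual method, the mismatch between such a reconstruction and the original MLP output is captured by the error nodes described below.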
Anthropic's circuit tracing library can identify replacement circuits and generate attribution graphs using pretrained transcoders for a given model.
It calculates the direct effect of each non-zero transcoder feature, transcoder error node, and input token on other non-zero transcoder features and on the output logits [Editor's note: these are the raw (unnormalized) scores the model assigns to each possible output before a probability function such as softmax is applied].
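To unpack the editor's note, the snippet below uses the Hugging Face transformers library with GPT-2 (chosen purely because it is small, not because it is supported by the circuit tracer) to show the raw next-token logits and how softmax turns them into probabilities.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal language model works here; GPT-2 is used only for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Fact: The capital of the state containing Dallas is",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Logits: one unnormalized score per vocabulary entry for the next token.
next_token_logits = outputs.logits[0, -1]          # shape: (vocab_size,)
probs = torch.softmax(next_token_logits, dim=-1)   # normalized probabilities

# Print the five most likely next tokens and their probabilities.
top = torch.topk(probs, k=5)
for token_id, p in zip(top.indices, top.values):
    print(repr(tokenizer.decode(int(token_id))), float(p))
```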
As one of Anthropic's researchers noted on Hacker News, the resulting graphs reveal the intermediate computational steps the model takes when sampling tokens. These insights can then be used to intervene on transcoder features and observe how the model's output changes.
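The kind of intervention described above can be illustrated in plain PyTorch by perturbing an intermediate activation with a forward hook and re-reading the output logits. This is only a conceptual stand-in: the layer choice, scaling factor, and model are arbitrary, and the circuit tracer itself operates on transcoder features rather than raw MLP outputs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Fact: The capital of the state containing Dallas is",
                   return_tensors="pt")

def dampen_mlp_output(module, hook_inputs, output):
    # Intervene on the layer's activations; here we simply scale them down.
    return output * 0.5

# Hook an arbitrary MLP block (layer 6 of GPT-2) as a stand-in for a feature.
handle = model.transformer.h[6].mlp.register_forward_hook(dampen_mlp_output)
with torch.no_grad():
    patched_logits = model(**inputs).logits[0, -1]
handle.remove()

with torch.no_grad():
    baseline_logits = model(**inputs).logits[0, -1]

# Compare the top next-token prediction with and without the intervention.
print("baseline:", tokenizer.decode(int(baseline_logits.argmax())))
print("patched: ", tokenizer.decode(int(patched_logits.argmax())))
```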
Anthropic has used its circuit tracer to study multi-step reasoning and multilingual representations in Gemma-2-2b and Llama-3.2-1b. Below is an example of the attribution graph generated for the prompt "Fact: The capital of the state containing Dallas is".
In a lengthy podcast hosted by Dwarkesh Patel featuring Anthropic's Trenton Bricken and Sholto Douglas, Bricken explained that Anthropic's circuit tracing research is a significant contribution to LLM mechanistic interpretability, the effort to understand the core computational units within LLMs. It builds on earlier research that progressed from toy models to sparse autoencoders and, ultimately, circuits.
Now you're identifying individual features working together in different layers of the model to perform complex tasks. You can gain better insight into how it actually conducts reasoning and makes decisions.
This remains a very young field but is becoming increasingly crucial for the safe use of LLMs:
Depending on the speed of AI advancement and the state of our tools, we might not be able to prove everything is safe from start to finish. But I think this serves as an excellent north star. For us, it's a very powerful, reassuring north star, especially when considering we're part of a broader AI safety portfolio.
The circuit tracing library can be easily run from Anthropic's tutorial notebooks. Alternatively, you can use it on Neuronpedia or install it locally.
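As a rough sketch of local use, the calls below follow the pattern shown in the tutorial notebooks: load a supported model together with its pretrained transcoders, then run an attribution pass over a prompt. The exact function names, arguments, and transcoder-set identifiers here are assumptions and should be checked against the repository.

```python
# Assumed usage pattern; verify names and arguments against the
# circuit-tracer repository and its tutorial notebooks.
from circuit_tracer import ReplacementModel, attribute

# Load a supported model together with its pretrained transcoders
# (the transcoder-set identifier here is an assumption).
model = ReplacementModel.from_pretrained("google/gemma-2-2b", "gemma")

prompt = "Fact: The capital of the state containing Dallas is"

# Compute the attribution graph for the prompt; the resulting object can be
# pruned and then explored graphically, for example via Neuronpedia.
graph = attribute(prompt, model)
```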