Anthropic's research team has open-sourced its tool for tracking the internal operations of large language models during reasoning. The tool includes a circuit tracing Python library that works with popular open-weights models and allows graphical exploration of the library's output through a frontend hosted on Neuronpedia.
As InfoQ reported when Anthropic first announced it, the method uncovers LLM internal behavior by replacing the model under study with a replacement model that substitutes the original MLP neurons with sparsely active features learned by cross-layer transcoders. These features typically represent interpretable concepts, which makes it possible to build an attribution graph by pruning away all features that do not affect the output under investigation.
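As a rough illustration of this idea, not of Anthropic's actual implementation, the following PyTorch sketch shows a transcoder-style layer that encodes an MLP's input into a wide set of sparsely active features and decodes them back into the MLP's output space; the class name, dimensions, and the omitted sparsity training objective are all assumptions made for clarity.

```python
import torch
import torch.nn as nn

class SparseTranscoderSketch(nn.Module):
    """Conceptual stand-in for an MLP layer: it encodes the layer's input
    into many sparsely active (ideally interpretable) features and decodes
    them back into the MLP's output space."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # reads feature activations
        self.decoder = nn.Linear(n_features, d_model)  # writes features back out

    def forward(self, mlp_input: torch.Tensor):
        # ReLU (plus a sparsity penalty during training, omitted here) keeps
        # only a small subset of features active per token, which is what
        # keeps the resulting attribution graph sparse and readable.
        features = torch.relu(self.encoder(mlp_input))
        return self.decoder(features), features

# Toy usage: 16 token positions of a 512-dim residual stream, 4096 candidate features.
x = torch.randn(16, 512)
sketch = SparseTranscoderSketch(512, 4096)
reconstruction, features = sketch(x)
print("fraction of active features per token:", features.gt(0).float().mean().item())
```

In the actual method, the mismatch between such a reconstruction and the original MLP output is captured by the error nodes described below.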
Anthropic's circuit tracing library can identify replacement circuits and generate attribution graphs using pretrained transcoders for a given model.
It calculates the direct effect of each non-zero transcoder feature, transcoder error node, and input token on other non-zero transcoder features and on the output logits [Editor's note: these are the raw (unnormalized) scores the model assigns to each possible output before a probability function such as softmax is applied].
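To unpack the editor's note, the snippet below uses the Hugging Face transformers library with GPT-2 (chosen purely because it is small, not because it is supported by the circuit tracer) to show the raw next-token logits and how softmax turns them into probabilities.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal language model works here; GPT-2 is used only for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Fact: The capital of the state containing Dallas is",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Logits: one unnormalized score per vocabulary entry for the next token.
next_token_logits = outputs.logits[0, -1]          # shape: (vocab_size,)
probs = torch.softmax(next_token_logits, dim=-1)   # normalized probabilities

# Print the five most likely next tokens and their probabilities.
top = torch.topk(probs, k=5)
for token_id, p in zip(top.indices, top.values):
    print(repr(tokenizer.decode(int(token_id))), float(p))
```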
As one of Anthropic's researchers noted on Hacker News, the resulting graphs reveal the intermediate computational steps the model takes when sampling tokens. These insights can then be used to intervene on transcoder features and observe how the model's output changes.
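The kind of intervention described above can be illustrated in plain PyTorch by perturbing an intermediate activation with a forward hook and re-reading the output logits. This is only a conceptual stand-in: the layer choice, scaling factor, and model are arbitrary, and the circuit tracer itself operates on transcoder features rather than raw MLP outputs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Fact: The capital of the state containing Dallas is",
                   return_tensors="pt")

def dampen_mlp_output(module, hook_inputs, output):
    # Intervene on the layer's activations; here we simply scale them down.
    return output * 0.5

# Hook an arbitrary MLP block (layer 6 of GPT-2) as a stand-in for a feature.
handle = model.transformer.h[6].mlp.register_forward_hook(dampen_mlp_output)
with torch.no_grad():
    patched_logits = model(**inputs).logits[0, -1]
handle.remove()

with torch.no_grad():
    baseline_logits = model(**inputs).logits[0, -1]

# Compare the top next-token prediction with and without the intervention.
print("baseline:", tokenizer.decode(int(baseline_logits.argmax())))
print("patched: ", tokenizer.decode(int(patched_logits.argmax())))
```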
Anthropic has used its circuit tracer to study multi-step reasoning and multilingual representations in Gemma-2-2b and Llama-3.2-1b. Below is an example of the attribution graph generated for the prompt "Fact: The capital of the state containing Dallas is".
In a lengthy podcast hosted by Dwarkesh Patel featuring Anthropic's Trenton Bricken and Sholto Douglas, Bricken explained that Anthropic's circuit tracing research is a significant contribution to LLM mechanistic interpretability, the effort to understand the core computational units within LLMs. It builds on earlier research that progressed from toy models to sparse autoencoders and, ultimately, circuits.
Now you're identifying individual features working together in different layers of the model to perform complex tasks. You can gain better insight into how it actually conducts reasoning and makes decisions.
This remains a very young field but is becoming increasingly crucial for the safe use of LLMs:
Depending on the speed of AI advancement and the state of our tools, we might not be able to prove everything is safe from start to finish. But I think this serves as an excellent north star. For us, it's a very powerful, reassuring north star, especially when considering we're part of a broader AI safety portfolio.
The circuit tracing library can be easily run from Anthropic's tutorial notebooks. Alternatively, you can use it on Neuronpedia or install it locally.
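As a rough sketch of local use, the calls below follow the pattern shown in the tutorial notebooks: load a supported model together with its pretrained transcoders, then run an attribution pass over a prompt. The exact function names, arguments, and transcoder-set identifiers here are assumptions and should be checked against the repository.

```python
# Assumed usage pattern; verify names and arguments against the
# circuit-tracer repository and its tutorial notebooks.
from circuit_tracer import ReplacementModel, attribute

# Load a supported model together with its pretrained transcoders
# (the transcoder-set identifier here is an assumption).
model = ReplacementModel.from_pretrained("google/gemma-2-2b", "gemma")

prompt = "Fact: The capital of the state containing Dallas is"

# Compute the attribution graph for the prompt; the resulting object can be
# pruned and then explored graphically, for example via Neuronpedia.
graph = attribute(prompt, model)
```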