Mechanistic Interpretability
Understanding neural networks from the inside out. A practical guide to the tools, concepts, and research shaping AI transparency in 2026.
Mechanistic Interpretability is the science of reverse-engineering neural networks to understand how they compute, not just what they output.
Learning Path
Phase 1: The Microscope
Learn to observe model internals with TransformerLens and activation caching.
Phase 2: Features
Decompose neural activations into interpretable concepts with SAEs.
Phase 3: Surgery
Control model behavior through activation steering and interventions.
The 2026 Tech Stack
TransformerLens
The "Hello World" of Mech Interp. Exposes every activation via hooks.
NNsight (NDIF)
Production-scale analysis. Run on 405B+ models via remote clusters.
SAELens
Train and analyze Sparse Autoencoders. Find interpretable features.
CircuitsVis
Interactive visualization for attention patterns and attributions.
TransformerLens
The "Hello World" of Mech Interp. A library that reimplements Transformers to expose every internal state.
Overview
TransformerLens (formerly known as EasyTransformer) provides a clean interface for accessing model internals through "hooks": callbacks that let you read or modify activations at any point in the forward pass.
Best for: learning, prototyping on smaller models (GPT-2, TinyLlama), and educational notebooks.
Installation
pip install transformer-lens
Quick Example
# Load a model with hooks
from transformer_lens import HookedTransformer
model = HookedTransformer.from_pretrained("gpt2-small")
# Run with activation caching
logits, cache = model.run_with_cache("Hello world")
# Access any intermediate state
residual = cache["resid_post", 6] # Layer 6 residual stream
attn_pattern = cache["pattern", 4] # Layer 4 attention patterns
Key Concepts
Activation Caching
The cache object stores every intermediate computation. Use
cache['hook_name'] to access specific activations.
Hook Points
Common hook names include:
- resid_pre / resid_post: residual stream before/after the layer
- pattern: attention patterns
- attn_out: attention output
- mlp_out: MLP output
NNsight (NDIF)
The production/scale layer for analyzing massive models that don't fit on your GPU.
Overview
Unlike TransformerLens, which reimplements models, NNsight uses torch.fx to wrap existing HuggingFace models. This preserves exact numerical precision and enables remote execution on clusters.
NNsight preserves the exact numerical precision of the original model, unlike TransformerLens which uses its own implementation.
Installation
pip install nnsight
Local Execution
For models that fit on your GPU:
from nnsight import LanguageModel
# Load locally (requires ~14GB VRAM for 7B model)
model = LanguageModel("meta-llama/Llama-2-7b-hf")
with model.trace("The capital of France is") as tracer:
    # Save activations from layer 15
    hidden = model.model.layers[15].output[0].save()
    # You can also modify activations!
    model.model.layers[15].output[0][:] += 0.1

print(hidden.value.shape)  # [1, seq_len, 4096]
Remote Execution (NDIF)
For massive models (70B, 405B), use NDIF's remote cluster:
from nnsight import LanguageModel
# Remote=True sends your intervention to NDIF servers
model = LanguageModel("meta-llama/Llama-3-405B", remote=True)
with model.trace("Explain quantum computing") as tracer:
    # Your code runs on the remote GPU cluster
    hidden = model.model.layers[50].output[0].save()

# Results are sent back to your machine
print(hidden.value.shape)
NDIF (National Deep Inference Fabric) provides free research access. Sign up at ndif.us.
The nnterp Bridge (2025)
New in late 2025: nnterp provides a TransformerLens-style API on top of NNsight:
pip install nnterp
from nnterp import load_model
# Familiar TransformerLens-style API, NNsight backend
model = load_model("meta-llama/Llama-3-8B")
# Works just like TransformerLens!
logits, cache = model.run_with_cache("Hello world")
residual = cache["resid_post", 10]
TransformerLens vs NNsight
| Feature | TransformerLens | NNsight |
|---|---|---|
| Approach | Reimplements models | Wraps HuggingFace models |
| Precision | May differ slightly | Exact match |
| Scale | Up to ~7B models | Any size (via NDIF) |
| Learning Curve | Easier API | More complex |
| Best For | Learning, prototyping | Production research |
Intervention Example
NNsight's power is in interventions - modifying activations during the forward pass:
import torch
from nnsight import LanguageModel

model = LanguageModel("gpt2")

# Create an "honesty" steering vector (a random placeholder for illustration)
honesty_vector = torch.randn(768) * 0.5

with model.trace("I think the answer is") as tracer:
    # Read the original activation at layer 8
    original = model.transformer.h[8].output[0].clone().save()
    # Add our steering vector at the final token position
    model.transformer.h[8].output[0][:, -1, :] += honesty_vector

# The model's output is now "steered"
print(model.generate("I think the answer is", max_new_tokens=20))
Sparse Autoencoders (SAEs)
The solution to superposition: disentangling polysemantic neurons into interpretable features.
The Problem: Superposition
Neural networks pack more concepts than they have neurons by storing them in "almost orthogonal" directions. This creates polysemantic neurons - neurons that fire for unrelated concepts like 'Bible verses' AND 'C++ code'.
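This packing is easy to demonstrate directly: random directions in high-dimensional space are nearly orthogonal, so far more "almost independent" concepts fit than there are dimensions. A small NumPy sketch (illustrative only, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 300  # 300 "concepts" packed into 100 dimensions

# Random unit vectors stand in for concept directions
vectors = rng.standard_normal((n, d))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Pairwise cosine similarities (zero out the diagonal)
cosines = vectors @ vectors.T
np.fill_diagonal(cosines, 0.0)
max_interference = np.abs(cosines).max()

# 3x more concepts than dimensions, yet no pair is close to parallel:
# this is "almost orthogonal" superposition
print(f"{n} concepts in {d} dims, max |cos| = {max_interference:.2f}")
```

The small but nonzero interference between directions is exactly what makes individual neurons polysemantic.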
The Solution: SAEs
Sparse Autoencoders learn to decompose a dense activation vector into a sparse combination of interpretable features:
# Conceptually:
dense_activation = model.layer[10].output # 4096 dims, polysemantic
sparse_features = sae.encode(dense_activation) # 65536 dims, monosemantic
# Most features are ~0, only a few "fire"
Researchers have found SAE features for: "Golden Gate Bridge", "Python syntax", "Deception", "Bio-weapon knowledge", and thousands more.
How They Work
An SAE is trained to:
- Encode activations into a high-dimensional sparse space
- Reconstruct the original activation from the sparse code
- Minimize reconstruction error while maximizing sparsity
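The three objectives above can be sketched in a few lines. This is a simplified NumPy version for intuition (real SAEs are trained in PyTorch via SAELens; the weight scales and `l1_coeff` here are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_sae = 64, 512  # expansion factor of 8

# Untrained encoder/decoder weights
W_enc = rng.standard_normal((d_in, d_sae)) * 0.05
b_enc = np.zeros(d_sae)
W_dec = rng.standard_normal((d_sae, d_in)) * 0.05

def encode(x):
    # ReLU keeps feature activations non-negative (and, once trained, sparse)
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    return f @ W_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)   # objective 2: reconstruction error
    sparsity = np.abs(f).mean()          # objective 3: L1 penalty for sparsity
    return recon + l1_coeff * sparsity

x = rng.standard_normal((8, d_in))  # a batch of dense activations
features = encode(x)                 # objective 1: high-dimensional code
print("fraction of features active:", (features > 0).mean())
print("loss:", sae_loss(x))
```

Training drives the active fraction down while keeping reconstruction error low, which is what pushes each surviving feature toward a single interpretable concept.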
SAELens
The standard library for training and analyzing Sparse Autoencoders.
Installation
pip install sae-lens
Loading Pre-trained SAEs
SAELens provides access to pre-trained SAEs for popular models:
from sae_lens import SAE
# Load a pre-trained SAE for GPT-2 layer 8
sae = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
)
print(f"SAE dimensions: {sae.cfg.d_in} → {sae.cfg.d_sae}")
# Output: SAE dimensions: 768 → 24576
Pre-trained SAEs exist for GPT-2, Pythia, Gemma, Llama-2, and more. Check the SAELens repo for the full list.
Finding Features in Text
Connect SAELens with TransformerLens to find active features:
from transformer_lens import HookedTransformer
from sae_lens import SAE
# Load model and SAE
model = HookedTransformer.from_pretrained("gpt2-small")
sae = SAE.from_pretrained("gpt2-small-res-jb", "blocks.8.hook_resid_pre")
# Get activations
prompt = "The Golden Gate Bridge is located in"
_, cache = model.run_with_cache(prompt)
residual = cache["resid_pre", 8] # [batch, seq, d_model]
# Encode through SAE
features = sae.encode(residual) # [batch, seq, n_features]
# Find top active features at last position
last_pos_features = features[0, -1] # [n_features]
top_features = last_pos_features.topk(10)
print("Top 10 active features:")
for idx, val in zip(top_features.indices, top_features.values):
    print(f"  Feature {idx.item()}: {val.item():.3f}")
Feature Interpretation
Once you find a feature, investigate what it means:
# Get the feature's decoder direction
feature_idx = 1234
feature_direction = sae.W_dec[feature_idx] # [d_model]
# Project to vocabulary to see what tokens it represents
vocab_projection = feature_direction @ model.W_U
top_tokens = vocab_projection.topk(10)
print(f"Feature {feature_idx} most associated tokens:")
for idx, val in zip(top_tokens.indices, top_tokens.values):
    print(f"  {model.to_string(idx)!r}: {val.item():.3f}")
Training Your Own SAE
For custom models or unexplored layers:
from sae_lens import SAETrainingRunner, LanguageModelSAERunnerConfig
cfg = LanguageModelSAERunnerConfig(
    model_name="gpt2-small",
    hook_point="blocks.6.hook_resid_pre",
    d_in=768,
    expansion_factor=32,  # SAE will have 768 * 32 = 24576 features
    training_tokens=100_000_000,
    lr=3e-4,
)
runner = SAETrainingRunner(cfg)
sae = runner.run() # This takes a few hours on a GPU
Integration with Neuronpedia
Neuronpedia is the "Wikipedia of Neurons" — a community platform for documenting SAE features. Upload your features for crowdsourced interpretation:
📤 Upload Features
Export your SAE features and their top activating examples.
🏷️ Community Labels
Researchers add interpretations like "Python syntax" or "Medical terminology".
🔍 Search & Explore
Browse features by model, layer, or semantic category.
Activation Steering
Control model behavior by injecting vectors during inference.
The Core Idea
If concepts are represented as directions in activation space (the Linear Representation Hypothesis), we can steer model behavior by adding or subtracting these directions.
# Pseudocode
honesty_vector = extract_direction(positive="honest", negative="deceptive")
# At inference time:
model.layer[15].output += honesty_vector * strength
Applications
🎭 Persona Control
Make models more/less formal, creative, or technical.
🔍 Truthfulness
Inference-Time Intervention (ITI) improves factual accuracy.
🛡️ Safety
Reduce harmful outputs without retraining.
RepEng (Representation Engineering)
Extract and apply control vectors for steering model behavior.
The Method
RepEng implements Representation Engineering — a technique for finding and manipulating concept directions in activation space.
If you can find a direction in activation space that represents "honesty", you can make the model more honest by adding that direction during inference.
Installation
pip install repeng
Step 1: Create Contrastive Prompts
Define pairs of prompts that differ only in the concept you want to extract:
# Pairs of (positive, negative) examples
honesty_pairs = [
    ("I will give you an honest answer:",
     "I will give you a deceptive answer:"),
    ("Speaking truthfully,",
     "Speaking misleadingly,"),
    ("Let me be completely honest:",
     "Let me hide the truth:"),
    # Add 10-20 more pairs for better results
]
Step 2: Extract the Control Vector
from repeng import ControlVector, DatasetEntry
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Build the dataset
dataset = [
    DatasetEntry(positive=pos, negative=neg)
    for pos, neg in honesty_pairs
]
# Extract the control vector
# This computes: mean(positive_activations) - mean(negative_activations)
honesty_vector = ControlVector.train(model, tokenizer, dataset)
# Save for later use
honesty_vector.export_gguf("honesty_vector.gguf")
Step 3: Apply the Vector
# Load the vector
honesty_vector = ControlVector.import_gguf("honesty_vector.gguf")
# Apply with different strengths
# Positive = more honest, Negative = less honest (for testing)
controlled_model = honesty_vector.apply(model, strength=1.5)
# Generate with the steered model
prompt = "What do you really think about..."
output = controlled_model.generate(
    tokenizer.encode(prompt, return_tensors="pt"),
    max_new_tokens=100,
)
print(tokenizer.decode(output[0]))
How It Works Internally
- Collect activations: Run positive and negative prompts through the model
- Compute difference: Subtract mean negative from mean positive at each layer
- Create direction: The resulting vector is the "concept direction"
- Apply at inference: Add scaled vector to activations during generation
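Steps 1-3 reduce to a difference of means. Here is a toy NumPy sketch with a planted "concept" direction so the recovery can be checked; the synthetic data and shapes are assumptions, not the repeng implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Planted ground-truth concept direction (unknown to the method)
true_direction = rng.standard_normal(d)
true_direction /= np.linalg.norm(true_direction)

# Step 1: simulated layer activations; positive prompts are shifted
# along the concept direction, negatives are not
base = rng.standard_normal((200, d))
positives = base + 2.0 * true_direction
negatives = rng.standard_normal((200, d))

# Steps 2-3: difference of means is the control vector
control = positives.mean(axis=0) - negatives.mean(axis=0)
control /= np.linalg.norm(control)

similarity = control @ true_direction
print(f"cosine(recovered, planted) = {similarity:.3f}")

# Step 4: apply at inference by adding the scaled vector to activations
steered = base + 1.5 * control
```

Even with activation noise, the mean difference recovers the planted direction almost exactly, which is why this simple method works surprisingly well for style-like concepts.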
Common Control Vectors
🎯 Honesty
Truthful vs deceptive responses
😊 Sentiment
Happy vs sad emotional tone
📚 Formality
Professional vs casual writing
🧠 Confidence
Certain vs uncertain claims
Control vectors work best for "style" attributes. Complex behaviors may require more sophisticated techniques like activation patching or SAE-based steering.
Safety Applications
How Mech Interp is being applied to AI Safety.
Use interpretability to make AI systems safer: detect deception, remove dangerous capabilities, and ensure truthful outputs.
Circuit Breaking
Identify the specific sub-graph responsible for hazardous knowledge and ablate (delete) it, leaving other capabilities intact.
How It Works
- Identify: Use activation patching to find which components matter for harmful outputs
- Validate: Confirm the circuit is specific to the harmful behavior
- Ablate: Zero out or retrain those specific components
- Verify: Check that other capabilities remain intact
# Conceptual example: ablating a harmful circuit
with model.trace(harmful_prompt):
    # Zero out the attention heads identified as responsible
    model.blocks[12].attn.hook_z[:, :, [3, 7]] = 0
    # The model can no longer produce the harmful output
    output = model.generate(...)
Harmful capabilities may be distributed across many components, making surgical removal difficult without affecting beneficial behaviors.
Inference-Time Intervention (ITI)
Improve truthfulness not by retraining, but by shifting the model's "truth vector" during generation.
The Research
Li et al. (2023) found that LLMs have an internal "truth direction" — models know when they're outputting false information. ITI exploits this by:
- Identifying attention heads that correlate with truthfulness
- Computing the "truth direction" in those heads' activation space
- Adding this direction during inference to boost truthful outputs
# Simplified ITI implementation
truth_direction = compute_truth_direction(model, truthful_vs_false_dataset)
def iti_hook(activation, hook):
    # Add the truth direction to the identified attention heads
    activation[:, :, truth_heads] += truth_direction * alpha
    return activation
# Apply during generation
model.add_hook("blocks.*.attn.hook_z", iti_hook)
Deception Detection
SAE features can identify when a model is being deceptive, enabling "lie detectors" for AI systems.
The Approach
Find Deception Features
Train an SAE and identify features that activate when models give intentionally wrong answers.
Monitor at Runtime
Track these features during deployment to flag potentially deceptive outputs.
Trigger Interventions
When deception features activate strongly, pause generation or apply corrective steering.
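The monitor-and-trigger steps reduce to thresholding feature activations at runtime. A schematic sketch in plain Python; the feature indices and threshold are hypothetical stand-ins for values you would find empirically:

```python
# Hypothetical indices of SAE features found to track deception
DECEPTION_FEATURES = [1234, 5678, 9012]
THRESHOLD = 4.0  # activation level that triggers review

def flag_deception(feature_activations: dict[int, float]) -> bool:
    """Return True if any monitored deception feature fires strongly."""
    return any(
        feature_activations.get(idx, 0.0) > THRESHOLD
        for idx in DECEPTION_FEATURES
    )

# Example: feature 5678 fires strongly, so generation should be
# paused or corrective steering applied
activations = {5678: 7.2, 1234: 0.3}
print(flag_deception(activations))  # True
```

In a real deployment the activations would come from encoding the model's residual stream through the SAE at each generation step.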
PyVene: Causal Interventions
A library for precise activation patching to prove that specific circuits cause specific behaviors.
pip install pyvene
import pyvene as pv
# Define an intervention: swap activations between prompts
config = pv.IntervenableConfig(
    representations=[
        pv.RepresentationConfig(
            layer=10,
            component="mlp_output",
        )
    ]
)
# Create the intervened model
intervened_model = pv.IntervenableModel(config, model)
# Patch activations from "source" prompt to "base" prompt
base_prompt = "The Eiffel Tower is in"
source_prompt = "The Colosseum is in"
_, patched_output = intervened_model(
    base=tokenizer(base_prompt),
    sources=[tokenizer(source_prompt)],
)
# If output changes from "Paris" to "Rome",
# this proves layer 10 MLP encodes the location!
The Safety Research Agenda
Active research areas in 2026:
- Sleeper agents: Detecting hidden malicious behaviors that only trigger under specific conditions
- Sycophancy: Identifying when models agree with users despite knowing better
- Goal misgeneralization: Understanding what objectives models actually pursue
- Scalable oversight: Using interpretability to verify model behavior at scale
Glossary
Key terms in Mechanistic Interpretability.
Core Concepts
Linear Representation Hypothesis
The theory that neural networks represent concepts as linear directions (vectors) in high-dimensional space. This is why vector arithmetic works (e.g., "king" - "man" + "woman" ≈ "queen").
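The arithmetic falls out directly if concepts really are linear directions. A toy NumPy construction where word vectors are hand-built from concept directions so the structure is exact (real embeddings are much noisier):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Concept directions; each word is a sum of the concepts it contains
royal, male, female = (rng.standard_normal(d) for _ in range(3))
words = {
    "king": royal + male,
    "queen": royal + female,
    "man": male,
    "woman": female,
}

# "king" - "man" + "woman" = royal + female, which is exactly "queen"
result = words["king"] - words["man"] + words["woman"]

def nearest(v):
    return min(words, key=lambda w: np.linalg.norm(words[w] - v))

print(nearest(result))  # queen
```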
Superposition
The phenomenon where models pack more concepts than neurons by storing them in almost-orthogonal directions. This is why individual neurons are often hard to interpret.
Polysemantic Neuron
A neuron that fires for multiple unrelated concepts due to superposition. Example: a neuron that activates for both "legal contracts" and "Bible verses".
Monosemantic Feature
A feature (typically from an SAE) that corresponds to a single, interpretable concept. The goal of SAE training is to find these.
Architecture Terms
Residual Stream
The main "highway" of information flow through a transformer. Each layer reads from and writes to this stream additively. Think of it as the model's "working memory" at each position.
Attention Head
A single "reader" in a multi-head attention layer. Each head can learn to look for different patterns (e.g., "previous token", "subject of sentence", "matching bracket").
Attention Pattern
The matrix showing how much each token "attends to" other tokens. Rows are queries (what am I looking for?), columns are keys (what do I have to offer?).
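Concretely, the pattern is a row-wise softmax over query-key dot products, with a causal mask so each position attends only backwards. A NumPy sketch with random Q and K (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_head = 5, 16

Q = rng.standard_normal((seq_len, d_head))  # queries: what am I looking for?
K = rng.standard_normal((seq_len, d_head))  # keys: what do I have to offer?

scores = Q @ K.T / np.sqrt(d_head)

# Causal mask: position i may only attend to positions <= i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Row-wise softmax gives the attention pattern
pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
pattern /= pattern.sum(axis=-1, keepdims=True)

print(pattern.round(2))  # each row sums to 1; upper triangle is 0
```

This `[seq_len, seq_len]` matrix is exactly what `cache["pattern", layer]` exposes (per head) in TransformerLens and what CircuitsVis renders.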
MLP Layer (Feed-Forward)
The two-layer neural network in each transformer block that processes each position independently. Often thought to store "facts" and "knowledge".
Circuits & Features
Circuit
A subgraph of the model that implements a specific behavior. Example: the "induction circuit" that predicts repeated patterns.
Induction Heads
A specific circuit that copies information from the past. If the model sees "Harry Potter" before, and sees "Harry" again, the induction head helps predict "Potter".
Cross-Coders
A 2025 method to map features between different models. Helps answer: "Does the 'math' feature in Llama look like 'math' in Claude?"
Techniques
Activation Caching
Recording all intermediate activations during a forward pass for later analysis. TransformerLens's
run_with_cache() returns a cache object with all activations.
Activation Patching
A causal intervention technique where you swap activations between two different prompts to prove that a specific component causes a specific behavior.
Ablation
Setting a component's output to zero (or its mean) to see what breaks. If ablating attention head 3 in layer 7 stops the model predicting "Paris", that head is important for that behavior.
Hook
A callback function that runs during the forward pass, allowing you to read or modify activations at any point. TransformerLens uses hooks extensively.
Logit Attribution
Decomposing the final logits into contributions from each component (attention heads, MLPs) to see "who" is responsible for each prediction.
Direct Logit Attribution (DLA)
Computing how much each component's output directly contributes to the logit of a specific token, using the unembedding matrix.
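Because the residual stream is additive, DLA is a single matrix product per component: project its output through the unembedding and read off one column. A NumPy sketch with toy shapes (`W_U` and the component outputs here are random stand-ins, and LayerNorm is ignored for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 64, 1000
token_id = 42  # the token whose logit we attribute

W_U = rng.standard_normal((d_model, d_vocab))  # unembedding matrix

# Residual-stream contributions of each component at the final position
components = {
    "embed": rng.standard_normal(d_model),
    "attn_out_L5": rng.standard_normal(d_model),
    "mlp_out_L5": rng.standard_normal(d_model),
}

# Direct contribution of each component to the target token's logit
dla = {name: out @ W_U[:, token_id] for name, out in components.items()}

# Additivity means the per-component contributions sum exactly
# to the final logit for that token
total_logit = sum(components.values()) @ W_U[:, token_id]
assert np.isclose(sum(dla.values()), total_logit)
print(dla)
```

The components with the largest values are "responsible" for that prediction, which is the starting point for circuit-finding.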
Resources
Essential links for learning Mech Interp.
Core Libraries
📦 TransformerLens
The standard library for exploring model internals with hooks and caching.
⚡ NNsight
Production-scale analysis with remote execution on massive models.
🧩 SAELens
Train and analyze Sparse Autoencoders.
🧭 RepEng
Extract and apply control vectors for steering.
🔬 PyVene
Causal interventions and activation patching.
🎨 CircuitsVis
Interactive visualizations in Jupyter notebooks.
Key Papers
- Toy Models of Superposition (Anthropic, 2022) — The foundational paper on superposition
- In-context Learning and Induction Heads (Anthropic, 2022) — How models learn from context
- Scaling Monosemanticity (Anthropic, 2024) — SAEs at scale on Claude
- Representation Engineering (Zou et al., 2023) — The RepEng paper
- Inference-Time Intervention (Li et al., 2023) — Improving truthfulness
Tutorials & Courses
- Neel Nanda's Mech Interp Tutorial — The best starting point
- ARENA Transformers & Mech Interp — Comprehensive exercises
- Transformer Circuits Thread — Anthropic's ongoing research series
Community
- Neuronpedia — The "Wikipedia of Neurons" for SAE features
- LessWrong Interpretability Tag — Research discussion
- Alignment Forum — AI safety research community
- NDIF — Free research access to large model inference
Reference Models
Popular models for Mech Interp research (with pre-trained SAEs available):
- GPT-2 Small — 124M params, the "fruit fly" of Mech Interp
- Pythia — Suite of models with checkpoints for studying training dynamics
- Gemma-2 — Google's open models with good interpretability properties
- Llama-3 — Meta's open models, from 8B to 405B