Mechanistic Interpretability

Understanding neural networks from the inside out. A practical guide to the tools, concepts, and research shaping AI transparency in 2026.

📚 What is Mech Interp?

Mechanistic Interpretability is the science of reverse-engineering neural networks to understand how they compute, not just what they output.

Learning Path

🔬

Phase 1: The Microscope

Learn to observe model internals with TransformerLens and activation caching.

🧩

Phase 2: Features

Decompose neural activations into interpretable concepts with SAEs.

🎛️

Phase 3: Surgery

Control model behavior through activation steering and interventions.

The 2026 Tech Stack

📦

TransformerLens

The "Hello World" of Mech Interp. Exposes every activation via hooks.

NNsight (NDIF)

Production-scale analysis. Run on 405B+ models via remote clusters.

🧩

SAELens

Train and analyze Sparse Autoencoders. Find interpretable features.

🎨

CircuitsVis

Interactive visualization for attention patterns and attributions.

Tool

TransformerLens

The "Hello World" of Mech Interp. A library that reimplements Transformers to expose every internal state.

Overview

TransformerLens (formerly known as EasyTransformer) provides a clean interface for accessing model internals through "hooks" - callbacks that let you read or modify activations at any point in the forward pass.

💡 Best For

Learning, prototyping on smaller models (GPT-2, TinyLlama), and educational notebooks.

Installation

pip install transformer-lens

Quick Example

# Load a model with hooks
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

# Run with activation caching
logits, cache = model.run_with_cache("Hello world")

# Access any intermediate state
residual = cache["resid_post", 6]  # Layer 6 residual stream
attn_pattern = cache["pattern", 4]  # Layer 4 attention patterns

Key Concepts

Activation Caching

The cache object stores every intermediate computation. Use cache['hook_name'] to access specific activations.

Hook Points

Common hook names include:

  • resid_pre / resid_post - Residual stream before/after layer
  • pattern - Attention patterns
  • attn_out - Attention output
  • mlp_out - MLP output
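TransformerLens hook points are built on the same idea as PyTorch's forward hooks. As a minimal plain-PyTorch sketch of the mechanism (toy two-layer model; all names and sizes here are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

cache = {}

def cache_hook(module, inputs, output):
    # Store a copy of this layer's output, like TransformerLens's cache does
    cache["layer0_out"] = output.detach().clone()

handle = toy[0].register_forward_hook(cache_hook)
_ = toy(torch.randn(1, 4))
handle.remove()

print(cache["layer0_out"].shape)  # torch.Size([1, 8])
```

TransformerLens wraps every interesting tensor in a named hook point like this, which is what makes `run_with_cache` possible.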
Tool

NNsight (NDIF)

The production/scale layer for analyzing massive models that don't fit on your GPU.

Overview

Unlike TransformerLens, which reimplements models, NNsight wraps existing HuggingFace models and traces your interventions over them. This preserves the original implementation's exact numerics and enables remote execution on clusters.

⚠️ Key Difference

NNsight preserves the exact numerical precision of the original model, unlike TransformerLens which uses its own implementation.

Installation

pip install nnsight

Local Execution

For models that fit on your GPU:

from nnsight import LanguageModel

# Load locally (requires ~14GB VRAM for 7B model)
model = LanguageModel("meta-llama/Llama-2-7b-hf")

with model.trace("The capital of France is") as tracer:
    # Save activations from layer 15
    hidden = model.model.layers[15].output[0].save()
    
    # You can also modify activations!
    model.model.layers[15].output[0][:] += 0.1

print(hidden.value.shape)  # [1, seq_len, 4096]

Remote Execution (NDIF)

For massive models (70B, 405B), use NDIF's remote cluster:

from nnsight import LanguageModel

# Remote=True sends your intervention to NDIF servers
model = LanguageModel("meta-llama/Llama-3-405B", remote=True)

with model.trace("Explain quantum computing") as tracer:
    # Your code runs on the remote GPU cluster
    hidden = model.model.layers[50].output[0].save()
    
# Results are sent back to your machine
print(hidden.value.shape)
💡 NDIF Access

NDIF (National Deep Inference Fabric) provides free research access. Sign up at ndif.us.

The nnterp Bridge (2025)

New in late 2025: nnterp provides a TransformerLens-style API on top of NNsight:

pip install nnterp

from nnterp import load_model

# Familiar TransformerLens-style API, NNsight backend
model = load_model("meta-llama/Llama-3-8B")

# Works just like TransformerLens!
logits, cache = model.run_with_cache("Hello world")
residual = cache["resid_post", 10]

TransformerLens vs NNsight

| Feature | TransformerLens | NNsight |
| --- | --- | --- |
| Approach | Reimplements models | Wraps HuggingFace models |
| Precision | May differ slightly | Exact match |
| Scale | Up to ~7B models | Any size (via NDIF) |
| Learning curve | Easier API | More complex |
| Best for | Learning, prototyping | Production research |

Intervention Example

NNsight's power is in interventions - modifying activations during the forward pass:

import torch
from nnsight import LanguageModel

model = LanguageModel("gpt2")

# Create an "honesty" steering vector (random here for illustration;
# a real one would be extracted from contrastive prompts)
honesty_vector = torch.randn(768) * 0.5

with model.generate("I think the answer is", max_new_tokens=20) as tracer:
    # Read the original activation at layer 8
    original = model.transformer.h[8].output[0].clone().save()

    # Add our steering vector at the last token position
    model.transformer.h[8].output[0][:, -1, :] += honesty_vector

    steered = model.generator.output.save()

# The model's output is now "steered"
print(model.tokenizer.decode(steered[0]))
Concept

Sparse Autoencoders (SAEs)

The solution to superposition - disentangling polysemantic neurons into interpretable features.

The Problem: Superposition

Neural networks pack more concepts than they have neurons by storing them in "almost orthogonal" directions. This creates polysemantic neurons - neurons that fire for unrelated concepts like 'Bible verses' AND 'C++ code'.
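A quick numeric sketch of why this packing works: in high dimensions, random unit vectors are nearly orthogonal, so a model can store far more directions than it has neurons with only slight interference (the counts and dimensions below are arbitrary):

```python
import torch

torch.manual_seed(0)
n_concepts, d_neurons = 1000, 512          # more "concepts" than dimensions
vecs = torch.randn(n_concepts, d_neurons)
vecs = vecs / vecs.norm(dim=1, keepdim=True)

# Pairwise cosine similarities, excluding each vector with itself
cos = vecs @ vecs.T
off_diag = cos[~torch.eye(n_concepts, dtype=torch.bool)]

# All 1000 directions overlap only slightly with each other
print(off_diag.abs().max().item())
```

Superposition exploits exactly this geometry, which is also why no single neuron basis direction lines up cleanly with one concept.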

The Solution: SAEs

Sparse Autoencoders learn to decompose a dense activation vector into a sparse combination of interpretable features:

# Conceptually:
dense_activation = model.layer[10].output  # 4096 dims, polysemantic
sparse_features = sae.encode(dense_activation)  # 65536 dims, monosemantic
# Most features are ~0, only a few "fire"
🔬 Real Examples

Researchers have found SAE features for: "Golden Gate Bridge", "Python syntax", "Deception", "Bio-weapon knowledge", and thousands more.

How They Work

An SAE is trained to:

  1. Encode activations into a high-dimensional sparse space
  2. Reconstruct the original activation from the sparse code
  3. Minimize reconstruction error while maximizing sparsity
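The three steps above can be sketched as a toy objective in plain PyTorch (the dimensions and L1 coefficient are illustrative, not tuned values):

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    def __init__(self, d_in=64, d_sae=512):
        super().__init__()
        self.enc = nn.Linear(d_in, d_sae)   # step 1: encode to a wider space
        self.dec = nn.Linear(d_sae, d_in)   # step 2: reconstruct the input

    def forward(self, x):
        feats = torch.relu(self.enc(x))     # ReLU keeps most features at 0
        return self.dec(feats), feats

torch.manual_seed(0)
sae = TinySAE()
acts = torch.randn(32, 64)                  # stand-in for cached activations

recon, feats = sae(acts)

# Step 3: reconstruction error plus an L1 penalty encouraging sparsity
l1_coeff = 1e-3
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().sum(-1).mean()
loss.backward()
print(recon.shape, feats.shape)
```

Production SAEs (as in SAELens) add details like decoder weight normalization and dead-feature resampling, but the loss has this same shape.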
Tool

SAELens

The standard library for training and analyzing Sparse Autoencoders.

Installation

pip install sae-lens

Loading Pre-trained SAEs

SAELens provides access to pre-trained SAEs for popular models:

from sae_lens import SAE

# Load a pre-trained SAE for GPT-2 layer 8
sae = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre"
)

print(f"SAE dimensions: {sae.cfg.d_in} → {sae.cfg.d_sae}")
# Output: SAE dimensions: 768 → 24576
📦 Available SAEs

Pre-trained SAEs exist for GPT-2, Pythia, Gemma, Llama-2, and more. Check the SAELens repo for the full list.

Finding Features in Text

Connect SAELens with TransformerLens to find active features:

from transformer_lens import HookedTransformer
from sae_lens import SAE

# Load model and SAE
model = HookedTransformer.from_pretrained("gpt2-small")
sae = SAE.from_pretrained("gpt2-small-res-jb", "blocks.8.hook_resid_pre")

# Get activations
prompt = "The Golden Gate Bridge is located in"
_, cache = model.run_with_cache(prompt)
residual = cache["resid_pre", 8]  # [batch, seq, d_model]

# Encode through SAE
features = sae.encode(residual)  # [batch, seq, n_features]

# Find top active features at last position
last_pos_features = features[0, -1]  # [n_features]
top_features = last_pos_features.topk(10)

print("Top 10 active features:")
for idx, val in zip(top_features.indices, top_features.values):
    print(f"  Feature {idx.item()}: {val.item():.3f}")

Feature Interpretation

Once you find a feature, investigate what it means:

# Get the feature's decoder direction
feature_idx = 1234
feature_direction = sae.W_dec[feature_idx]  # [d_model]

# Project to vocabulary to see what tokens it represents
vocab_projection = feature_direction @ model.W_U
top_tokens = vocab_projection.topk(10)

print(f"Feature {feature_idx} most associated tokens:")
for idx, val in zip(top_tokens.indices, top_tokens.values):
    print(f"  {model.to_string(idx)!r}: {val.item():.3f}")

Training Your Own SAE

For custom models or unexplored layers:

from sae_lens import SAETrainingRunner, LanguageModelSAERunnerConfig

cfg = LanguageModelSAERunnerConfig(
    model_name="gpt2-small",
    hook_point="blocks.6.hook_resid_pre",
    d_in=768,
    expansion_factor=32,  # SAE will have 768 * 32 = 24576 features
    training_tokens=100_000_000,
    lr=3e-4,
)

runner = SAETrainingRunner(cfg)
sae = runner.run()  # This takes a few hours on a GPU

Integration with Neuronpedia

Neuronpedia is the "Wikipedia of Neurons" — a community platform for documenting SAE features. Upload your features for crowdsourced interpretation:

📤 Upload Features

Export your SAE features and their top activating examples.

🏷️ Community Labels

Researchers add interpretations like "Python syntax" or "Medical terminology".

🔍 Search & Explore

Browse features by model, layer, or semantic category.

Concept · Safety

Activation Steering

Control model behavior by injecting vectors during inference.

The Core Idea

If concepts are represented as directions in activation space (the Linear Representation Hypothesis), we can steer model behavior by adding or subtracting these directions.

# Pseudocode
honesty_vector = extract_direction(honest_prompts, deceptive_prompts)

# At inference time:
model.layer[15].output += honesty_vector * strength

Applications

🎭 Persona Control

Make models more/less formal, creative, or technical.

🔍 Truthfulness

Inference-Time Intervention (ITI) improves factual accuracy.

🛡️ Safety

Reduce harmful outputs without retraining.

Tool

RepEng (Representation Engineering)

Extract and apply control vectors for steering model behavior.

The Method

RepEng implements Representation Engineering — a technique for finding and manipulating concept directions in activation space.

🎯 Key Insight

If you can find a direction in activation space that represents "honesty", you can make the model more honest by adding that direction during inference.

Installation

pip install repeng

Step 1: Create Contrastive Prompts

Define pairs of prompts that differ only in the concept you want to extract:

# Pairs of (positive, negative) examples
honesty_pairs = [
    ("I will give you an honest answer:", 
     "I will give you a deceptive answer:"),
    ("Speaking truthfully,",
     "Speaking misleadingly,"),
    ("Let me be completely honest:",
     "Let me hide the truth:"),
    # Add 10-20 more pairs for better results
]

Step 2: Extract the Control Vector

from repeng import ControlVector, ControlModel, DatasetEntry
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Wrap the model so repeng can hook the layers it will steer
model = ControlModel(model, list(range(-5, -18, -1)))

# Build the dataset
dataset = [
    DatasetEntry(positive=pos, negative=neg)
    for pos, neg in honesty_pairs
]

# Extract the control vector
# (roughly: mean(positive activations) - mean(negative activations) per layer)
honesty_vector = ControlVector.train(model, tokenizer, dataset)

# Save for later use
honesty_vector.export_gguf("honesty_vector.gguf")

Step 3: Apply the Vector

# Load the vector (assumes model is wrapped in repeng's ControlModel)
honesty_vector = ControlVector.import_gguf("honesty_vector.gguf")

# Apply with a chosen strength
# Positive = more honest, negative = less honest (for testing)
model.set_control(honesty_vector, 1.5)

# Generate with the steered model
prompt = "What do you really think about..."
output = model.generate(
    **tokenizer(prompt, return_tensors="pt"),
    max_new_tokens=100,
)
print(tokenizer.decode(output[0]))

# Remove the steering when done
model.reset()

How It Works Internally

  1. Collect activations: Run positive and negative prompts through the model
  2. Compute difference: Subtract mean negative from mean positive at each layer
  3. Create direction: The resulting vector is the "concept direction"
  4. Apply at inference: Add scaled vector to activations during generation
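Steps 2-4 above reduce to a few lines of tensor arithmetic; here with random stand-ins for the collected activations:

```python
import torch

torch.manual_seed(0)
d_model = 768
pos_acts = torch.randn(20, d_model) + 0.5   # activations from positive prompts
neg_acts = torch.randn(20, d_model) - 0.5   # activations from negative prompts

# Steps 2-3: the mean difference is the concept direction
direction = pos_acts.mean(0) - neg_acts.mean(0)
direction = direction / direction.norm()     # commonly normalized before scaling

# Step 4: at inference, add the scaled direction to an activation
activation = torch.randn(1, d_model)
steered = activation + 2.0 * direction

print(direction.shape, steered.shape)
```

Libraries like repeng refine this (e.g. fitting a direction per layer, often via PCA over the paired differences), but the mean-difference picture is the core of it.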

Common Control Vectors

🎯 Honesty

Truthful vs deceptive responses

😊 Sentiment

Happy vs sad emotional tone

📚 Formality

Professional vs casual writing

🧠 Confidence

Certain vs uncertain claims

⚠️ Limitations

Control vectors work best for "style" attributes. Complex behaviors may require more sophisticated techniques like activation patching or SAE-based steering.

Safety

Safety Applications

How Mech Interp is being applied to AI Safety.

🎯 The Goal

Use interpretability to make AI systems safer: detect deception, remove dangerous capabilities, and ensure truthful outputs.

Circuit Breaking

Identify the specific sub-graph responsible for hazardous knowledge and ablate (delete) it, leaving other capabilities intact.

How It Works

  1. Identify: Use activation patching to find which components matter for harmful outputs
  2. Validate: Confirm the circuit is specific to the harmful behavior
  3. Ablate: Zero out or retrain those specific components
  4. Verify: Check that other capabilities remain intact
# Conceptual example: ablating a harmful circuit
with model.trace(harmful_prompt):
    # Zero out the attention heads identified as responsible
    model.blocks[12].attn.hook_z[:, :, [3, 7]] = 0
    
    # The model can no longer produce the harmful output
    output = model.generate(...)
⚠️ Challenges

Harmful capabilities may be distributed across many components, making surgical removal difficult without affecting beneficial behaviors.

Inference-Time Intervention (ITI)

Improve truthfulness not by retraining, but by shifting the model's "truth vector" during generation.

The Research

Li et al. (2023) found that LLMs have an internal "truth direction" — models know when they're outputting false information. ITI exploits this by:

  1. Identifying attention heads that correlate with truthfulness
  2. Computing the "truth direction" in those heads' activation space
  3. Adding this direction during inference to boost truthful outputs
# Simplified ITI sketch (truth_heads and alpha come from the probing step)
truth_direction = compute_truth_direction(model, truthful_vs_false_dataset)

def iti_hook(activation, hook):
    # activation: [batch, seq, n_heads, d_head]
    activation[:, :, truth_heads] += truth_direction * alpha
    return activation

# Apply to the attention output hook in every layer during generation
for layer in range(model.cfg.n_layers):
    model.add_hook(f"blocks.{layer}.attn.hook_z", iti_hook)

Deception Detection

SAE features can identify when a model is being deceptive, enabling "lie detectors" for AI systems.

The Approach

🔍

Find Deception Features

Train an SAE and identify features that activate when models give intentionally wrong answers.

📊

Monitor at Runtime

Track these features during deployment to flag potentially deceptive outputs.

🚨

Trigger Interventions

When deception features activate strongly, pause generation or apply corrective steering.
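The runtime-monitoring step can be sketched as a simple threshold check; the feature index and threshold below are hypothetical stand-ins for values a real system would read off a trained SAE:

```python
import torch

DECEPTION_FEATURE = 1234   # hypothetical feature index
THRESHOLD = 5.0            # hypothetical activation threshold

def flag_if_deceptive(features: torch.Tensor) -> bool:
    """features: [seq, n_features] SAE activations for a generated span."""
    return bool(features[:, DECEPTION_FEATURE].max() > THRESHOLD)

feats = torch.zeros(10, 4096)
feats[3, DECEPTION_FEATURE] = 7.2   # the feature fires strongly mid-sequence
print(flag_if_deceptive(feats))     # True
```

In deployment this check would run on SAE-encoded activations from the live forward pass, gating generation or triggering corrective steering.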

PyVene: Causal Interventions

A library for precise activation patching to prove that specific circuits cause specific behaviors.

pip install pyvene

import pyvene as pv

# Define an intervention: swap activations between prompts
config = pv.IntervenableConfig(
    representations=[
        pv.RepresentationConfig(
            layer=10,
            component="mlp_output",
        )
    ]
)

# Create the intervened model
intervened_model = pv.IntervenableModel(config, model)

# Patch activations from "source" prompt to "base" prompt
base_prompt = "The Eiffel Tower is in"
source_prompt = "The Colosseum is in"

_, patched_output = intervened_model(
    base=tokenizer(base_prompt, return_tensors="pt"),
    sources=[tokenizer(source_prompt, return_tensors="pt")]
)

# If output changes from "Paris" to "Rome", 
# this proves layer 10 MLP encodes the location!

The Safety Research Agenda

Active research areas in 2026:

  • Sleeper agents: Detecting hidden malicious behaviors that only trigger under specific conditions
  • Sycophancy: Identifying when models agree with users despite knowing better
  • Goal misgeneralization: Understanding what objectives models actually pursue
  • Scalable oversight: Using interpretability to verify model behavior at scale

Glossary

Key terms in Mechanistic Interpretability.

Core Concepts

Linear Representation Hypothesis

The theory that neural networks represent concepts as linear directions (vectors) in high-dimensional space. This is why vector arithmetic works (e.g., "king" - "man" + "woman" ≈ "queen").

Superposition

The phenomenon where models pack more concepts than neurons by storing them in almost-orthogonal directions. This is why individual neurons are often hard to interpret.

Polysemantic Neuron

A neuron that fires for multiple unrelated concepts due to superposition. Example: a neuron that activates for both "legal contracts" and "Bible verses".

Monosemantic Feature

A feature (typically from an SAE) that corresponds to a single, interpretable concept. The goal of SAE training is to find these.

Architecture Terms

Residual Stream

The main "highway" of information flow through a transformer. Each layer reads from and writes to this stream additively. Think of it as the model's "working memory" at each position.

Attention Head

A single "reader" in a multi-head attention layer. Each head can learn to look for different patterns (e.g., "previous token", "subject of sentence", "matching bracket").

Attention Pattern

The matrix showing how much each token "attends to" other tokens. Rows are queries (what am I looking for?), columns are keys (what do I have to offer?).
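In code, the pattern is the row-normalized query-key score matrix (toy dimensions here, and no causal mask for brevity):

```python
import torch

torch.manual_seed(0)
seq_len, d_head = 4, 8
Q = torch.randn(seq_len, d_head)   # queries: what each position looks for
K = torch.randn(seq_len, d_head)   # keys: what each position offers

# Scaled dot-product scores, softmaxed over keys
pattern = torch.softmax(Q @ K.T / d_head ** 0.5, dim=-1)  # [query, key]
print(pattern.sum(dim=-1))  # each row sums to 1
```

This is the matrix that tools like CircuitsVis render, with one such pattern per attention head.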

MLP Layer (Feed-Forward)

The two-layer neural network in each transformer block that processes each position independently. Often thought to store "facts" and "knowledge".

Circuits & Features

Circuit

A subgraph of the model that implements a specific behavior. Example: the "induction circuit" that predicts repeated patterns.

Induction Heads

A specific circuit that copies information from the past. If the model sees "Harry Potter" before, and sees "Harry" again, the induction head helps predict "Potter".

Cross-Coders

A 2025 method to map features between different models. Helps answer: "Does the 'math' feature in Llama look like 'math' in Claude?"

Techniques

Activation Caching

Recording all intermediate activations during a forward pass for later analysis. TransformerLens's run_with_cache() returns a cache object with all activations.

Activation Patching

A causal intervention technique where you swap activations between two different prompts to prove that a specific component causes a specific behavior.

Ablation

Setting a component's output to zero (or its mean) to see what breaks. If ablating attention head 3 in layer 7 stops the model predicting "Paris", that head is important for that behavior.

Hook

A callback function that runs during the forward pass, allowing you to read or modify activations at any point. TransformerLens uses hooks extensively.

Logit Attribution

Decomposing the final logits into contributions from each component (attention heads, MLPs) to see "who" is responsible for each prediction.

Direct Logit Attribution (DLA)

Computing how much each component's output directly contributes to the logit of a specific token, using the unembedding matrix.
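As a sketch, that direct contribution is a single dot product with the token's unembedding column (toy sizes, random weights; a real computation would use the model's actual `W_U` and a component's residual-stream output):

```python
import torch

torch.manual_seed(0)
d_model, d_vocab = 16, 100
W_U = torch.randn(d_model, d_vocab)    # unembedding matrix
head_out = torch.randn(d_model)        # one component's residual-stream output

token_id = 42
direct_logit = head_out @ W_U[:, token_id]  # this component's contribution
print(float(direct_logit))
```

Summing these contributions over all components (plus the embedding path) recovers the final logit, which is what makes the attribution "direct".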

Resources

Essential links for learning Mech Interp.

Core Libraries

📦 TransformerLens

The standard library for exploring model internals with hooks and caching.

NNsight

Production-scale analysis with remote execution on massive models.

🧩 SAELens

Train and analyze Sparse Autoencoders.

🧭 RepEng

Extract and apply control vectors for steering.

🔬 PyVene

Causal interventions and activation patching.

🎨 CircuitsVis

Interactive visualizations in Jupyter notebooks.

Key Papers

Tutorials & Courses

Community

Reference Models

Popular models for Mech Interp research (with pre-trained SAEs available):

  • GPT-2 Small — 124M params, the "fruit fly" of Mech Interp
  • Pythia — Suite of models with checkpoints for studying training dynamics
  • Gemma-2 — Google's open models with good interpretability properties
  • Llama-3 — Meta's open models, from 8B to 405B