Learn MI
Topics
Transformer Foundations
Prerequisites
Transformer Architecture Intro
The Attention Mechanism
MLPs in Transformers
Layer Normalization
QK and OV Circuits
Composition and Virtual Attention Heads
Decoding Strategies
Interpretability Fundamentals
What is Interpretability?
The Linear Representation Hypothesis
The Superposition Hypothesis
Basic Techniques
The Logit Lens and Tuned Lens
Direct Logit Attribution
Reading the Attention Patterns
Causal Interventions
Activation Patching and Causal Interventions
Attribution Patching and Path Patching
Refined Attribution Methods
Self-Repair in Language Models
The Causal Abstraction Framework
Probing
Probing Classifiers
Truthfulness Probing and the Geometry of Truth
Contrastive Activation Addition (CAA)
Linear Artificial Tomography (LAT)
Attention Probes
Probes in Production
Steering
Addition Steering
Ablation Steering
Affine Steering
Representation Control
Function Vectors
Unsupervised Steering Vectors
Multi-Layer Steering
Model Editing
Localized Fact Editing and Its Pitfalls
Concept Erasure with LEACE
Superposition & Feature Extraction
Sparse Autoencoders: Decomposing Superposition
Feature Dashboards and Automated Interpretability
Scaling Monosemanticity and Feature Steering
SAE Variants, Evaluation, and Honest Limitations
Feature Geometry: Beyond One-Dimensional Directions
Transcoders: Interpretable MLP Replacements
Crosscoders
Model Diffing
Logit Diff Amplification
Feature-Level Model Diffing
Finetuning Traces in Activations
Hidden State Decoding
Hidden State Decoding: From Vectors to Language
Patchscopes
SelfIE: Self-Interpretation of Embeddings
Training Models to Explain Their Computations
LatentQA and Latent Interpretation Tuning
Activation Oracles
Circuit Finding
Induction Heads and In-Context Learning
The IOI Circuit: Discovery and Mechanism
Circuit Evaluation: Faithfulness, Completeness, and Minimality
Circuit Tracing and Attribution Graphs
Universality Across Models
Multimodal Mechanistic Interpretability
Copy Suppression
Black-Box Interpretability
Counterfactual Resampling
MI for AI Safety
The Refusal Direction
Detecting Sleeper Agents with Mechanistic Interpretability
Deception Detection and Alignment Faking
Understanding Safety Mechanisms and MI-Based Monitoring
Honest Limitations of MI for Safety
Tools
TransformerLens
nnsight and nnterp
SAELens and Neuronpedia
More Resources
ARENA: Hands-On Technical Training
Getting Started in MI Research