Learn MI
Topics
Transformer Foundations
Prerequisites
Transformer Architecture Intro
The Attention Mechanism
MLPs in Transformers
Layer Normalization
QK and OV Circuits
Composition and Virtual Attention Heads
Decoding Strategies
Interpretability Fundamentals
What is Interpretability?
The Linear Representation Hypothesis
The Superposition Hypothesis
Basic Techniques
The Logit Lens and Tuned Lens
Direct Logit Attribution
Reading the Attention Patterns
Causal Interventions
Activation Patching and Causal Interventions
Attribution Patching and Path Patching
Refined Attribution Methods
Self-Repair in Language Models
The Causal Abstraction Framework
Probing
Probing Classifiers
Truthfulness Probing and the Geometry of Truth
Contrastive Activation Addition (CAA)
Linear Artificial Tomography (LAT)
Attention Probes
Probes in Production
Steering
Addition Steering
Ablation Steering
Affine Steering
Representation Control
Function Vectors
Unsupervised Steering Vectors
Multi-Layer Steering
Model Editing
Localized Fact Editing and Its Pitfalls
Concept Erasure with LEACE
Superposition & Feature Extraction
Sparse Autoencoders: Decomposing Superposition
Feature Dashboards and Automated Interpretability
Scaling Monosemanticity and Feature Steering
SAE Variants, Evaluation, and Honest Limitations
Feature Geometry: Beyond One-Dimensional Directions
Transcoders: Interpretable MLP Replacements
Crosscoders
Model Diffing
Logit Diff Amplification
Feature-Level Model Diffing
Finetuning Traces in Activations
Hidden State Decoding
Hidden State Decoding: From Vectors to Language
Patchscopes
SelfIE: Self-Interpretation of Embeddings
Training Models to Explain Their Computations
LatentQA and Latent Interpretation Tuning
Activation Oracles
Circuit Finding
Induction Heads and In-Context Learning
The IOI Circuit: Discovery and Mechanism
Circuit Evaluation: Faithfulness, Completeness, and Minimality
Circuit Tracing and Attribution Graphs
Universality Across Models
Multimodal Mechanistic Interpretability
Copy Suppression
Black-Box Interpretability
Counterfactual Resampling
MI for AI Safety
The Refusal Direction
Detecting Sleeper Agents with Mechanistic Interpretability
Deception Detection and Alignment Faking
Understanding Safety Mechanisms and MI-Based Monitoring
Honest Limitations of MI for Safety
Tools
TransformerLens
nnsight and nnterp
SAELens and Neuronpedia
More Resources
ARENA: Hands-On Technical Training
Getting Started in MI Research