<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Learn Mechanistic Interpretability</title>
  <subtitle>Structured articles covering transformer internals, interpretability techniques, and frontier research.</subtitle>
  <link href="https://learnmechinterp.com/feed.xml" rel="self" type="application/atom+xml"/>
  <link href="https://learnmechinterp.com/" rel="alternate" type="text/html"/>
  <updated>2026-02-23T05:40:39Z</updated>
  <id>https://learnmechinterp.com/</id>
  <author>
    <name>Learn MI</name>
  </author>
  <entry>
    <title>Getting Started in MI Research</title>
    <link href="https://learnmechinterp.com/topics/getting-started-in-mi-research/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/getting-started-in-mi-research/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A guide to Neel Nanda&#39;s opinionated roadmap for building skills in mechanistic interpretability research: from learning the basics to publishing your first paper.</summary>
  </entry>
  <entry>
    <title>ARENA: Hands-On Technical Training</title>
    <link href="https://learnmechinterp.com/topics/arena/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/arena/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The Alignment Research Engineer Accelerator -- an intensive program and open curriculum for building practical mechanistic interpretability skills through coding exercises.</summary>
  </entry>
  <entry>
    <title>SAELens and Neuronpedia</title>
    <link href="https://learnmechinterp.com/topics/saelens-and-neuronpedia/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/saelens-and-neuronpedia/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The primary tools for working with sparse autoencoders in practice: SAELens for training, loading, and analyzing SAEs programmatically, and Neuronpedia for interactive exploration of SAE features.</summary>
  </entry>
  <entry>
    <title>nnsight and nnterp</title>
    <link href="https://learnmechinterp.com/topics/nnsight-and-nnterp/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/nnsight-and-nnterp/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Two complementary tools for interpretability research that work directly with HuggingFace models -- nnsight for flexible model inspection and intervention, and nnterp for a standardized interface across architectures.</summary>
  </entry>
  <entry>
    <title>TransformerLens</title>
    <link href="https://learnmechinterp.com/topics/transformerlens/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/transformerlens/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The standard Python library for mechanistic interpretability research -- its history, design philosophy, core abstractions, and practical usage.</summary>
  </entry>
  <entry>
    <title>Honest Limitations of MI for Safety</title>
    <link href="https://learnmechinterp.com/topics/mi-safety-limitations/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/mi-safety-limitations/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A candid assessment of what mechanistic interpretability cannot yet do for AI safety -- from interpretability illusions to scalability gaps -- and why closing these gaps matters.</summary>
  </entry>
  <entry>
    <title>Understanding Safety Mechanisms and MI-Based Monitoring</title>
    <link href="https://learnmechinterp.com/topics/safety-mechanisms-and-monitoring/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/safety-mechanisms-and-monitoring/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How MI tools enable monitoring model behavior through internal representations, from refusal mechanisms to misalignment detection, and the practical limitations of these approaches.</summary>
  </entry>
  <entry>
    <title>Deception Detection and Alignment Faking</title>
    <link href="https://learnmechinterp.com/topics/deception-detection/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/deception-detection/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The alignment faking threat and why behavioral evaluations fail to detect strategic deception, with early evidence that internal probes may succeed where behavior-based approaches cannot.</summary>
  </entry>
  <entry>
    <title>Detecting Sleeper Agents with Mechanistic Interpretability</title>
    <link href="https://learnmechinterp.com/topics/sleeper-agent-detection/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/sleeper-agent-detection/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How MI tools can detect artificially trained backdoor behaviors in language models, and the critical limitation that this does not extend to naturally arising deception.</summary>
  </entry>
  <entry>
    <title>The Refusal Direction</title>
    <link href="https://learnmechinterp.com/topics/refusal-direction/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/refusal-direction/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How a single direction in activation space controls whether a language model refuses harmful requests, and what this reveals about the geometry of safety behavior.</summary>
  </entry>
  <entry>
    <title>Counterfactual Resampling</title>
    <link href="https://learnmechinterp.com/topics/counterfactual-resampling/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/counterfactual-resampling/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A black-box technique for measuring which reasoning steps in a chain-of-thought trace actually influence the model&#39;s final answer, by deleting individual steps and comparing the resulting answer distributions.</summary>
  </entry>
  <entry>
    <title>Copy Suppression</title>
    <link href="https://learnmechinterp.com/topics/copy-suppression/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/copy-suppression/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>An algorithm pattern where attention heads suppress predictions of tokens that appear earlier in context, unifying previously mysterious head behaviors like negative name movers and anti-induction heads.</summary>
  </entry>
  <entry>
    <title>Multimodal Mechanistic Interpretability</title>
    <link href="https://learnmechinterp.com/topics/multimodal-mi/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/multimodal-mi/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Extending mechanistic interpretability beyond text to vision models, multimodal systems, and diffusion models -- what works, what breaks, and what remains unknown.</summary>
  </entry>
  <entry>
    <title>Universality Across Models</title>
    <link href="https://learnmechinterp.com/topics/universality/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/universality/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The evidence for and against the universality hypothesis -- whether different neural networks learn similar features and circuits -- and the metrics used to measure representation similarity.</summary>
  </entry>
  <entry>
    <title>Circuit Tracing and Attribution Graphs</title>
    <link href="https://learnmechinterp.com/topics/circuit-tracing/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/circuit-tracing/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How sparse feature circuits and attribution graphs enable tracing model computations at the feature level, from SAE-based circuits to Anthropic&#39;s Biology of an LLM approach.</summary>
  </entry>
  <entry>
    <title>Circuit Evaluation: Faithfulness, Completeness, and Minimality</title>
    <link href="https://learnmechinterp.com/topics/circuit-evaluation/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/circuit-evaluation/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How to evaluate whether a discovered circuit is a faithful, complete, and minimal explanation of model behavior, with lessons from the IOI circuit&#39;s surprising components.</summary>
  </entry>
  <entry>
    <title>The IOI Circuit: Discovery and Mechanism</title>
    <link href="https://learnmechinterp.com/topics/ioi-circuit/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/ioi-circuit/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How researchers reverse-engineered the complete algorithm GPT-2 Small uses for indirect object identification, discovering five classes of attention heads working in concert.</summary>
  </entry>
  <entry>
    <title>Induction Heads and In-Context Learning</title>
    <link href="https://learnmechinterp.com/topics/induction-heads/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/induction-heads/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How the discovery of induction heads revealed a two-step circuit for in-context learning, demonstrating that compositional circuits emerge during training.</summary>
  </entry>
  <entry>
    <title>Activation Oracles</title>
    <link href="https://learnmechinterp.com/topics/activation-oracles/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/activation-oracles/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>General-purpose activation explainers trained on diverse interpretation tasks, capable of recovering information from fine-tuned models and matching white-box baselines through natural language interrogation.</summary>
  </entry>
  <entry>
    <title>LatentQA and Latent Interpretation Tuning</title>
    <link href="https://learnmechinterp.com/topics/latentqa/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/latentqa/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Framing activation interpretation as question-answering: training decoder models on paired datasets of activations and Q&amp;A to enable open-ended queries about what representations encode.</summary>
  </entry>
  <entry>
    <title>Training Models to Explain Their Computations</title>
    <link href="https://learnmechinterp.com/topics/training-self-explanation/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/training-self-explanation/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Fine-tuning language models to generate natural language descriptions of their internal processes, with a key finding: models explain their own computations better than external models can.</summary>
  </entry>
  <entry>
    <title>SelfIE: Self-Interpretation of Embeddings</title>
    <link href="https://learnmechinterp.com/topics/selfie-interpretation/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/selfie-interpretation/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A method enabling language models to interpret their own hidden embeddings by converting them into natural language descriptions, with applications to understanding ethical reasoning, detecting prompt injection, and controlling model behavior.</summary>
  </entry>
  <entry>
    <title>Patchscopes</title>
    <link href="https://learnmechinterp.com/topics/patchscopes/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/patchscopes/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A unifying framework for inspecting hidden representations by patching activations into target prompts designed to elicit natural language descriptions of their content.</summary>
  </entry>
  <entry>
    <title>Hidden State Decoding: From Vectors to Language</title>
    <link href="https://learnmechinterp.com/topics/hidden-state-decoding-intro/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/hidden-state-decoding-intro/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>An introduction to LLM-based activation interpretation: using language models themselves to decode their hidden representations into human-readable natural language.</summary>
  </entry>
  <entry>
    <title>Finetuning Traces in Activations</title>
    <link href="https://learnmechinterp.com/topics/finetuning-traces/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/finetuning-traces/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How narrow finetuning creates detectable biases in model activations that encode the finetuning domain, enabling steering, automated interpretation, and auditing of what a model was trained on.</summary>
  </entry>
  <entry>
    <title>Feature-Level Model Diffing</title>
    <link href="https://learnmechinterp.com/topics/feature-level-model-diffing/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/feature-level-model-diffing/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How crosscoders compare base and fine-tuned models at the feature level, what sparsity artifacts distort the results, and how BatchTopK crosscoders find genuinely chat-specific features.</summary>
  </entry>
  <entry>
    <title>Logit Diff Amplification</title>
    <link href="https://learnmechinterp.com/topics/logit-diff-amplification/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/logit-diff-amplification/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How amplifying the logit-level differences between two model checkpoints can surface rare undesired behaviors that standard sampling would almost never find.</summary>
  </entry>
  <entry>
    <title>Crosscoders</title>
    <link href="https://learnmechinterp.com/topics/crosscoders/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/crosscoders/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How crosscoders extend sparse autoencoders to train shared feature dictionaries across layers or models, enabling cross-layer feature tracking, circuit simplification, and model comparison.</summary>
  </entry>
  <entry>
    <title>Transcoders: Interpretable MLP Replacements</title>
    <link href="https://learnmechinterp.com/topics/transcoders/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/transcoders/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How transcoders replace opaque MLP layers with sparse interpretable alternatives, enabling feature-level circuit analysis of what MLPs compute.</summary>
  </entry>
  <entry>
    <title>Feature Geometry: Beyond One-Dimensional Directions</title>
    <link href="https://learnmechinterp.com/topics/feature-geometry/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/feature-geometry/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How categorical concepts form polytopes, periodic features trace circles, and hierarchical relationships map to orthogonal subspaces -- revealing that feature geometry in representation space is far richer than single directions.</summary>
  </entry>
  <entry>
    <title>SAE Variants, Evaluation, and Honest Limitations</title>
    <link href="https://learnmechinterp.com/topics/sae-variants-and-evaluation/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/sae-variants-and-evaluation/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The landscape of SAE architectures beyond vanilla L1 -- Gated, TopK, and JumpReLU SAEs -- how to evaluate them with SAEBench, and the honest limitations that remain unsolved.</summary>
  </entry>
  <entry>
    <title>Scaling Monosemanticity and Feature Steering</title>
    <link href="https://learnmechinterp.com/topics/scaling-monosemanticity/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/scaling-monosemanticity/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How scaling sparse autoencoders to millions of features revealed multilingual, multimodal, and abstract concepts -- and how clamping these features enables steering model behavior.</summary>
  </entry>
  <entry>
    <title>Feature Dashboards and Automated Interpretability</title>
    <link href="https://learnmechinterp.com/topics/sae-interpretability/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/sae-interpretability/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How researchers visualize and interpret the features SAEs discover, from manual feature dashboards to automated LLM-based interpretation.</summary>
  </entry>
  <entry>
    <title>Sparse Autoencoders: Decomposing Superposition</title>
    <link href="https://learnmechinterp.com/topics/sparse-autoencoders/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/sparse-autoencoders/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How sparse autoencoders learn an overcomplete dictionary of monosemantic features, decomposing the polysemantic activations that superposition creates.</summary>
  </entry>
  <entry>
    <title>Concept Erasure with LEACE</title>
    <link href="https://learnmechinterp.com/topics/concept-erasure/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/concept-erasure/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How LEACE provides a mathematically guaranteed method for erasing specific concepts from model representations, going beyond simple ablation with formal guarantees.</summary>
  </entry>
  <entry>
    <title>Localized Fact Editing and Its Pitfalls</title>
    <link href="https://learnmechinterp.com/topics/fact-editing/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/fact-editing/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Techniques for editing specific facts in model weights via rank-one updates, why the insertion-versus-editing flaw undermines the approach, and what this teaches about interpretability rigor.</summary>
  </entry>
  <entry>
    <title>Multi-Layer Steering</title>
    <link href="https://learnmechinterp.com/topics/multi-layer-steering/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/multi-layer-steering/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How distributing steering interventions across multiple transformer layers overcomes the fragility of single-layer steering, from fixed depth schedules to principled layer selection to learned per-layer weights.</summary>
  </entry>
  <entry>
    <title>Unsupervised Steering Vectors</title>
    <link href="https://learnmechinterp.com/topics/unsupervised-steering-vectors/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/unsupervised-steering-vectors/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How optimization-based methods discover steering vectors without contrast pairs, finding latent behaviors the researcher did not anticipate by maximizing activation change across layers.</summary>
  </entry>
  <entry>
    <title>Function Vectors</title>
    <link href="https://learnmechinterp.com/topics/function-vectors/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/function-vectors/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How in-context learning examples create natural directions in activation space that encode entire tasks, showing that models represent functions, not just features.</summary>
  </entry>
  <entry>
    <title>Representation Control</title>
    <link href="https://learnmechinterp.com/topics/representation-control/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/representation-control/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The unified framework for steering model behavior through interventions on internal representations, encompassing addition and ablation as complementary operations.</summary>
  </entry>
  <entry>
    <title>Affine Steering</title>
    <link href="https://learnmechinterp.com/topics/affine-steering/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/affine-steering/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How combining directional ablation with an affine correction and tunable addition unifies addition and ablation steering into a single, more precise intervention.</summary>
  </entry>
  <entry>
    <title>Ablation Steering</title>
    <link href="https://learnmechinterp.com/topics/ablation-steering/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/ablation-steering/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How projecting out directions from model activations can disable specific behaviors, demonstrating the causal necessity of concept representations.</summary>
  </entry>
  <entry>
    <title>Addition Steering</title>
    <link href="https://learnmechinterp.com/topics/addition-steering/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/addition-steering/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How adding carefully computed steering vectors to model activations during inference can shift model behavior without any fine-tuning.</summary>
  </entry>
  <entry>
    <title>Probes in Production</title>
    <link href="https://learnmechinterp.com/topics/probes-in-production/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/probes-in-production/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How activation probes move from research tools to deployed safety systems: the distribution-shift problem with long-context inputs, novel aggregation architectures like MultiMax, cascade designs that pair cheap probes with expensive classifiers, and the ensembling and training strategies that make probes reliable at scale.</summary>
  </entry>
  <entry>
    <title>Attention Probes</title>
    <link href="https://learnmechinterp.com/topics/attention-probes/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/attention-probes/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How learned attention mechanisms inside probes solve the sequence aggregation problem, letting the probe decide which token positions matter for classification instead of relying on mean pooling or last-token heuristics.</summary>
  </entry>
  <entry>
    <title>Linear Artificial Tomography (LAT)</title>
    <link href="https://learnmechinterp.com/topics/lat-probing/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/lat-probing/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How to read what concepts a model represents by training linear classifiers on activations, following the population-level approach from cognitive neuroscience.</summary>
  </entry>
  <entry>
    <title>Contrastive Activation Addition (CAA)</title>
    <link href="https://learnmechinterp.com/topics/caa-method/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/caa-method/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How to compute robust steering vectors by averaging activation differences across many contrast pairs, isolating the shared direction corresponding to a target concept.</summary>
  </entry>
  <entry>
    <title>Truthfulness Probing and the Geometry of Truth</title>
    <link href="https://learnmechinterp.com/topics/truthfulness-probing/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/truthfulness-probing/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How probing techniques reveal that truth and falsehood have linear geometric structure inside language models, from unsupervised truth discovery (CCS) to optimization-free difference-in-means probes, causal validation via intervention, and the probe-then-steer pipeline (ITI) that connects probing to steering.</summary>
  </entry>
  <entry>
    <title>Probing Classifiers</title>
    <link href="https://learnmechinterp.com/topics/probing-classifiers/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/probing-classifiers/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How simple classifiers trained on model activations reveal what information is encoded in representations, from structural probes to MDL probing, and the fundamental gap between correlation and causation.</summary>
  </entry>
  <entry>
    <title>The Causal Abstraction Framework</title>
    <link href="https://learnmechinterp.com/topics/causal-abstraction/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/causal-abstraction/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The theoretical framework that unifies activation patching, probing, circuit analysis, and other MI methods as special cases of one idea: testing whether a high-level causal model faithfully describes a neural network&#39;s computation via interchange interventions.</summary>
  </entry>
  <entry>
    <title>Self-Repair in Language Models</title>
    <link href="https://learnmechinterp.com/topics/self-repair/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/self-repair/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>When ablating a model component, later components compensate and partially restore the original behavior. Understanding self-repair is essential for correctly interpreting ablation experiments and activation patching results.</summary>
  </entry>
  <entry>
    <title>Refined Attribution Methods</title>
    <link href="https://learnmechinterp.com/topics/refined-attribution-methods/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/refined-attribution-methods/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How AtP*, EAP-IG, and EAP-GP fix the failure modes of gradient-based circuit discovery, from attention saturation and effect cancellation to zero-gradient regions.</summary>
  </entry>
  <entry>
    <title>Attribution Patching and Path Patching</title>
    <link href="https://learnmechinterp.com/topics/attribution-patching/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/attribution-patching/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Efficient gradient-based approximations to activation patching, and path patching for tracing information flow along specific edges in the computational graph.</summary>
  </entry>
  <entry>
    <title>Activation Patching and Causal Interventions</title>
    <link href="https://learnmechinterp.com/topics/activation-patching/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/activation-patching/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The primary technique for establishing causal claims about model internals: replace an activation and measure what changes. Covers the clean/corrupted framework, noising vs denoising, choosing a metric, and interpreting results.</summary>
  </entry>
  <entry>
    <title>Reading the Attention Patterns</title>
    <link href="https://learnmechinterp.com/topics/reading-attention-patterns/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/reading-attention-patterns/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How to visualize and interpret attention patterns to understand what information heads are moving, from previous token heads to induction heads.</summary>
  </entry>
  <entry>
    <title>Direct Logit Attribution</title>
    <link href="https://learnmechinterp.com/topics/direct-logit-attribution/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/direct-logit-attribution/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How to decompose a model&#39;s output into per-component contributions by projecting each attention head&#39;s output onto the logit difference direction.</summary>
  </entry>
  <entry>
    <title>The Logit Lens and Tuned Lens</title>
    <link href="https://learnmechinterp.com/topics/logit-lens-and-tuned-lens/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/logit-lens-and-tuned-lens/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Vocabulary projection methods for reading model internals: the logit lens projects intermediate residual streams to vocabulary space, and the tuned lens corrects for basis changes across layers.</summary>
  </entry>
  <entry>
    <title>The Superposition Hypothesis</title>
    <link href="https://learnmechinterp.com/topics/superposition/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/superposition/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How neural networks represent more features than dimensions by encoding them as nearly-orthogonal directions, why this makes interpretability hard, and what the toy model reveals about when superposition occurs.</summary>
  </entry>
  <entry>
    <title>The Linear Representation Hypothesis</title>
    <link href="https://learnmechinterp.com/topics/linear-representation-hypothesis/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/linear-representation-hypothesis/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Why neural networks appear to represent concepts as linear directions in activation space, and why individual neurons fail as units of analysis.</summary>
  </entry>
  <entry>
    <title>What is Interpretability?</title>
    <link href="https://learnmechinterp.com/topics/what-is-mech-interp/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/what-is-mech-interp/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The landscape of neural network interpretability approaches and the three core claims that define mechanistic interpretability as a field.</summary>
  </entry>
  <entry>
    <title>Decoding Strategies</title>
    <link href="https://learnmechinterp.com/topics/decoding-strategies/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/decoding-strategies/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How transformer logits become text: greedy decoding, temperature scaling, top-k, nucleus sampling, and beam search, and why MI research mostly studies the forward pass directly.</summary>
  </entry>
  <entry>
    <title>Composition and Virtual Attention Heads</title>
    <link href="https://learnmechinterp.com/topics/composition-and-virtual-heads/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/composition-and-virtual-heads/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How attention heads compose across layers through V-, K-, and Q-composition, creating virtual attention heads with capabilities no single head possesses.</summary>
  </entry>
  <entry>
    <title>QK and OV Circuits</title>
    <link href="https://learnmechinterp.com/topics/qk-ov-circuits/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/qk-ov-circuits/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How attention heads decompose into independent QK (matching) and OV (copying) circuits through the low-rank factorization of weight matrices.</summary>
  </entry>
  <entry>
    <title>Layer Normalization</title>
    <link href="https://learnmechinterp.com/topics/layer-normalization/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/layer-normalization/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How layer normalization stabilizes transformer training, why it introduces a nonlinearity that complicates mechanistic interpretability, and the practical strategies researchers use to work around it.</summary>
  </entry>
  <entry>
    <title>MLPs in Transformers</title>
    <link href="https://learnmechinterp.com/topics/mlps-in-transformers/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/mlps-in-transformers/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How feed-forward layers work as key-value memories that store factual knowledge, promote interpretable concepts in vocabulary space, and orchestrate a multi-stage pipeline for factual recall.</summary>
  </entry>
  <entry>
    <title>The Attention Mechanism</title>
    <link href="https://learnmechinterp.com/topics/attention-mechanism/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/attention-mechanism/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How transformers enable tokens to communicate through queries, keys, and values: the attention equation, causal masking, multi-head attention, and the QK/OV decomposition that underlies mechanistic interpretability.</summary>
  </entry>
  <entry>
    <title>Transformer Architecture Intro</title>
    <link href="https://learnmechinterp.com/topics/transformer-architecture/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/transformer-architecture/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A guided walkthrough of the full transformer stack: embeddings, attention, MLPs, layer norm, residual stream, and positional information.</summary>
  </entry>
  <entry>
    <title>Prerequisites</title>
    <link href="https://learnmechinterp.com/topics/mi-prerequisites/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/mi-prerequisites/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A short refresher on neural networks, backpropagation, and linear algebra notation used throughout the course.</summary>
  </entry>
</feed>
