MI Glossary

Key terms in mechanistic interpretability with links to articles where they are discussed.

A

Activation Cache
A dictionary storing all intermediate activations from a model's forward pass, keyed by HookPoint name. Enables post-hoc inspection of every computation the model performed on a given input. See: TransformerLens
Activation Difference Lens
A model diffing technique that interprets the average activation difference between a finetuned model and its base model on early tokens of unrelated text, using tools like Patchscope and steering to reveal information about the finetuning domain. See: Finetuning Traces in Activations
Activation Patching
A causal intervention method where activations from a clean run are substituted into a corrupted run (or vice versa) at specific model components, revealing which components are causally important for a behavior. See: Activation Patching and Causal Interventions
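The clean-to-corrupted substitution can be sketched on a toy network. Everything here (the two-layer model, the `forward` helper, the patch dictionary) is illustrative scaffolding, not a real library API; in practice this is done with hooks on a real transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network standing in for a transformer: x -> h = relu(W1 x) -> logits = W2 h.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

def forward(x, patch=None):
    """Run the toy model; `patch` optionally maps hidden-unit index -> activation value."""
    h = np.maximum(W1 @ x, 0.0)
    if patch is not None:
        for i, val in patch.items():
            h[i] = val                    # causal intervention: overwrite one activation
    return W2 @ h, h

x_clean = np.array([1.0, 0.0, 0.0])
x_corrupt = np.array([0.0, 1.0, 0.0])

logits_clean, h_clean = forward(x_clean)        # cache the clean activations
logits_corrupt, h_corrupt = forward(x_corrupt)  # corrupted baseline

# Patch unit 0's clean activation into the corrupted run and measure the shift.
logits_patched, _ = forward(x_corrupt, patch={0: h_clean[0]})
effect = logits_patched - logits_corrupt
print("effect of patching unit 0:", effect)
```

A large `effect` indicates that unit 0 carries information that matters causally for the output; patching every unit would fully restore the clean logits.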
Affine Concept Editing (ACE)
A steering intervention that erases the component along a concept direction, re-centers at the null-behavior mean, and adds a tunable amount of the direction back. Generalizes both addition steering and directional ablation as special cases. See: Affine Steering
Alignment Faking
A scenario where a model behaves as if aligned during training or evaluation while harboring different objectives internally. Detecting alignment faking is a key motivation for mechanistic interpretability safety research. See: Deception Detection and Alignment Faking
Attention Head
An individual attention computation within a multi-head attention layer. Each head independently computes attention patterns over the input sequence and produces a weighted combination of value vectors. See: The Attention Mechanism
Attention Pattern
The matrix of attention weights produced by an attention head, showing how much each token position attends to every other position. Visualizing attention patterns is a foundational interpretability technique. See: The Attention Mechanism
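A minimal numpy sketch of how such a pattern is computed for one head, with random stand-ins for the query and key vectors and a causal mask so no position attends to the future:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_head = 5, 8

# Random stand-ins for the per-position query and key vectors (W_Q x and W_K x).
Q = rng.normal(size=(seq_len, d_head))
K = rng.normal(size=(seq_len, d_head))

scores = Q @ K.T / np.sqrt(d_head)                    # scaled dot-product scores
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                                # causal mask: hide future positions

pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
pattern /= pattern.sum(axis=-1, keepdims=True)        # softmax over key positions

print(pattern.round(2))                               # each row sums to 1; upper triangle is 0
```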
Attention Probe
A probing classifier that uses a learned attention mechanism to aggregate per-token hidden states into a single representation for classification, replacing fixed pooling strategies like mean pooling or last-token selection. See: Attention Probes
Attribution Graph
A computational graph produced by circuit tracing that maps how information flows through a model, showing which features and connections contribute to a specific output. See: Circuit Tracing and Attribution Graphs
Attribution Patching
A linearized approximation of activation patching that uses gradients to estimate the causal effect of patching each component, making it computationally feasible to scan all components in a single forward and backward pass. See: Attribution Patching and Path Patching
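The linear estimate is (clean − corrupted) activation times the metric's gradient. In the toy sketch below the downstream map is linear, so the gradient is analytic and the estimate matches true patching exactly; in a real model the gradient comes from autograd and the estimate is approximate. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
W2 = rng.normal(size=(2, 4))              # readout standing in for everything downstream

def metric(h):
    logits = W2 @ h
    return logits[0] - logits[1]          # logit-difference metric

h_clean = rng.normal(size=4)
h_corrupt = rng.normal(size=4)

# Gradient of the metric wrt the activation (analytic here; backprop in practice).
grad = W2[0] - W2[1]

# Attribution patching: one gradient gives an estimate for every unit at once.
est = (h_clean - h_corrupt) * grad

# Ground truth: actually patch each unit and measure the metric change.
true = np.empty(4)
for i in range(4):
    h = h_corrupt.copy()
    h[i] = h_clean[i]
    true[i] = metric(h) - metric(h_corrupt)

print(np.allclose(est, true))             # exact here because the downstream map is linear
```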
Automated Interpretability
Methods that use language models to automatically generate and score natural language explanations of what individual neurons or features represent, reducing the need for manual inspection. See: Feature Dashboards and Automated Interpretability

C

Cascade Classifier
A two-stage deployment architecture where a cheap, fast classifier (such as a linear probe) screens all traffic and only escalates uncertain or flagged cases to a more expensive classifier (such as an LLM), dramatically reducing average inference cost while maintaining accuracy. See: Probes in Production
Causal Abstraction
A formal relationship between a high-level interpretable causal model and a low-level neural network, established by showing that interventions on aligned components produce matching behavior changes in both systems. See: The Causal Abstraction Framework
Causal Intervention
Any experimental technique that actively modifies model internals (activations, weights, or attention patterns) to test causal hypotheses about how a model computes its outputs, as opposed to purely observational analysis. See: Activation Patching and Causal Interventions
Circuit (neural)
A subgraph of a neural network consisting of specific components (attention heads, MLP neurons, or features) and their connections that together implement an identifiable computational mechanism. See: The IOI Circuit: Discovery and Mechanism
Circuit Tracing
A methodology developed by Anthropic for mapping information flow through neural networks by decomposing computations into interpretable features (via SAEs or transcoders) and tracing their causal connections. See: Circuit Tracing and Attribution Graphs
Completeness (circuit)
A circuit evaluation criterion measuring whether the identified circuit accounts for all of the model's performance on a task. A complete circuit captures all relevant computation, with no important components left out. See: Circuit Evaluation: Faithfulness, Completeness, and Minimality
Composition (of attention heads)
The mechanism by which attention heads in different layers interact through the residual stream, where earlier heads write information that later heads read. Three types exist: Q-composition, K-composition, and V-composition. See: Composition and Virtual Attention Heads
Concept Erasure
A technique for removing specific concepts from model representations by projecting activations onto the orthogonal complement of the concept's subspace, making the concept linearly unreadable from the modified representations. See: Concept Erasure with LEACE
Contrast-Consistent Search (CCS)
An unsupervised probing method that identifies truth directions in activation space without labeled data, by learning a probe whose outputs on a statement and its negation are consistent (summing to one) and confident (away from 0.5). See: Truthfulness Probing and the Geometry of Truth
Copy Suppression
An attention head algorithm pattern where the head attends to positions where a predicted token appeared earlier in context and outputs the negative of that token's unembedding direction, suppressing the model's tendency to predict tokens it has already seen. See: Copy Suppression
Counterfactual Resampling
A black-box technique for measuring the causal importance of individual reasoning steps: delete a step from a chain-of-thought trace, regenerate from that point many times, and measure the distributional shift in final answers via KL divergence. See: Counterfactual Resampling
Crosscoder
A variant of sparse autoencoders trained jointly on activations from multiple models (or the same model at different training stages), learning a shared feature dictionary that enables direct comparison of representations across models. See: Crosscoders

D

Dead Features
Features in a trained sparse autoencoder that never activate on any input in the dataset. Dead features represent wasted capacity and are a common training challenge for SAEs, addressed by techniques such as resampling. See: SAE Variants, Evaluation, and Honest Limitations
Deception Detection
The application of mechanistic interpretability to identify when a model is generating outputs that conflict with its internal representations, potentially indicating deceptive or unfaithful behavior. See: Deception Detection and Alignment Faking
Depth Schedule
A function assigning a per-layer steering weight across all layers of a model, distributing the intervention across depth rather than concentrating it at a single layer. See: Multi-Layer Steering
Dictionary Learning
A class of methods that learn an overcomplete set of basis vectors (a dictionary) to represent data as sparse combinations. In MI, dictionary learning via sparse autoencoders is used to decompose superposed neural network activations into interpretable features. See: Sparse Autoencoders: Decomposing Superposition
Direct Logit Attribution (DLA)
An interpretability technique that decomposes a model's output logits into additive contributions from each component (attention heads and MLP layers) by projecting their residual stream writes onto the unembedding direction for a token of interest. See: Direct Logit Attribution
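Because the residual stream is a sum of component writes, the logit for a token decomposes exactly into per-component dot products with that token's unembedding column. The sketch below uses random stand-in writes and ignores the final LayerNorm, which makes DLA approximate in real models.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, n_comp, vocab = 8, 3, 5

writes = rng.normal(size=(n_comp, d_model))   # residual-stream writes from each component
W_U = rng.normal(size=(d_model, vocab))       # unembedding matrix

resid_final = writes.sum(axis=0)              # the residual stream is the sum of writes
logits = resid_final @ W_U

token = 2
contribs = writes @ W_U[:, token]             # each component's contribution to that logit
print(np.isclose(contribs.sum(), logits[token]))   # contributions sum to the full logit
```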

E

EAP-IG
Edge Attribution Patching with Integrated Gradients. Replaces the single gradient evaluation in edge attribution patching (EAP) with an average of gradients along the interpolation path from corrupted to clean activations, fixing zero-gradient failures and improving circuit faithfulness. See: Refined Attribution Methods
Embedding
The mapping from discrete tokens to continuous vectors at the start of a transformer. The embedding matrix converts each token into a vector in the residual stream, where it can be read and modified by subsequent layers. See: The Attention Mechanism

F

Faithfulness (circuit)
A circuit evaluation criterion measuring how well the circuit reproduces the full model's behavior when run in isolation. A faithful circuit produces similar outputs to the complete model on the target task. See: Circuit Evaluation: Faithfulness, Completeness, and Minimality
Feature (in MI)
A unit of neural network computation that represents a meaningful concept or pattern. In the context of superposition and SAEs, a feature is a direction in activation space corresponding to an interpretable property of the input. See: The Superposition Hypothesis
Feature Absorption
A failure mode in sparse autoencoders where a feature absorbs activation patterns that should be captured by other features, reducing the fidelity of the learned decomposition and making some features appear more general than they should be. See: SAE Variants, Evaluation, and Honest Limitations
Feature Dashboard
A visualization tool that displays the top-activating dataset examples, logit effects, and other statistics for individual SAE features, helping researchers assess whether a feature corresponds to an interpretable concept. See: Feature Dashboards and Automated Interpretability
Feature Splitting
The phenomenon where a single feature in a smaller SAE splits into multiple, more specific features when the SAE dictionary size is increased, revealing finer-grained structure in model representations. See: Scaling Monosemanticity and Feature Steering
Feature Steering
A technique for controlling model behavior by artificially amplifying or suppressing specific SAE features during inference, effectively pushing model outputs toward or away from concepts those features represent. See: Scaling Monosemanticity and Feature Steering
Function Vector
A direction in activation space that encodes an input-output function (such as 'translate English to French' or 'convert to past tense') rather than a static concept, enabling task transfer when added to unrelated prompts. See: Function Vectors

G

Gated SAE
A sparse autoencoder architecture that separates the decision of whether a feature is active from the estimation of its magnitude, using a gating mechanism that reduces shrinkage bias present in standard L1-regularized SAEs. See: SAE Variants, Evaluation, and Honest Limitations

H

Hierarchical Orthogonality
The geometric property where a parent concept's representation vector is orthogonal to the difference vector between a child concept and the parent. This ensures that manipulating the parent (e.g., 'animal') does not shift the relative probabilities among children (e.g., 'mammal' vs. 'bird'). See: Feature Geometry: Beyond One-Dimensional Directions
HookPoint
A named location in a TransformerLens model where activations can be intercepted, read, or modified during a forward pass. Every intermediate computation (attention patterns, residual stream states, MLP outputs) has a corresponding HookPoint. See: TransformerLens

I

In-Context Learning
The ability of large language models to learn new tasks from examples provided in the prompt without weight updates. Induction heads are a key mechanism underlying this capability, performing pattern matching across the context window. See: Induction Heads and In-Context Learning
Induction Head
The second head of a two-head circuit that implements a simple copying mechanism: when the model sees a pattern like 'A B ... A', the induction head predicts that 'B' will follow. Induction heads are a primary mechanism for in-context learning in transformers. See: Induction Heads and In-Context Learning
Inference-Time Intervention (ITI)
A technique that improves model truthfulness at inference time by shifting activations along truth-correlated directions identified via probing, implementing a probe-then-steer pipeline. See: Truthfulness Probing and the Geometry of Truth
Interchange Intervention Accuracy (IIA)
The proportion of interchange interventions on which the neural network's output matches the prediction of the high-level causal model. Measures how faithfully the causal model describes the network's computation. See: The Causal Abstraction Framework
Interpretability Illusion
The risk that an interpretability method appears to provide a correct explanation of model behavior but actually misses the true mechanism, giving researchers false confidence in their understanding of the model. See: SAE Variants, Evaluation, and Honest Limitations
Intervention Graph
A portable, serializable representation of a set of model interventions in nnsight. The intervention graph decouples the experimental design from model deployment, enabling the same experiment to run locally or on remote infrastructure. See: nnsight and nnterp
IOI Circuit
The circuit discovered in GPT-2 Small that performs the Indirect Object Identification task, consisting of name movers, backup name movers, S-inhibition heads, induction-like heads, and duplicate token heads working together to predict the correct indirect object. See: The IOI Circuit: Discovery and Mechanism
Irreducible Multi-Dimensional Feature
A feature that occupies more than one dimension in activation space and cannot be decomposed into independent one-dimensional features. Days of the week, for instance, form a circle in a 2D subspace where the two dimensions are coupled, not separable. See: Feature Geometry: Beyond One-Dimensional Directions

K

Key Vector
The vector produced by applying the key weight matrix (W_K) to a token's representation. Key vectors are compared against query vectors via dot product to determine attention weights. See: The Attention Mechanism
Key-Value Memory (MLP)
An interpretation of feed-forward layers where each neuron in the hidden layer has a key vector (a row of the input projection) that matches input patterns and a value vector (a column of the output projection) that promotes specific tokens or concepts in the output vocabulary. See: MLPs in Transformers
Knowledge Editing
Techniques for modifying specific factual associations stored in a language model's weights without retraining, typically by making targeted rank-one updates to MLP layers identified as causally responsible for the fact. See: Localized Fact Editing and Its Pitfalls
Knowledge Neuron
An MLP neuron whose activation is causally linked to the expression of a specific factual association, such that suppressing it degrades and amplifying it strengthens the model's recall of that fact. See: MLPs in Transformers

L

Latent Scaling
A diagnostic technique for crosscoders that measures how well a supposedly model-specific latent can explain activations in both models, detecting false attributions caused by L1 sparsity artifacts. See: Feature-Level Model Diffing
Layer Normalization
A normalization technique that rescales activations within each token's representation vector to have zero mean and unit variance, then applies learned affine parameters. Applied before each sublayer in pre-norm transformers, it stabilizes training but introduces a nonlinearity that couples all residual stream dimensions. See: Layer Normalization
LEACE
Least-squares Concept Erasure: a closed-form method for removing linear information about a concept from model representations by computing and projecting out the optimal linear subspace, guaranteeing that no linear classifier can recover the erased concept. See: Concept Erasure with LEACE
Linear Probe
A simple linear classifier trained on frozen model activations to test whether specific information (such as part of speech or sentiment) is linearly accessible at a given layer, providing evidence about what representations a model has learned. See: Probing Classifiers
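A minimal sketch of the idea, using synthetic "activations" in which a binary label is encoded along one direction; the probe is plain logistic regression fit by gradient descent. High probe accuracy is the evidence that the concept is linearly accessible.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "activations": the label is encoded along the first dimension plus noise.
n, d = 200, 16
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + 4.0 * y[:, None] * np.eye(d)[0]

# Fit a logistic-regression probe on the frozen activations.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)      # clip logits for numerical stability
    p = 1.0 / (1.0 + np.exp(-z))
    grad = p - y                          # dLoss/dlogit for cross-entropy
    w -= 0.1 * (X.T @ grad) / n
    b -= 0.1 * grad.mean()

acc = ((X @ w + b > 0) == (y == 1)).mean()
print(f"probe accuracy: {acc:.2f}")       # high accuracy => linearly accessible concept
```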
Linear Representation Hypothesis
The hypothesis that neural networks represent concepts as linear directions in activation space, so that adding or subtracting these directions corresponds to adding or removing the associated concept. See: The Linear Representation Hypothesis
Logit Diff Amplification (LDA)
A technique for surfacing rare model behaviors by sampling from a distribution that amplifies the logit-level differences between two model checkpoints (e.g., before and after fine-tuning), making training-induced behavioral changes more frequent and easier to detect. See: Logit Diff Amplification
Logit Lens
An observational technique that applies the model's unembedding matrix to intermediate residual stream states, converting hidden representations into vocabulary-space predictions to see how the model's output evolves across layers. See: The Logit Lens and Tuned Lens
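The mechanics reduce to one matrix product per layer. The sketch below uses random stand-ins for the intermediate residual states, and omits the final LayerNorm that real implementations apply before unembedding.

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, vocab, n_layers = 8, 10, 4

W_U = rng.normal(size=(d_model, vocab))     # the model's unembedding matrix
resid = rng.normal(size=(n_layers, d_model))  # stand-in residual states after each layer

# Logit lens: decode every intermediate state with the *final* unembedding.
top_tokens = []
for layer, state in enumerate(resid):
    logits = state @ W_U
    top_tokens.append(int(np.argmax(logits)))
    print(f"layer {layer}: top token id = {top_tokens[-1]}")
```

Watching how `top_tokens` changes across layers shows when the model "settles" on its prediction.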

M

Mechanistic Interpretability
A subfield of AI safety research focused on reverse-engineering the internal computations of neural networks to understand how they process information and produce outputs, moving beyond behavioral analysis to study the mechanisms themselves. See: What is Interpretability?
MELBO
Mechanistically Eliciting Latent Behaviors. An unsupervised method that discovers steering vectors by optimizing perturbations at an early layer to maximize activation change at a later layer, requiring no labeled examples or contrast pairs. See: Unsupervised Steering Vectors
Minimality (circuit)
A circuit evaluation criterion measuring whether the circuit contains only components that are necessary for the task. A minimal circuit has no redundant parts whose removal would leave performance unchanged. See: Circuit Evaluation: Faithfulness, Completeness, and Minimality
MLP Layer
The feedforward sublayer in a transformer block, consisting of two linear projections with a nonlinearity between them. MLP layers process each token position independently and are believed to store factual knowledge and perform feature transformations. See: Direct Logit Attribution
Model Diffing
The practice of comparing internal representations between two related models (such as a base model and a fine-tuned version) to identify which features or circuits changed, using tools like crosscoders. See: Feature-Level Model Diffing
Monosemanticity
The property of a neuron or feature activating for exactly one interpretable concept. Monosemantic units are the goal of dictionary learning methods like SAEs, which aim to decompose polysemantic neurons into monosemantic features. See: Sparse Autoencoders: Decomposing Superposition
Multi-Head Attention
The mechanism of running multiple independent attention heads in parallel within a single layer, allowing the model to attend to different types of relationships simultaneously and combine their outputs. See: The Attention Mechanism
Multimodal Interpretability
The application of mechanistic interpretability techniques to models that process multiple input modalities (such as vision and language), investigating how representations are shared or transformed across modalities. See: Multimodal Mechanistic Interpretability

N

Name Mover Head
An attention head in the IOI circuit that attends to the indirect object name and copies it to the final token position, directly promoting that name in the output logits. Name movers are the output stage of the IOI circuit. See: The IOI Circuit: Discovery and Mechanism
Nucleus Sampling (top-p)
A decoding strategy that samples from the smallest set of tokens whose cumulative probability exceeds a threshold p. Unlike top-k, it adapts the number of candidate tokens to the shape of the distribution, including fewer tokens when the model is confident and more when it is uncertain. See: Decoding Strategies
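A small sketch of the filtering step (the function name and the toy distribution are illustrative): sort tokens by probability, keep the smallest prefix whose mass reaches p, zero out the rest, and renormalize.

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Zero out all but the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]            # tokens from most to least likely
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # shortest prefix whose mass reaches p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()                     # renormalize the kept tokens

probs = np.array([0.5, 0.3, 0.15, 0.05])
filtered = top_p_filter(probs, p=0.9)
print(filtered)                                # the 0.05 tail token is dropped
```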

O

OV Circuit
The component of an attention head formed by the product of the value (W_V) and output (W_O) weight matrices. The OV circuit determines what information is written to the residual stream when a token is attended to. See: QK and OV Circuits
Overcomplete Basis
A set of basis vectors that is larger than the dimensionality of the space it spans. In SAEs, an overcomplete basis allows the dictionary to have more features than the model's hidden dimension, enabling it to represent the many features packed into superposition. See: Sparse Autoencoders: Decomposing Superposition

P

Path Patching
A refined variant of activation patching that isolates the effect of a specific computational path between two components, controlling for all other paths. This enables precise attribution of behavior to individual connections in a circuit. See: Attribution Patching and Path Patching
Polysemanticity
The property of a single neuron responding to multiple unrelated concepts. Polysemanticity is a consequence of superposition, where models encode more features than they have neurons by sharing neurons across features. See: The Superposition Hypothesis
Polytope Representation
The convex hull formed by the vector representations of a categorical concept's values in activation space. A binary concept forms a line segment, a ternary concept forms a triangle, and a k-valued concept forms a (k-1)-simplex. See: Feature Geometry: Beyond One-Dimensional Directions
Previous Token Head
An attention head that consistently attends to the immediately preceding token. Previous token heads are one component of the induction circuit: they write information about the preceding token into each position, which induction heads then read to complete repeated patterns. See: Induction Heads and In-Context Learning
Probing Classifier
A simple model (typically linear) trained on neural network activations to predict properties of the input, used as a diagnostic tool to test what information is encoded at different layers of a network. See: Probing Classifiers

Q

QK Circuit
The component of an attention head formed by the product of the query (W_Q) and key (W_K) weight matrices. The QK circuit determines which tokens attend to which other tokens by computing attention scores. See: QK and OV Circuits
Query Vector
The vector produced by applying the query weight matrix (W_Q) to a token's representation. Query vectors are compared against key vectors to compute attention scores that determine how much each position attends to others. See: The Attention Mechanism

R

Refusal Direction
A specific direction in a model's activation space that mediates refusal behavior. When this direction is removed or suppressed, the model stops refusing harmful requests, demonstrating that safety training creates a simple, linear mechanism. See: The Refusal Direction
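Removing a direction amounts to projecting it out of the activation (directional ablation). The sketch below uses a random unit vector as a hypothetical "refusal direction"; in practice the direction is estimated from contrast pairs of harmful and harmless prompts.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 16

v = rng.normal(size=d)
v /= np.linalg.norm(v)             # hypothetical unit-norm "refusal direction"
h = rng.normal(size=d)             # a residual-stream activation

# Directional ablation: remove the component of h along v.
h_ablated = h - (h @ v) * v

print(abs(h_ablated @ v) < 1e-9)   # prints True: the direction is now absent from h
```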
Representation Reading
The practice of extracting information about a model's internal state by training classifiers on its activations, used in safety contexts to detect when a model may be reasoning about deception or harmful content. See: Understanding Safety Mechanisms and MI-Based Monitoring
Residual Stream
The central communication channel in a transformer, implemented as skip connections that allow each layer's output to be added to a running sum. All attention heads and MLP layers read from and write to this shared stream. See: Transformer Architecture Intro
RMSNorm
A simplified variant of layer normalization that normalizes by the root mean square of activations without centering by the mean. Used in LLaMA, Gemma, and other modern architectures for its computational efficiency and comparable performance. See: Layer Normalization
ROME
Rank-One Model Editing: a method for editing factual associations by performing a rank-one update to a specific MLP layer's weights, modifying the key-value mapping for a targeted fact while attempting to preserve other knowledge. See: Localized Fact Editing and Its Pitfalls

S

Safety Monitor
A system that uses mechanistic interpretability techniques (such as probes or feature monitors) to detect potentially dangerous model behaviors at inference time, enabling intervention before harmful outputs are produced. See: Understanding Safety Mechanisms and MI-Based Monitoring
Self-Repair
The phenomenon where ablating or patching a model component causes later components to compensate, partially restoring the original behavior. Self-repair means that ablation effects systematically understate component importance. See: Self-Repair in Language Models
Sleeper Agent
A model with a hidden backdoor that behaves normally under standard conditions but activates harmful behavior when a specific trigger is present. Detecting sleeper agents is a motivating application of MI for safety. See: Detecting Sleeper Agents with Mechanistic Interpretability
Sparse Autoencoder (SAE)
A dictionary learning method that decomposes a model's activations into a larger set of sparse, interpretable features. SAEs learn an overcomplete basis where each feature ideally corresponds to one human-understandable concept. See: Sparse Autoencoders: Decomposing Superposition
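The forward pass of a standard L1-regularized SAE is short enough to sketch directly. The weights below are random (a real SAE is trained on millions of cached activations), and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
d_model, d_sae = 8, 32               # overcomplete: 4x more features than dimensions

# Randomly initialized weights standing in for a trained SAE.
W_enc = 0.1 * rng.normal(size=(d_model, d_sae))
W_dec = 0.1 * rng.normal(size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coef=1e-3):
    f = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)   # sparse feature activations (ReLU)
    x_hat = f @ W_dec + b_dec                          # reconstruction from features
    loss = np.mean((x - x_hat) ** 2) + l1_coef * np.abs(f).sum()
    return f, x_hat, loss

x = rng.normal(size=d_model)                           # a model activation to decompose
f, x_hat, loss = sae_forward(x)
print(f"{(f > 0).sum()} of {d_sae} features active")
```

Training minimizes `loss`: the reconstruction term keeps `x_hat` close to `x`, while the L1 penalty pushes most entries of `f` to zero.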
Superposition
The phenomenon where neural networks represent more features than they have dimensions by encoding features as nearly orthogonal directions in activation space, allowing models to store more concepts than their parameter count would naively permit. See: The Superposition Hypothesis

T

Temperature (sampling)
A hyperparameter that scales logits before the softmax during text generation. Temperature below 1 sharpens the distribution (more deterministic), temperature above 1 flattens it (more random), and temperature approaching 0 recovers greedy decoding. See: Decoding Strategies
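The sharpening and flattening can be seen on a three-token toy distribution (the function name is illustrative):

```python
import numpy as np

def temperature_softmax(logits, temperature=1.0):
    """Softmax over temperature-scaled logits."""
    z = logits / temperature
    z = z - z.max()                  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.0])
cold = temperature_softmax(logits, temperature=0.1)   # sharper: near-greedy
base = temperature_softmax(logits, temperature=1.0)
hot = temperature_softmax(logits, temperature=2.0)    # flatter: more random
print(cold.round(3), base.round(3), hot.round(3))
```

The top token's probability grows as temperature drops, approaching 1 (greedy decoding) as the temperature approaches 0.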
Thought Anchor
A reasoning step with disproportionately high counterfactual importance, meaning the model's final answer distribution changes substantially when that step is removed. Plan generation and uncertainty management steps tend to be thought anchors. See: Counterfactual Resampling
Tracing Context
A Python context manager in nnsight where code is captured rather than executed immediately. Operations on model internals within a tracing context build up an intervention graph that is executed as a batch when the context exits. See: nnsight and nnterp
Transcoder
A sparse autoencoder variant applied to MLP layers that maps from MLP inputs to MLP outputs, learning interpretable features that describe what transformations the MLP performs rather than what it represents. See: Transcoders: Interpretable MLP Replacements
Tuned Lens
An improvement on the logit lens that trains a learned affine transformation at each layer (rather than reusing the final unembedding matrix), producing more accurate predictions of the model's evolving computation at intermediate layers. See: The Logit Lens and Tuned Lens

U

Unembedding
The mapping from the final residual stream representation back to vocabulary logits at the end of a transformer. The unembedding matrix converts continuous vectors into token probabilities, and is central to techniques like the logit lens and direct logit attribution. See: The Logit Lens and Tuned Lens
Universality
The hypothesis that different neural networks trained on similar tasks converge on similar internal representations and circuits, suggesting that certain computational solutions are natural or optimal for given problems. See: Universality Across Models

V

Value Vector
The vector produced by applying the value weight matrix (W_V) to a token's representation. Value vectors carry the content information that gets written to the residual stream, weighted by the attention pattern. See: The Attention Mechanism
Virtual Attention Head
An emergent attention head that does not correspond to any single physical head in the model but arises from the composition of two or more heads across different layers communicating through the residual stream. See: Composition and Virtual Attention Heads