<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Learn Mechanistic Interpretability</title>
  <subtitle>Structured articles covering transformer internals, interpretability techniques, and frontier research.</subtitle>
  <link href="https://learnmechinterp.com/feed.xml" rel="self" type="application/atom+xml"/>
  <link href="https://learnmechinterp.com/" rel="alternate" type="text/html"/>
  <updated>2026-02-23T05:40:39Z</updated>
  <id>https://learnmechinterp.com/</id>
  <author>
    <name>Learn MI</name>
  </author>
  <entry>
    <title>Getting Started in MI Research</title>
    <link href="https://learnmechinterp.com/topics/getting-started-in-mi-research/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/getting-started-in-mi-research/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A guide to Neel Nanda&#39;s opinionated roadmap for building skills in mechanistic interpretability research: from learning the basics to publishing your first paper.</summary>
  </entry>
  <entry>
    <title>ARENA: Hands-On Technical Training</title>
    <link href="https://learnmechinterp.com/topics/arena/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/arena/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The Alignment Research Engineer Accelerator -- an intensive program and open curriculum for building practical mechanistic interpretability skills through coding exercises.</summary>
  </entry>
  <entry>
    <title>SAELens and Neuronpedia</title>
    <link href="https://learnmechinterp.com/topics/saelens-and-neuronpedia/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/saelens-and-neuronpedia/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The primary tools for working with sparse autoencoders in practice: SAELens for training, loading, and analyzing SAEs programmatically, and Neuronpedia for interactive exploration of SAE features.</summary>
  </entry>
  <entry>
    <title>nnsight and nnterp</title>
    <link href="https://learnmechinterp.com/topics/nnsight-and-nnterp/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/nnsight-and-nnterp/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Two complementary tools for interpretability research that work directly with HuggingFace models -- nnsight for flexible model inspection and intervention, and nnterp for a standardized interface across architectures.</summary>
  </entry>
  <entry>
    <title>TransformerLens</title>
    <link href="https://learnmechinterp.com/topics/transformerlens/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/transformerlens/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The standard Python library for mechanistic interpretability research -- its history, design philosophy, core abstractions, and practical usage.</summary>
  </entry>
  <entry>
    <title>Honest Limitations of MI for Safety</title>
    <link href="https://learnmechinterp.com/topics/mi-safety-limitations/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/mi-safety-limitations/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A candid assessment of what mechanistic interpretability cannot yet do for AI safety -- from interpretability illusions to scalability gaps -- and why closing these gaps matters.</summary>
  </entry>
  <entry>
    <title>Understanding Safety Mechanisms and MI-Based Monitoring</title>
    <link href="https://learnmechinterp.com/topics/safety-mechanisms-and-monitoring/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/safety-mechanisms-and-monitoring/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How MI tools enable monitoring model behavior through internal representations, from refusal mechanisms to misalignment detection, and the practical limitations of these approaches.</summary>
  </entry>
  <entry>
    <title>Deception Detection and Alignment Faking</title>
    <link href="https://learnmechinterp.com/topics/deception-detection/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/deception-detection/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The alignment faking threat and why behavioral evaluations fail to detect strategic deception, with early evidence that internal probes may succeed where behavior-based approaches cannot.</summary>
  </entry>
  <entry>
    <title>Detecting Sleeper Agents with Mechanistic Interpretability</title>
    <link href="https://learnmechinterp.com/topics/sleeper-agent-detection/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/sleeper-agent-detection/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How MI tools can detect artificially trained backdoor behaviors in language models, and the critical limitation that this does not extend to naturally arising deception.</summary>
  </entry>
  <entry>
    <title>The Refusal Direction</title>
    <link href="https://learnmechinterp.com/topics/refusal-direction/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/refusal-direction/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How a single direction in activation space controls whether a language model refuses harmful requests, and what this reveals about the geometry of safety behavior.</summary>
  </entry>
  <entry>
    <title>Counterfactual Resampling</title>
    <link href="https://learnmechinterp.com/topics/counterfactual-resampling/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/counterfactual-resampling/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A black-box technique for measuring which reasoning steps in a chain-of-thought trace actually influence the model&#39;s final answer, by deleting individual steps and comparing the resulting answer distributions.</summary>
  </entry>
  <entry>
    <title>Copy Suppression</title>
    <link href="https://learnmechinterp.com/topics/copy-suppression/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/copy-suppression/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>An algorithm pattern where attention heads suppress predictions of tokens that appear earlier in context, unifying previously mysterious head behaviors like negative name movers and anti-induction heads.</summary>
  </entry>
  <entry>
    <title>Multimodal Mechanistic Interpretability</title>
    <link href="https://learnmechinterp.com/topics/multimodal-mi/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/multimodal-mi/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Extending mechanistic interpretability beyond text to vision models, multimodal systems, and diffusion models -- what works, what breaks, and what remains unknown.</summary>
  </entry>
  <entry>
    <title>Universality Across Models</title>
    <link href="https://learnmechinterp.com/topics/universality/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/universality/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The evidence for and against the universality hypothesis -- whether different neural networks learn similar features and circuits -- and the metrics used to measure representation similarity.</summary>
  </entry>
  <entry>
    <title>Circuit Tracing and Attribution Graphs</title>
    <link href="https://learnmechinterp.com/topics/circuit-tracing/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/circuit-tracing/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How sparse feature circuits and attribution graphs enable tracing model computations at the feature level, from SAE-based circuits to Anthropic&#39;s Biology of an LLM approach.</summary>
  </entry>
  <entry>
    <title>Circuit Evaluation: Faithfulness, Completeness, and Minimality</title>
    <link href="https://learnmechinterp.com/topics/circuit-evaluation/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/circuit-evaluation/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How to evaluate whether a discovered circuit is a faithful, complete, and minimal explanation of model behavior, with lessons from the IOI circuit&#39;s surprising components.</summary>
  </entry>
  <entry>
    <title>The IOI Circuit: Discovery and Mechanism</title>
    <link href="https://learnmechinterp.com/topics/ioi-circuit/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/ioi-circuit/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How researchers reverse-engineered the complete algorithm GPT-2 Small uses for indirect object identification, discovering five classes of attention heads working in concert.</summary>
  </entry>
  <entry>
    <title>Induction Heads and In-Context Learning</title>
    <link href="https://learnmechinterp.com/topics/induction-heads/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/induction-heads/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How the discovery of induction heads revealed a two-step circuit for in-context learning, demonstrating that compositional circuits emerge during training.</summary>
  </entry>
  <entry>
    <title>Activation Oracles</title>
    <link href="https://learnmechinterp.com/topics/activation-oracles/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/activation-oracles/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>General-purpose activation explainers trained on diverse interpretation tasks, capable of recovering information from fine-tuned models and matching white-box baselines through natural language interrogation.</summary>
  </entry>
  <entry>
    <title>LatentQA and Latent Interpretation Tuning</title>
    <link href="https://learnmechinterp.com/topics/latentqa/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/latentqa/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Framing activation interpretation as question-answering: training decoder models on paired datasets of activations and Q&amp;A to enable open-ended queries about what representations encode.</summary>
  </entry>
  <entry>
    <title>Training Models to Explain Their Computations</title>
    <link href="https://learnmechinterp.com/topics/training-self-explanation/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/training-self-explanation/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Fine-tuning language models to generate natural language descriptions of their internal processes, with a key finding: models explain their own computations better than external models can.</summary>
  </entry>
  <entry>
    <title>SelfIE: Self-Interpretation of Embeddings</title>
    <link href="https://learnmechinterp.com/topics/selfie-interpretation/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/selfie-interpretation/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A method enabling language models to interpret their own hidden embeddings by converting them into natural language descriptions, with applications to understanding ethical reasoning, detecting prompt injection, and controlling model behavior.</summary>
  </entry>
  <entry>
    <title>Patchscopes</title>
    <link href="https://learnmechinterp.com/topics/patchscopes/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/patchscopes/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A unifying framework for inspecting hidden representations by patching activations into target prompts designed to elicit natural language descriptions of their content.</summary>
  </entry>
  <entry>
    <title>Hidden State Decoding: From Vectors to Language</title>
    <link href="https://learnmechinterp.com/topics/hidden-state-decoding-intro/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/hidden-state-decoding-intro/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>An introduction to LLM-based activation interpretation: using language models themselves to decode their hidden representations into human-readable natural language.</summary>
  </entry>
  <entry>
    <title>Finetuning Traces in Activations</title>
    <link href="https://learnmechinterp.com/topics/finetuning-traces/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/finetuning-traces/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How narrow finetuning creates detectable biases in model activations that encode the finetuning domain, enabling steering, automated interpretation, and auditing of what a model was trained on.</summary>
  </entry>
  <entry>
    <title>Feature-Level Model Diffing</title>
    <link href="https://learnmechinterp.com/topics/feature-level-model-diffing/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/feature-level-model-diffing/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How crosscoders compare base and fine-tuned models at the feature level, what sparsity artifacts distort the results, and how BatchTopK crosscoders find genuinely chat-specific features.</summary>
  </entry>
  <entry>
    <title>Logit Diff Amplification</title>
    <link href="https://learnmechinterp.com/topics/logit-diff-amplification/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/logit-diff-amplification/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How amplifying the logit-level differences between two model checkpoints can surface rare undesired behaviors that standard sampling would almost never find.</summary>
  </entry>
  <entry>
    <title>Crosscoders</title>
    <link href="https://learnmechinterp.com/topics/crosscoders/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/crosscoders/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How crosscoders extend sparse autoencoders to train shared feature dictionaries across layers or models, enabling cross-layer feature tracking, circuit simplification, and model comparison.</summary>
  </entry>
  <entry>
    <title>Transcoders: Interpretable MLP Replacements</title>
    <link href="https://learnmechinterp.com/topics/transcoders/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/transcoders/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How transcoders replace opaque MLP layers with sparse interpretable alternatives, enabling feature-level circuit analysis of what MLPs compute.</summary>
  </entry>
  <entry>
    <title>Feature Geometry: Beyond One-Dimensional Directions</title>
    <link href="https://learnmechinterp.com/topics/feature-geometry/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/feature-geometry/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How categorical concepts form polytopes, periodic features trace circles, and hierarchical relationships map to orthogonal subspaces -- revealing that feature geometry in representation space is far richer than single directions.</summary>
  </entry>
  <entry>
    <title>SAE Variants, Evaluation, and Honest Limitations</title>
    <link href="https://learnmechinterp.com/topics/sae-variants-and-evaluation/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/sae-variants-and-evaluation/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The landscape of SAE architectures beyond vanilla L1 -- Gated, TopK, and JumpReLU SAEs -- how to evaluate them with SAEBench, and the honest limitations that remain unsolved.</summary>
  </entry>
  <entry>
    <title>Scaling Monosemanticity and Feature Steering</title>
    <link href="https://learnmechinterp.com/topics/scaling-monosemanticity/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/scaling-monosemanticity/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How scaling sparse autoencoders to millions of features revealed multilingual, multimodal, and abstract concepts -- and how clamping these features enables steering model behavior.</summary>
  </entry>
  <entry>
    <title>Feature Dashboards and Automated Interpretability</title>
    <link href="https://learnmechinterp.com/topics/sae-interpretability/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/sae-interpretability/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How researchers visualize and interpret the features SAEs discover, from manual feature dashboards to automated LLM-based interpretation.</summary>
  </entry>
  <entry>
    <title>Sparse Autoencoders: Decomposing Superposition</title>
    <link href="https://learnmechinterp.com/topics/sparse-autoencoders/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/sparse-autoencoders/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How sparse autoencoders learn an overcomplete dictionary of monosemantic features, decomposing the polysemantic activations that superposition creates.</summary>
  </entry>
  <entry>
    <title>Concept Erasure with LEACE</title>
    <link href="https://learnmechinterp.com/topics/concept-erasure/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/concept-erasure/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How LEACE provides a mathematically guaranteed method for erasing specific concepts from model representations, going beyond simple ablation with formal guarantees.</summary>
  </entry>
  <entry>
    <title>Localized Fact Editing and Its Pitfalls</title>
    <link href="https://learnmechinterp.com/topics/fact-editing/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/fact-editing/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Techniques for editing specific facts in model weights via rank-one updates, why the insertion-versus-editing flaw undermines the approach, and what this teaches about interpretability rigor.</summary>
  </entry>
  <entry>
    <title>Multi-Layer Steering</title>
    <link href="https://learnmechinterp.com/topics/multi-layer-steering/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/multi-layer-steering/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How distributing steering interventions across multiple transformer layers overcomes the fragility of single-layer steering, from fixed depth schedules to principled layer selection to learned per-layer weights.</summary>
  </entry>
  <entry>
    <title>Unsupervised Steering Vectors</title>
    <link href="https://learnmechinterp.com/topics/unsupervised-steering-vectors/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/unsupervised-steering-vectors/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How optimization-based methods discover steering vectors without contrast pairs, finding latent behaviors the researcher did not anticipate by maximizing activation change across layers.</summary>
  </entry>
  <entry>
    <title>Function Vectors</title>
    <link href="https://learnmechinterp.com/topics/function-vectors/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/function-vectors/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How in-context learning examples create natural directions in activation space that encode entire tasks, showing that models represent functions, not just features.</summary>
  </entry>
  <entry>
    <title>Representation Control</title>
    <link href="https://learnmechinterp.com/topics/representation-control/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/representation-control/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The unified framework for steering model behavior through interventions on internal representations, encompassing addition and ablation as complementary operations.</summary>
  </entry>
  <entry>
    <title>Affine Steering</title>
    <link href="https://learnmechinterp.com/topics/affine-steering/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/affine-steering/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How combining directional ablation with an affine correction and tunable addition unifies addition and ablation steering into a single, more precise intervention.</summary>
  </entry>
  <entry>
    <title>Ablation Steering</title>
    <link href="https://learnmechinterp.com/topics/ablation-steering/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/ablation-steering/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How projecting out directions from model activations can disable specific behaviors, demonstrating the causal necessity of concept representations.</summary>
  </entry>
  <entry>
    <title>Addition Steering</title>
    <link href="https://learnmechinterp.com/topics/addition-steering/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/addition-steering/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How adding carefully computed steering vectors to model activations during inference can shift model behavior without any fine-tuning.</summary>
  </entry>
  <entry>
    <title>Probes in Production</title>
    <link href="https://learnmechinterp.com/topics/probes-in-production/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/probes-in-production/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How activation probes move from research tools to deployed safety systems: the distribution-shift problem with long-context inputs, novel aggregation architectures like MultiMax, cascade designs that pair cheap probes with expensive classifiers, and the ensembling and training strategies that make probes reliable at scale.</summary>
  </entry>
  <entry>
    <title>Attention Probes</title>
    <link href="https://learnmechinterp.com/topics/attention-probes/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/attention-probes/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How learned attention mechanisms inside probes solve the sequence aggregation problem, letting the probe decide which token positions matter for classification instead of relying on mean pooling or last-token heuristics.</summary>
  </entry>
  <entry>
    <title>Linear Artificial Tomography (LAT)</title>
    <link href="https://learnmechinterp.com/topics/lat-probing/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/lat-probing/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How to read what concepts a model represents by training linear classifiers on activations, following the population-level approach from cognitive neuroscience.</summary>
  </entry>
  <entry>
    <title>Contrastive Activation Addition (CAA)</title>
    <link href="https://learnmechinterp.com/topics/caa-method/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/caa-method/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How to compute robust steering vectors by averaging activation differences across many contrast pairs, isolating the shared direction corresponding to a target concept.</summary>
  </entry>
  <entry>
    <title>Truthfulness Probing and the Geometry of Truth</title>
    <link href="https://learnmechinterp.com/topics/truthfulness-probing/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/truthfulness-probing/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How probing techniques reveal that truth and falsehood have linear geometric structure inside language models, from unsupervised truth discovery (CCS) to optimization-free difference-in-means probes, causal validation via intervention, and the probe-then-steer pipeline (ITI) that connects probing to steering.</summary>
  </entry>
  <entry>
    <title>Probing Classifiers</title>
    <link href="https://learnmechinterp.com/topics/probing-classifiers/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/probing-classifiers/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How simple classifiers trained on model activations reveal what information is encoded in representations, from structural probes to MDL probing, and the fundamental gap between correlation and causation.</summary>
  </entry>
  <entry>
    <title>The Causal Abstraction Framework</title>
    <link href="https://learnmechinterp.com/topics/causal-abstraction/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/causal-abstraction/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The theoretical framework that unifies activation patching, probing, circuit analysis, and other MI methods as special cases of one idea: testing whether a high-level causal model faithfully describes a neural network&#39;s computation via interchange interventions.</summary>
  </entry>
  <entry>
    <title>Self-Repair in Language Models</title>
    <link href="https://learnmechinterp.com/topics/self-repair/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/self-repair/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>When ablating a model component, later components compensate and partially restore the original behavior. Understanding self-repair is essential for correctly interpreting ablation experiments and activation patching results.</summary>
  </entry>
  <entry>
    <title>Refined Attribution Methods</title>
    <link href="https://learnmechinterp.com/topics/refined-attribution-methods/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/refined-attribution-methods/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How AtP*, EAP-IG, and EAP-GP fix the failure modes of gradient-based circuit discovery, from attention saturation and effect cancellation to zero-gradient regions.</summary>
  </entry>
  <entry>
    <title>Attribution Patching and Path Patching</title>
    <link href="https://learnmechinterp.com/topics/attribution-patching/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/attribution-patching/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Efficient gradient-based approximations to activation patching, and path patching for tracing information flow along specific edges in the computational graph.</summary>
  </entry>
  <entry>
    <title>Activation Patching and Causal Interventions</title>
    <link href="https://learnmechinterp.com/topics/activation-patching/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/activation-patching/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The primary technique for establishing causal claims about model internals: replace an activation and measure what changes. Covers the clean/corrupted framework, noising vs denoising, choosing a metric, and interpreting results.</summary>
  </entry>
  <entry>
    <title>Reading the Attention Patterns</title>
    <link href="https://learnmechinterp.com/topics/reading-attention-patterns/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/reading-attention-patterns/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How to visualize and interpret attention patterns to understand what information heads are moving, from previous token heads to induction heads.</summary>
  </entry>
  <entry>
    <title>Direct Logit Attribution</title>
    <link href="https://learnmechinterp.com/topics/direct-logit-attribution/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/direct-logit-attribution/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How to decompose a model&#39;s output into per-component contributions by projecting each attention head&#39;s output onto the logit difference direction.</summary>
  </entry>
  <entry>
    <title>The Logit Lens and Tuned Lens</title>
    <link href="https://learnmechinterp.com/topics/logit-lens-and-tuned-lens/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/logit-lens-and-tuned-lens/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Vocabulary projection methods for reading model internals: the logit lens projects intermediate residual streams to vocabulary space, and the tuned lens corrects for basis changes across layers.</summary>
  </entry>
  <entry>
    <title>The Superposition Hypothesis</title>
    <link href="https://learnmechinterp.com/topics/superposition/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/superposition/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How neural networks represent more features than dimensions by encoding them as nearly-orthogonal directions, why this makes interpretability hard, and what the toy model reveals about when superposition occurs.</summary>
  </entry>
  <entry>
    <title>The Linear Representation Hypothesis</title>
    <link href="https://learnmechinterp.com/topics/linear-representation-hypothesis/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/linear-representation-hypothesis/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>Why neural networks appear to represent concepts as linear directions in activation space, and why individual neurons fail as units of analysis.</summary>
  </entry>
  <entry>
    <title>What is Interpretability?</title>
    <link href="https://learnmechinterp.com/topics/what-is-mech-interp/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/what-is-mech-interp/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>The landscape of neural network interpretability approaches and the three core claims that define mechanistic interpretability as a field.</summary>
  </entry>
  <entry>
    <title>Decoding Strategies</title>
    <link href="https://learnmechinterp.com/topics/decoding-strategies/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/decoding-strategies/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How transformer logits become text: greedy decoding, temperature scaling, top-k, nucleus sampling, and beam search, and why MI research mostly studies the forward pass directly.</summary>
  </entry>
  <entry>
    <title>Composition and Virtual Attention Heads</title>
    <link href="https://learnmechinterp.com/topics/composition-and-virtual-heads/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/composition-and-virtual-heads/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How attention heads compose across layers through V-, K-, and Q-composition, creating virtual attention heads with capabilities no single head possesses.</summary>
  </entry>
  <entry>
    <title>QK and OV Circuits</title>
    <link href="https://learnmechinterp.com/topics/qk-ov-circuits/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/qk-ov-circuits/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How attention heads decompose into independent QK (matching) and OV (copying) circuits through the low-rank factorization of weight matrices.</summary>
  </entry>
  <entry>
    <title>Layer Normalization</title>
    <link href="https://learnmechinterp.com/topics/layer-normalization/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/layer-normalization/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How layer normalization stabilizes transformer training, why it introduces a nonlinearity that complicates mechanistic interpretability, and the practical strategies researchers use to work around it.</summary>
  </entry>
  <entry>
    <title>MLPs in Transformers</title>
    <link href="https://learnmechinterp.com/topics/mlps-in-transformers/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/mlps-in-transformers/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How feed-forward layers work as key-value memories that store factual knowledge, promote interpretable concepts in vocabulary space, and orchestrate a multi-stage pipeline for factual recall.</summary>
  </entry>
  <entry>
    <title>The Attention Mechanism</title>
    <link href="https://learnmechinterp.com/topics/attention-mechanism/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/attention-mechanism/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>How transformers enable tokens to communicate through queries, keys, and values: the attention equation, causal masking, multi-head attention, and the QK/OV decomposition that underlies mechanistic interpretability.</summary>
  </entry>
  <entry>
    <title>Transformer Architecture Intro</title>
    <link href="https://learnmechinterp.com/topics/transformer-architecture/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/transformer-architecture/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A guided walkthrough of the full transformer stack: embeddings, attention, MLPs, layer norm, residual stream, and positional information.</summary>
  </entry>
  <entry>
    <title>Prerequisites</title>
    <link href="https://learnmechinterp.com/topics/mi-prerequisites/" rel="alternate" type="text/html"/>
    <id>https://learnmechinterp.com/topics/mi-prerequisites/</id>
    <updated>2026-02-23T05:40:39Z</updated>
    <summary>A short refresher on neural networks, backpropagation, and linear algebra notation used throughout the course.</summary>
  </entry>
</feed>
