What This Refresher Covers

Mechanistic interpretability assumes comfort with a few core ideas from deep learning and linear algebra. This page is a quick refresh, not a full tutorial. If these concepts are familiar, you can skip ahead.

Neural Networks in 60 Seconds

A neural network composes linear maps with nonlinearities. For a single layer:

$$\mathbf{h} = f(\mathbf{x} W + \mathbf{b})$$

  • $\mathbf{x}$ is the input vector.
  • $W$ is a learned weight matrix.
  • $\mathbf{b}$ is a bias vector.
  • $f$ is a nonlinear activation (ReLU, GELU, etc.).

Stacking many layers produces a deep function that can represent complex behaviors.
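The layer equation above can be sketched in a few lines of NumPy. The dimensions here (a 3-dimensional input, a 4-dimensional hidden layer) are arbitrary choices for illustration, not anything from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # input vector x
W = rng.normal(size=(3, 4))   # learned weight matrix W
b = rng.normal(size=4)        # bias vector b

def relu(z):
    # ReLU activation: zero out negative entries.
    return np.maximum(z, 0.0)

# One layer: h = f(xW + b)
h = relu(x @ W + b)
print(h.shape)  # (4,)
```

Stacking layers just means feeding `h` into the next `relu(h @ W2 + b2)`, and so on.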

Training and Backpropagation

Training adjusts weights to reduce a loss function $\mathcal{L}$. The loop is:

  1. Forward pass: compute predictions.
  2. Loss: measure error.
  3. Backward pass: compute gradients $\nabla_W \mathcal{L}$.
  4. Update: $W \leftarrow W - \eta \nabla_W \mathcal{L}$.
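The four steps above can be sketched with a toy one-parameter model, small enough that the gradient is written by hand rather than by backpropagation. The data, learning rate, and step count are all illustrative assumptions:

```python
import numpy as np

# Toy data for a 1-parameter linear model y = w * x; the true w is 2.0.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x

w = 0.0      # initial weight
eta = 0.1    # learning rate (the eta in the update rule)

for _ in range(100):
    pred = w * x                        # 1. forward pass: compute predictions
    loss = np.mean((pred - y) ** 2)     # 2. loss: mean squared error
    grad = np.mean(2 * (pred - y) * x)  # 3. backward pass: dL/dw by hand
    w = w - eta * grad                  # 4. update: w <- w - eta * grad
```

After training, `w` has converged to (very nearly) 2.0; in a real network, step 3 is what backpropagation automates across every layer.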

Mechanistic interpretability often treats a trained model as fixed and asks: what computation does this trained function implement?

Linear Algebra You Will See Constantly

Vectors as directions. A vector $\mathbf{v} \in \mathbb{R}^n$ is a point or direction in $n$-dimensional space. In transformers, activations and embeddings are vectors.

Matrices as linear maps. A matrix $W$ maps vectors between spaces: $W : \mathbb{R}^n \to \mathbb{R}^m$. Most transformer computations are matrix multiplications.

Dot product. The dot product $\mathbf{a} \cdot \mathbf{b}$ measures similarity between two vectors. Attention scores are dot products between queries and keys.

Subspaces and projections. A subspace is a set of directions closed under addition and scaling. A projection onto a direction keeps only the component of a vector along that direction; subtracting the projection removes that component. Many MI methods interpret features as directions and reason about projections onto them.
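A minimal sketch of projecting onto a direction and removing that component, with vectors chosen purely for illustration:

```python
import numpy as np

v = np.array([3.0, 4.0])
d = np.array([1.0, 0.0])     # unit-length direction

component = (v @ d) * d      # projection of v onto d
residual = v - component     # v with the d-component removed

print(component)  # [3. 0.]
print(residual)   # [0. 4.]
```

The residual is orthogonal to `d`: its dot product with `d` is zero. This is the basic move behind methods that "ablate" a feature direction from an activation.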

Notation Quick Reference

  • Bold lowercase: vectors $\mathbf{x}$, $\mathbf{r}$
  • Uppercase: matrices $W$, $W_Q$, $W_K$, $W_V$
  • Residual stream: $\mathbf{r}^l$ for layer $l$
  • Attention head output: $\mathbf{r}^{l,h}$ for head $h$ in layer $l$

If this all feels easy, you are ready for the transformer foundations articles.