What This Refresher Covers
Mechanistic interpretability assumes comfort with a few core ideas from deep learning and linear algebra. This page is a quick refresh, not a full tutorial. If these concepts are familiar, you can skip ahead.
Neural Networks in 60 Seconds
A neural network composes linear maps with nonlinearities. For a single layer:

$\mathbf{y} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$

- $\mathbf{x}$ is the input vector.
- $\mathbf{W}$ is a learned weight matrix.
- $\mathbf{b}$ is a bias vector.
- $\sigma$ is a nonlinear activation (ReLU, GELU, etc.).
Stacking many layers produces a deep function that can represent complex behaviors.
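The layer equation above can be sketched in a few lines of NumPy. The shapes and random weights here are illustrative, not from any real model:

```python
import numpy as np

def layer(x, W, b):
    """One layer: linear map W @ x + b followed by a ReLU nonlinearity."""
    return np.maximum(0.0, W @ x + b)

# A tiny hypothetical 2-layer network mapping R^3 -> R^4 -> R^2.
rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # input vector
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # first layer parameters
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # second layer parameters

h = layer(x, W1, b1)   # hidden activation (all entries >= 0 after ReLU)
y = layer(h, W2, b2)   # network output
```

Stacking more calls to `layer` is all "deep" means here: each layer's output becomes the next layer's input.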
Training and Backpropagation
Training adjusts the weights $\theta$ to reduce a loss function $\mathcal{L}(\theta)$. The loop is:
- Forward pass: compute predictions.
- Loss: measure error.
- Backward pass: compute gradients $\nabla_\theta \mathcal{L}$.
- Update: $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$, where $\eta$ is the learning rate.
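The four steps of the loop can be sketched on a toy one-parameter model. The gradient is written by hand here purely for illustration; real frameworks (PyTorch, JAX) compute it automatically via backpropagation:

```python
import numpy as np

# Toy model: predict y = w * x, squared-error loss L = mean((w*x - y)^2).
# Ground-truth weight is 2, so gradient descent should drive w toward 2.
xs = np.array([1.0, 2.0, 3.0])
ys = 2.0 * xs
w, lr = 0.0, 0.05          # initial weight and learning rate

for _ in range(200):
    preds = w * xs                          # 1. forward pass
    loss = np.mean((preds - ys) ** 2)       # 2. loss
    grad = np.mean(2 * (preds - ys) * xs)   # 3. backward pass: dL/dw
    w -= lr * grad                          # 4. update
```

After training, `w` is very close to 2, the weight that generated the data.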
Mechanistic interpretability often treats a trained model as fixed and asks: what computation does this trained function implement?
Linear Algebra You Will See Constantly
Vectors as directions. A vector $\mathbf{x} \in \mathbb{R}^n$ is a point or direction in $n$-dimensional space. In transformers, activations and embeddings are vectors.
Matrices as linear maps. A matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$ maps vectors between spaces: $\mathbf{y} = \mathbf{W}\mathbf{x}$. Most transformer computations are matrix multiplications.
Dot product. The dot product $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i$ measures similarity between two vectors. Attention scores are dot products between queries and keys.
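A minimal sketch of dot products as similarity, in the spirit of attention scoring (the vectors here are made up, and this is not a full attention head):

```python
import numpy as np

# A query scores highest against the key pointing in the most similar direction.
q = np.array([1.0, 0.0])
keys = np.array([
    [1.0, 0.0],    # same direction as q
    [0.0, 1.0],    # orthogonal to q
    [-1.0, 0.0],   # opposite direction
])

scores = keys @ q   # dot product of q with each key: [1., 0., -1.]

# A softmax turns scores into positive weights that sum to 1,
# concentrating weight on the most similar key.
weights = np.exp(scores) / np.exp(scores).sum()
```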
Subspaces and projections. A subspace is a set of directions closed under addition and scaling. Projecting a vector onto a direction extracts its component along that direction; subtracting the projection removes that component. Many MI methods interpret features as directions and reason about projections onto them.
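Projection is a one-line computation. The sketch below removes a vector's component along a unit direction, the basic move behind ablations that "erase" a hypothesized feature direction (the vectors are illustrative):

```python
import numpy as np

v = np.array([3.0, 4.0])
d = np.array([1.0, 0.0])   # a unit-norm "feature direction"

proj = (v @ d) * d         # component of v along d: [3., 0.]
v_perp = v - proj          # v with that component removed: [0., 4.]
# v_perp is now orthogonal to d: their dot product is zero.
```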
Notation Quick Reference
- Bold lowercase: vectors $\mathbf{x}$, $\mathbf{v}$
- Bold uppercase: matrices $\mathbf{W}$, $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$
- Residual stream: $\mathbf{x}^{(\ell)}$ for layer $\ell$
- Attention head output: $\mathbf{h}_i^{(\ell)}$ for head $i$ in layer $\ell$
If this all feels easy, you are ready for the transformer foundations articles.