The Attention Mechanism

Why Attention?

Consider the sentence: "The cat sat on the mat because it was tired." What does "it" refer to? For a human reader the answer is obvious: "it" means the cat. But arriving at this answer requires looking back across the sentence and connecting a pronoun to the noun it references. A model that processes each token in isolation, without any ability to look at other positions, has no way to make this connection.

A simple feed-forward network applied independently at each position treats every token as if the rest of the sequence does not exist. It can transform each token's representation, but it cannot move information between positions. Pronoun resolution, subject-verb agreement, long-range dependencies: none of these are possible without some mechanism for tokens to communicate with one another.

The attention mechanism solves this problem [1]. It provides a structured way for each token to look at every other token in the sequence, decide which ones are relevant, and gather information from them. Rather than processing tokens in isolation, attention lets the model build context-dependent representations where each token's output reflects the entire input it has seen so far.

Queries, Keys, and Values

Attention organizes the communication between tokens around three learned roles. Every token simultaneously plays all three:

Attention (Intuition): Each token participates in attention through three projections. The query ( $\mathbf{q}$ ) asks "what am I looking for?", the key ( $\mathbf{k}$ ) advertises "what do I contain?", and the value ( $\mathbf{v}$ ) provides "what information do I send if attended to?"

Each role is produced by multiplying the token's embedding by a learned weight matrix. For a token at position $i$ with embedding $\mathbf{x}_i \in \mathbb{R}^{d_{\text{model}}}$ , the three projections are:

\mathbf{q}_i = \mathbf{x}_i W_Q, \quad \mathbf{k}_i = \mathbf{x}_i W_K, \quad \mathbf{v}_i = \mathbf{x}_i W_V

The projection matrices $W_Q, W_K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ map the input down to a $d_k$ -dimensional query/key space, while $W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ maps to the value space. These are three different "views" of the same input, each optimized by gradient descent for a different purpose during training.
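As a minimal sketch of these projections, the snippet below multiplies one row-vector embedding by three small matrices in plain Python. The helper name `project` and all numbers are illustrative assumptions (here $d_{\text{model}} = 3$ and $d_k = d_v = 2$); in a trained model the matrices would be learned.

```python
def project(x, W):
    """Row vector times matrix: x has length d_in, W has shape d_in x d_out."""
    d_out = len(W[0])
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(d_out)]

# Hypothetical embedding with d_model = 3.
x_i = [1.0, 2.0, 0.0]

# Hypothetical projection matrices, each of shape d_model x 2.
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
W_V = [[0.5, 0.0], [0.0, 2.0], [1.0, 1.0]]

q_i = project(x_i, W_Q)  # [1.0, 2.0]
k_i = project(x_i, W_K)  # [2.0, 1.0]
v_i = project(x_i, W_V)  # [0.5, 4.0]
print(q_i, k_i, v_i)
```

The same embedding yields three different vectors, one per role, because each matrix reads the input in its own way.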

The Attention Equation

Figure: Scaled dot-product attention. The query and key vectors are combined via dot product, scaled, optionally masked, normalized with softmax, and used to weight the value vectors. From Vaswani et al., *Attention Is All You Need* [2].

With queries, keys, and values defined, the attention mechanism proceeds in three steps: compute relevance scores, normalize them into weights, and use the weights to mix value vectors.

Step 1: Dot-product scores. How much should token $i$ attend to token $j$ ? The model measures this by computing the dot product between the query of token $i$ and the key of token $j$ :

e_{i,j} = \mathbf{q}_i^T \mathbf{k}_j

A large dot product means the query and key point in similar directions, indicating the model has learned that these two tokens are relevant to each other.

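Step 2: Scaling. The raw dot-product scores grow in magnitude with the dimension $d_k$: if the components of $\mathbf{q}_i$ and $\mathbf{k}_j$ are roughly independent with zero mean and unit variance, their dot product is a sum of $d_k$ such terms and has variance $d_k$, so typical scores have magnitude around $\sqrt{d_k}$. Scores this large can push the softmax into saturated regions with vanishingly small gradients. The fix is to divide by $\sqrt{d_k}$, which brings the scores back to roughly unit variance: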

\tilde{e}_{i,j} = \frac{\mathbf{q}_i^T \mathbf{k}_j}{\sqrt{d_k}}

Step 3: Softmax normalization. The scaled scores $\tilde{e}_{i,j}$ are passed through a softmax to produce a probability distribution over positions:

\alpha_{i,j} = \frac{\exp(\tilde{e}_{i,j})}{\sum_{j'} \exp(\tilde{e}_{i,j'})}

Now $\alpha_{i,j} \geq 0$ and $\sum_j \alpha_{i,j} = 1$ . Each weight $\alpha_{i,j}$ tells us how much attention token $i$ pays to token $j$ .
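The interaction between scaling and softmax can be illustrated numerically. The sketch below assumes query and key components drawn independently from a standard normal distribution (an idealization, not a property of any particular trained model) and uses hypothetical helper names. It estimates how the spread of raw dot products grows with $d_k$, then shows how softmax saturates when scores are large.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_std(d_k, n_samples=2000, seed=0):
    """Empirical standard deviation of q.k for random unit-variance q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / n_samples
    return math.sqrt(sum((d - mean) ** 2 for d in dots) / n_samples)

# Raw spread grows roughly like sqrt(d_k); dividing by sqrt(d_k) brings it near 1.
for d_k in (4, 64):
    raw_std = dot_product_std(d_k)
    print(f"d_k={d_k}: raw std {raw_std:.2f}, scaled std {raw_std / math.sqrt(d_k):.2f}")

# Softmax saturates as score magnitudes grow: the largest weight approaches 1.
small = softmax([1.0, 2.0, 3.0])
large = softmax([8.0, 16.0, 24.0])
print(max(small), max(large))
```

When one weight is close to 1 and the rest are close to 0, small changes to the scores barely change the output, which is the vanishing-gradient regime that the scaling is meant to avoid.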

The output. The final output for token $i$ is a weighted sum of value vectors:

\text{out}_i = \sum_j \alpha_{i,j} \mathbf{v}_j

In plain terms: gather information from other tokens, weighted by relevance. Tokens with high attention weight contribute more to the output; tokens with near-zero weight are effectively ignored.

Putting it all together in matrix form, where $Q$ , $K$ , and $V$ stack the queries, keys, and values for all tokens:

\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

This single equation is the core of the transformer [3]. Everything else in the architecture (multi-head attention, the residual stream, MLPs) is built around it.
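The matrix form translates almost line for line into code. The following is a dependency-free sketch in plain Python, not an optimized implementation; the helper names and the 2-token inputs are assumptions chosen for illustration.

```python
import math

def matmul(A, B):
    """Matrix product of two lists-of-rows."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Apply a numerically stable softmax to each row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning (output, attention weights)."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, V), weights

# Hypothetical 2-token inputs: each query matches its own key best.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

out, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is one token's attention distribution, so every row sums to 1, and each row of `out` is the corresponding mixture of value vectors.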

A Worked Example

To make the attention equation concrete, we trace a single attention head on a 3-token sequence with $d_k = 2$ and $d_v = 3$. The tokens are A, B, and C, and we compute the attention output for token C (the final position).

Setup. Suppose the query, key, and value vectors are:

Token	Query $\mathbf{q}$	Key $\mathbf{k}$	Value $\mathbf{v}$
A	—	$(1, 0)$	$(1, 0, 0)$
B	—	$(0, 1)$	$(0, 1, 0)$
C	$(1, 1)$	$(1, 1)$	$(0, 0, 1)$

We only need C's query (since we are computing attention from position C) and all three keys and values.

Step 1: Dot-product scores. Token C's query is compared against each key:

e_{C,A} = \mathbf{q}_C^T \mathbf{k}_A = (1)(1) + (1)(0) = 1

e_{C,B} = \mathbf{q}_C^T \mathbf{k}_B = (1)(0) + (1)(1) = 1

e_{C,C} = \mathbf{q}_C^T \mathbf{k}_C = (1)(1) + (1)(1) = 2

Step 2: Scale by $\sqrt{d_k}$ . With $d_k = 2$ , we divide by $\sqrt{2} \approx 1.41$ :

\tilde{e}_{C,A} = 0.71, \quad \tilde{e}_{C,B} = 0.71, \quad \tilde{e}_{C,C} = 1.41

Step 3: Softmax. Converting to attention weights:

\alpha_{C,A} = \frac{e^{0.71}}{e^{0.71} + e^{0.71} + e^{1.41}} = \frac{2.03}{2.03 + 2.03 + 4.10} \approx 0.25

\alpha_{C,B} \approx 0.25, \quad \alpha_{C,C} \approx 0.50

Token C attends most strongly to itself (50%), with equal attention to A and B (25% each). The self-attention is strongest because C's key aligns most with its own query (dot product of 2 vs. 1).

Step 4: Weighted sum of values. The output for token C is:

\text{out}_C = 0.25 \cdot (1, 0, 0) + 0.25 \cdot (0, 1, 0) + 0.50 \cdot (0, 0, 1) = (0.25, 0.25, 0.50)

The output is dominated by C's own value vector, with smaller contributions from A and B. This is the information that this attention head writes to the residual stream at position C.

The key observation: the attention pattern (who attends to whom) is entirely determined by the dot products between queries and keys. The values are passive passengers, mixed according to whatever weights the QK interaction produces. These are two independent computations, which is the foundation of the QK/OV circuit decomposition we will develop later.
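The worked example, and the observation that values do not influence the attention pattern, can both be checked numerically. This sketch uses the query, keys, and values from the table above; the function names and the alternative values in `other_values` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Attention for a single query: returns (weights, weighted sum of values)."""
    d_k = len(q)
    scaled = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scaled)
    d_v = len(values[0])
    out = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(d_v)]
    return weights, out

q_C = [1.0, 1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                    # k_A, k_B, k_C
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # v_A, v_B, v_C

weights, out_C = attend(q_C, keys, values)
print([round(w, 2) for w in weights])  # [0.25, 0.25, 0.5]
print([round(o, 2) for o in out_C])    # [0.25, 0.25, 0.5]

# Swapping in different (hypothetical) values changes the output but not the weights.
other_values = [[5.0, 0.0, 0.0], [0.0, 7.0, 0.0], [0.0, 0.0, 9.0]]
other_weights, other_out = attend(q_C, keys, other_values)
print(other_weights == weights)
```

At full precision the weights are closer to 0.248, 0.248, and 0.503; the values in the text are rounded to two decimal places.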

Self-Attention and Causal Masking

In self-attention, the queries, keys, and values all come from the same input sequence. Given an input matrix $X$ (one row per token), we compute $Q = XW_Q$ , $K = XW_K$ , and $V = XW_V$ . The sequence attends to itself: every token can look at every other token and decide what information to gather. This is how a transformer lets all positions interact in a single step, producing context-dependent representations where each token's output reflects its relationship to the entire input.

For each token position, self-attention performs a complete information-gathering operation: it examines all other positions via the query-key match, decides how much to attend to each via softmax, collects the relevant information as a weighted sum of values, and writes the result back. Each token's output is therefore a context-dependent mixture of all tokens' value vectors.

In decoder-only transformers (such as GPT), there is an additional constraint: each token can only attend to itself and earlier tokens. This is enforced by setting $e_{i,j} = -\infty$ for all $j > i$ before the softmax, which drives those attention weights to zero. This is called causal masking. The reason is simple: during autoregressive generation, future tokens do not exist yet. The model must predict the next token using only the past, so the attention mechanism must respect this constraint during both training and inference.

Causal masking gives mechanistic interpretability a clean experimental setup. At each position $i$, we know exactly what information is available to the model: tokens $0$ through $i$. This makes it possible to reason precisely about what the model could and could not have used to produce its output.
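Causal masking can be sketched in a few lines. The function below is an illustrative assumption rather than any library's API: it takes a square matrix of scores, replaces every entry with $j > i$ by $-\infty$, and applies a row-wise softmax.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with positions j > i masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)  # finite, because position j = i is never masked
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical scores: every pair of tokens is equally relevant.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_softmax(scores)
for row in weights:
    print(row)
```

With equal scores, the first token attends only to itself, the second splits its attention over two positions, and the third over all three. No weight ever lands on a future position.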

Multi-Head Attention

A single attention head can only learn one attention pattern, one way of deciding which tokens are relevant to which. But language requires attending to multiple things simultaneously. A token might need to attend to the previous token (for syntax), to the subject of the sentence (for semantics), and to a matching pattern earlier in the text (for repetition), all at the same time.

The solution is to run multiple attention heads in parallel, each with its own learned projection matrices. Each head $h$ has its own $W_Q^h$ , $W_K^h$ , and $W_V^h$ , and computes attention independently:

\text{head}_h = \text{Attn}(XW_Q^h,\; XW_K^h,\; XW_V^h)

The outputs of all heads are concatenated and projected through a final output matrix $W_O$ :

\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H)\, W_O

Why is $W_O$ needed? Because each head operates in a small $d_v$-dimensional subspace, its output cannot be added directly to the $d_{\text{model}}$-dimensional residual stream. The output matrix $W_O \in \mathbb{R}^{(H \cdot d_v) \times d_{\text{model}}}$ maps the concatenated head outputs back into the full residual stream space. It also lets each head learn how to write its result back: which dimensions of the residual stream to update and with what mixture. In mechanistic interpretability, the combined matrix $W_V^h W_O^h$, where $W_O^h \in \mathbb{R}^{d_v \times d_{\text{model}}}$ is the block of rows of $W_O$ corresponding to head $h$, is called the OV circuit of that head: it determines what information the head moves from source to destination.
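The per-head view of $W_O$ follows from block matrix multiplication: concatenating the head outputs and multiplying by $W_O$ gives the same result as multiplying each head's output by its own block of rows and summing. The sketch below checks this on made-up numbers; the head outputs, the matrix entries, and the helper names are all hypothetical.

```python
def matmul(A, B):
    """Matrix product of two lists-of-rows."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def concat_heads(heads):
    """Concatenate per-head outputs along the feature axis (one row per token)."""
    return [sum((h[i] for h in heads), []) for i in range(len(heads[0]))]

# Hypothetical outputs of H = 2 heads for 2 tokens, each with d_v = 2.
head_1 = [[1.0, 0.0], [0.0, 1.0]]
head_2 = [[2.0, 2.0], [1.0, 3.0]]

# Hypothetical W_O with shape (H * d_v) x d_model = 4 x 3.
W_O = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
d_v = 2
W_O_1, W_O_2 = W_O[:d_v], W_O[d_v:]  # row blocks belonging to each head

# Route 1: concatenate, then project.
combined = matmul(concat_heads([head_1, head_2]), W_O)

# Route 2: project each head with its own block, then sum.
part_1 = matmul(head_1, W_O_1)
part_2 = matmul(head_2, W_O_2)
summed = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(part_1, part_2)]

print(combined)  # [[3.0, 2.0, 4.0], [3.0, 4.0, 4.0]]
print(summed == combined)
```

The second route is what makes it possible to talk about one head's contribution to the residual stream in isolation.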

Independent Heads: Each attention head is an independent information-moving operation with its own learned pattern. Head $h$ reads from the residual stream, processes it through its own $d_k$ -dimensional subspace, and writes back to the residual stream.

With $H$ heads and $d_k = d_{\text{model}} / H$, the total parameter count is the same as a single large head, but the model can attend to $H$ different things at once. For mechanistic interpretability, this is a major advantage: we can study each head individually to understand what it does.

The projection to a lower-dimensional space creates a bottleneck. Each attention head operates in a subspace of dimension $d_k$, which is typically $d_{\text{model}} / H$, where $H$ is the number of heads. This low-rank structure is important for mechanistic interpretability because it means each head can only attend to and move information along a limited set of directions.
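The parameter-count claim can be checked with simple arithmetic. This sketch assumes $d_v = d_k = d_{\text{model}} / H$ and ignores bias terms; the function name is hypothetical, and the sizes 768 and 12 are used only as an illustration.

```python
def attention_param_count(d_model, n_heads):
    """Number of weights in W_Q, W_K, W_V, and W_O across all heads (no biases, d_v = d_k)."""
    d_k = d_model // n_heads
    # Three d_model x d_k projections plus this head's d_k x d_model block of W_O.
    per_head = 3 * d_model * d_k + d_k * d_model
    return n_heads * per_head

print(attention_param_count(768, 1))   # 2359296, one large head
print(attention_param_count(768, 12))  # 2359296, twelve small heads
```

Splitting into heads leaves the total at $4 \cdot d_{\text{model}}^2$ and only changes how those weights are partitioned.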

In practice, different heads specialize in remarkably specific patterns. Previous token heads consistently attend to the immediately preceding token. Induction heads complete patterns by looking for previous occurrences of the current token and attending to what came after. Name mover heads copy proper names to later positions where they are needed for prediction. We will explore induction heads and other specialized head types in later articles.

To see why multiple heads matter, consider the sentence "The tired cat sat on the mat because it was tired" at the token position "it." Different heads can extract different relationships from the same position simultaneously:

Head A might attend from "it" back to "cat," resolving the pronoun to its referent.
Head B might attend from "it" to the first "tired," tracking which property is being referenced.
Head C might attend from "it" to "sat," tracking the main verb of the clause.

No single head could serve all three purposes at once. Each head's QK circuit determines a different relevance pattern, and each head's OV circuit copies different information. The concatenation of their outputs gives the model simultaneous access to the referent, its property, and the action, all from a single attention layer.

Each attention head is a separate information-moving operation with its own learned pattern. Understanding what each head does is a core goal of mechanistic interpretability.

Pause and think: From architecture to interpretability

If attention heads move information between positions, what determines which information gets moved and where it goes? The query and key matrices determine the "where" (which positions attend to which), while the value and output matrices determine the "what" (which information gets read and written). Decomposing attention into these two circuits, the QK circuit and the OV circuit, is one of the first steps in mechanistic interpretability.

Looking Ahead

Attention is how information moves between positions. But each transformer layer has a second component: the MLP, which transforms information within each position. Where attention routes information, MLPs process it, acting as key-value memories that store and retrieve knowledge. The next article covers how MLPs work and what they compute.

After that, layer normalization addresses the practical complication of keeping activations stable across many layers, and the QK/OV circuit decomposition formalizes the two-circuit structure hinted at above into the mathematical framework that underpins mechanistic interpretability.