mechanistic interpretability, part 1

tl;dr: Anthropic’s “A Mathematical Framework for Transformer Circuits” reduces the transformer to its fundamental parts—no MLPs, no layer-norms, just attention—and then rebuilds it layer-by-layer to show how surprisingly rich behavior (copying, skip-trigrams, induction) emerges from simple linear algebra. The key mental shifts are: (1) treat the residual stream as the main communication channel; (2) remember that attention heads are independent and additive; and (3) see attention itself as moving information via two almost-disjoint circuits, QK (where to look) and OV (what to write). Under that lens, zero-layer models look like bigram tables, one-layer models learn skip-trigrams, and two-layer models invent induction heads. This culminates in the beginnings of in-context learning.

some motivations

I saw that Anthropic open-sourced their circuit-tracing tools while traveling in Japan, and I haven't had the time to play around with them until now. Looking at the tools, I noticed some fundamental knowledge gaps (what is an attribution graph?), so I had to work back from there. Conveniently, 'A Mathematical Framework for Transformer Circuits' turned out to be a solid bridge between the decoder-only transformer architecture I recently familiarized myself with and the basics of mech interp.

simplifying the transformer

In this paper, the team works with a slimmed-down transformer: a decoder-only transformer without layernorm, biases, or the feed-forward neural net in each block. As a result, we have an attention-only transformer. The simplified flow goes something like token, embedding, N residual blocks (self-attention only), and an unembedding to logits.
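
To make that flow concrete, here's a minimal numpy sketch of an attention-only forward pass (my own toy code, not the paper's; the weight shapes, the causal mask, and the 1/sqrt(d_head) scaling are my assumptions, and layernorm, biases, and MLPs are simply absent):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_only_forward(tokens, W_E, W_U, layers):
    """Forward pass of the simplified (attention-only) transformer.

    tokens: (seq,) array of token ids
    W_E: (d_model, n_vocab) embedding; W_U: (n_vocab, d_model) unembedding
    layers: list of layers, each a list of per-head tuples (W_Q, W_K, W_V, W_O),
            with W_Q/W_K/W_V of shape (d_head, d_model) and W_O of shape (d_model, d_head)
    """
    x = W_E[:, tokens]                                   # residual stream, (d_model, seq)
    seq = x.shape[1]
    mask = np.triu(np.full((seq, seq), -np.inf), k=1)    # causal mask: can't attend forward
    for layer in layers:
        update = np.zeros_like(x)
        for W_Q, W_K, W_V, W_O in layer:
            q, k, v = W_Q @ x, W_K @ x, W_V @ x          # (d_head, seq) each
            A = softmax(q.T @ k / np.sqrt(q.shape[0]) + mask)  # (seq, seq) attention pattern
            update += W_O @ (v @ A.T)                    # each head adds its piece independently
        x = x + update                                   # write back into the residual stream
    return W_U @ x                                       # logits, (n_vocab, seq)
```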

rethinking the residual stream

The first key rethink has to do with residual connections. To use an analogy, most diagrams show them as skinny side-pipes, but this paper views them as the main highway, the primary communication channel running through the transformer. Concretely, the residual vector (the residual stream) is a high-dimensional scratchpad that every layer reads from and writes to. See the diagram below:

This framing allows us to treat any pair of layers as if they were connected by virtual weights even if they are separated by several blocks. This comes up again in two-layer transformers.

heads are independent and additive

The concatenate-then-project implementation of multi-head attention in the 2017 paper hides the true mechanics (it's simpler than it looks). Split the usual output matrix WO into H blocks, and you see that each head's contribution is just added into the stream—no fancy concatenation needed. That independence justifies studying one head at a time and later composing them algebraically.
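
A quick numpy check of that claim (toy shapes of my own choosing, with stand-in head outputs rather than real attention results): concatenating the heads and applying the full W_O gives exactly the same answer as slicing W_O into per-head blocks and summing each head's contribution.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads, seq = 16, 4, 4, 7

head_outs = [rng.normal(size=(seq, d_head)) for _ in range(n_heads)]  # stand-in per-head results
W_O = rng.normal(size=(n_heads * d_head, d_model))                    # the usual big output projection

# 2017-style: concatenate all heads, then project once.
concat_then_project = np.concatenate(head_outs, axis=-1) @ W_O

# Circuits-style: slice W_O into per-head blocks; each head just adds into the stream.
blocks = np.split(W_O, n_heads, axis=0)                               # W_O^h, each (d_head, d_model)
sum_of_heads = sum(h @ W_O_h for h, W_O_h in zip(head_outs, blocks))

assert np.allclose(concat_then_project, sum_of_heads)
```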

attention as moving information

Note: the notation is a little bit different here. Instead of expressing Q as x W_Q, Q is expressed as W_Q x. This just means the dimensions are flipped (vectors are treated as columns rather than rows), but there is no fundamental difference.

We can express the result vectors r_i of attention as

r_i = \sum_j A_{ij} v_j

where v_j = W_V x_j.

We can use the language of tensor products to reformulate this as

h(x) = (\text{Id} \otimes W_O)\cdot(A \otimes \text{Id})\cdot(\text{Id} \otimes W_V)\cdot x

and by the mixed-product property,

h(x) = (A \otimes W_O W_V)\cdot x.

With this expression, we can see the attention head as basically applying two linear operations, A and W_O W_V. A decides which tokens information is moved from and to, while W_O W_V decides what information is read from the source token (src) and how it is written to the destination token (dst). Loosely, due to this mathematical separation, the team claims that “which tokens to move information from is completely separable from what information is ‘read’ to be moved and how it is ‘written’ to the destination.”

Also, the attention pattern

\sigma(Q^T K)

can be expressed as

\sigma((W_Q x)^T (W_K x)) = \sigma(x^T W_Q^T W_K x).

In this alternate expression for attention, we can see that W_O and W_V always appear together, and W_Q^T and W_K do as well. Let W_{OV} be W_O W_V and W_{QK} be W_Q^T W_K.
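
A small numpy sanity check of that separation (toy dimensions, columns as token vectors to match the W_Q x convention, and no causal mask or scaling for brevity): the head output computed from the four original matrices matches the one computed from just W_QK and W_OV.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, seq = 12, 3, 5
x = rng.normal(size=(d_model, seq))                     # columns are token vectors
W_Q, W_K, W_V = (rng.normal(size=(d_head, d_model)) for _ in range(3))
W_O = rng.normal(size=(d_model, d_head))

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Four-matrix version: scores from Q and K, values mixed by A, then written out by W_O.
A = softmax_rows((W_Q @ x).T @ (W_K @ x))
h_full = W_O @ (W_V @ x) @ A.T

# Two-matrix version: only the products W_QK and W_OV matter.
W_QK = W_Q.T @ W_K                                      # (d_model, d_model), rank <= d_head
W_OV = W_O @ W_V                                        # (d_model, d_model), rank <= d_head
h_circuits = W_OV @ x @ softmax_rows(x.T @ W_QK @ x).T

assert np.allclose(h_full, h_circuits)
```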

One last thing: products of attention heads behave like an attention head. The paper refers to these products as 'virtual attention heads'.

(A^{k_2} \otimes W_{OV}^{k_2})\cdot (A^{k_1} \otimes W_{OV}^{k_1}) = (A^{k_2}A^{k_1} \otimes W_{OV}^{k_2}W_{OV}^{k_1}).
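
The identity above is just the mixed-product property of Kronecker products, which is easy to verify numerically (toy shapes; here the attention patterns are random matrices rather than real softmax outputs):

```python
import numpy as np

rng = np.random.default_rng(2)
seq, d_model = 4, 6
A1, A2 = rng.normal(size=(seq, seq)), rng.normal(size=(seq, seq))
W_OV1, W_OV2 = rng.normal(size=(d_model, d_model)), rng.normal(size=(d_model, d_model))

# Composing two heads yields another operator of the same (A ⊗ W_OV) shape.
lhs = np.kron(A2, W_OV2) @ np.kron(A1, W_OV1)
rhs = np.kron(A2 @ A1, W_OV2 @ W_OV1)
assert np.allclose(lhs, rhs)
```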

zero-layer transformer

With only embedding weights and unembedding weights, a zero-layer transformer resembles a bigram model. Mathematically:

T = W_U W_E.

We're going to see this expression again in every transformer. This term corresponds to the 'direct path': the token embedding flows directly down the residual stream to the unembedding without going through any layers.
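
In code, the zero-layer model is literally a lookup into that table (toy random weights, with my assumed convention of W_E holding one column per vocabulary token):

```python
import numpy as np

rng = np.random.default_rng(3)
n_vocab, d_model = 10, 4
W_E = rng.normal(size=(d_model, n_vocab))   # embedding
W_U = rng.normal(size=(n_vocab, d_model))   # unembedding

T = W_U @ W_E                               # (n_vocab, n_vocab) bigram-style table

# With no attention layers, the logits at each position depend only on the current token:
tokens = np.array([3, 7, 7, 1])
logits = W_U @ W_E[:, tokens]               # the whole "model"
assert np.allclose(logits, T[:, tokens])    # ...is just column lookups into T
```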

one-layer transformer

Here's a diagram of the one-layer transformer below:

Expressing this in a tensor product,

T = (\text{Id} \otimes W_U) \cdot \Big(\text{Id} + \sum_{h} A^h \otimes W_{OV}^h\Big) \cdot (\text{Id} \otimes W_E)

where

A^h = \sigma(t^T W_E^T W_{QK}^h W_E t).

(Recall x = W_E t.)

Combining with tensor product properties, we have

T = \text{Id} \otimes W_U W_E + \sum_{h} A^h \otimes (W_U W_{OV}^h W_E).

We can recognize the first term from the zero-layer transformer: it's the direct path from embedding to unembedding. The second term is the contribution of the attention heads, with A^h mirroring the role of A and W_U W_{OV}^h W_E mirroring the role of W_{OV} from earlier.

We call the expressions W_E^T W_{QK}^h W_E the query-key (QK) circuit and W_U W_{OV}^h W_E the output-value (OV) circuit. We can interpret these as paths through the model and consequently describe their function: the QK circuit goes from W_E to W_Q to W_K to W_E and controls which tokens the head attends to, while the OV circuit goes from W_E to W_V to W_O to W_U, determining how attending to a given token affects the logits.

The key point about the QK and OV circuits is that they are individual functions we can understand. With them (and with the attention pattern treated as given), we can think of the logits as a linear function of the tokens, at least for a simplified one-layer transformer.

In one-layer transformers, we can look at the largest entries in the QK/OV circuit matrices to see some of the connections the transformer has made. Since there is only one attention layer, these connections come in the form of 'skip-trigrams', triplets of the form [source] ... [destination] → [output]. It's worth noting that analyzing these skip-trigrams is manually intensive and requires domain-specific knowledge, which motivates Anthropic's later work on automated circuit discovery.
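
Here's a rough sketch of what that inspection looks like, with random weights standing in for a trained one-layer model (so the 'top entries' below are meaningless; with real weights you would also map the integer ids back to actual tokens):

```python
import numpy as np

rng = np.random.default_rng(4)
n_vocab, d_model, d_head = 50, 16, 4
W_E = rng.normal(size=(d_model, n_vocab))
W_U = rng.normal(size=(n_vocab, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_head, d_model)) for _ in range(3))
W_O = rng.normal(size=(d_model, d_head))

# Token-level circuit matrices for one head.
QK = W_E.T @ (W_Q.T @ W_K) @ W_E    # (n_vocab, n_vocab): how strongly a query token attends to a key token
OV = W_U @ (W_O @ W_V) @ W_E        # (n_vocab, n_vocab): how attending to a token changes each output logit

def top_entries(M, n=5):
    """Return the n largest-magnitude (row, col, value) entries of a circuit matrix."""
    flat = np.argsort(-np.abs(M), axis=None)[:n]
    rows, cols = np.unravel_index(flat, M.shape)
    return [(i, j, M[i, j]) for i, j in zip(rows, cols)]

# A skip-trigram src ... dst -> out is strong when QK[dst, src] and OV[out, src] are both large.
print("top QK entries (dst, src):", top_entries(QK))
print("top OV entries (out, src):", top_entries(OV))
```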

two-layer transformer

This section's introduction is so good that I'm referencing it directly:

Deep learning studies models that are deep, which is to say they have many layers. Empirically, such models are very powerful. Where does that power come from? One intuition might be that depth allows composition, which creates powerful expressiveness.

Composition of attention heads is the key difference between one-layer and two-layer attention-only transformers. Without composition, a two-layer model would simply have more attention heads to implement skip-trigrams with. But we'll see that in practice, two-layer models discover ways to exploit attention head composition to express a much more powerful mechanism for accomplishing in-context learning. In doing so, they become something much more like a computer program running an algorithm, rather than look-up tables of skip-trigrams we saw in one-layer models.

With one layer, the model can already build crude skip-trigrams, but its heads work independently. Adding a second layer allows heads to compose: the output of one head can alter the keys, queries, or values seen by another. That single extra step upgrades the model from “look-up table” to something closer to an algorithm, most famously the induction head, which spots a repeated pattern and extends it.

(there are technicals here, and one day I will break it down, but as of now I feel a bit intimidated by the 6-dimensional tensor and armed with too few tools for tensors)

Here's how induction heads work:

  1. K-shift trick. A layer-1 head h_0 attends to the previous token, copies it forward, and writes a vector encoding token t_{i-1} into position i's residual stream. A layer-2 head h_1 has a W_K that latches onto that vector, so the key at position i effectively encodes token t_{i-1}, not t_i.
  2. Q matches K. If the current token t_i is the same as some earlier token t_j, then the key at position j+1 (which encodes t_j) matches the query Q_i. Attention spikes on position j+1.
  3. V copies forward. The value vector at position j+1 is just that position's own token t_{j+1}, the token that followed t_j last time. h_1 pastes t_{j+1} into position i's residual stream, and the unembedding projects it into a logit boost for t_{j+1}.

Effectively, the model extends any pair (a, b) it has seen before: having seen a followed by b earlier in the context, it predicts b the next time a appears. This is the flagship example of in-context learning.
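
As a token-level caricature of that behavior (this is just the behavioral rule the head implements, not the mechanism inside the model):

```python
def induction_rule(tokens):
    """If the current token appeared earlier in the context, predict the token
    that followed its most recent earlier occurrence."""
    current = tokens[-1]
    for j in range(len(tokens) - 2, -1, -1):   # scan backwards over earlier positions
        if tokens[j] == current:
            return tokens[j + 1]               # ...[a][b]...[a] -> predict [b]
    return None                                # no earlier occurrence, no prediction

print(induction_rule(["the", "cat", "sat", "on", "the"]))  # -> "cat"
```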

onwards

I think it's amazing that two layers are enough to invent non-trivial algorithms. In this paper, the team remarks that 'the clearest path forward would require individually interpretable neurons, which we've had limited success finding.' Knowing that Golden Gate Claude exists, it's insanely exciting to see how they discovered the model's features after struggling in 2021. All that's left to do is read. And add more layers.

references

  • Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., ... & Olah, C. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html