Thinking about Transformer Interpretability
Back in September 2025, Neel Nanda posted his guide on “How to Become a Mechanistic Interpretability Researcher”. This came out at just the right time for me, as I was wrapping up my internship at Meta and was looking forward to transitioning my focus toward research in mechanistic interpretability. Among the many activities he recommends (which I spent much of my fall quarter pursuing), one of his highly recommended reads was the classic interpretability paper “A Mathematical Framework for Transformer Circuits”. I’d seen the paper pop up a couple of times, including in Logan’s notes on it, and over winter break, I figured I would dedicate a day to reading it in depth.
I ended up getting a lot more out of the paper than I originally expected, primarily because my understanding of transformers was relatively superficial and incomplete beforehand. The paper had a lot of big ideas, but the conceptual takeaways section under “Summary of Results” captures the important ones well. The takeaways alone do not suffice for true understanding, though; I think the process of reading and working through the paper is far more valuable.
Part of the inspiration behind this style of post is a recent article from Holden Karnofsky about “Learning by Writing”. The overall concept of learning by actively playing with the material is useful, but I specifically want to hone in on his approach of reading with the intent of forming an opinion. Though I have already spent a sizable amount of time reading through this paper, and I did internalize the main conceptual takeaways, my opinions on what this means for how I approach interpretability research moving forward remain unformed. I hope that by writing this piece, I will think critically about the significance of the key takeaways and develop some conception of this paper’s implications.
Big Ideas
Since my aim is to distill not just the main takeaways but their implications, I’m not going to spend very much time on the details of the paper itself. I will proceed assuming the reader understands the conceptual takeaways and the summary of follow-up research, as well as how a transformer works. I’ll start by giving what I consider the most useful view (for interpretability purposes) of how transformers work, along with their fundamental components and mechanisms (i.e. attention heads, the residual stream, MLPs, etc.). The paper focused on small attention-only transformers, but since I care about what the paper implies for my general understanding of transformers, I’ll be sharing my updated understanding of the default transformer architecture (e.g. GPT-2).
So a transformer is essentially an architecture that begins by taking in a sequence of tokens (assume no batching) and embedding each token into a d_model-dimensional space; this embedded representation is the residual stream. Then, for however many layers deep the model is, it repeats the following cycle:
- Pass tokens into an attention block and add the output of the attention block back to tokens (i.e. tokens += attention_block(tokens))
- Pass tokens (which have now been modified by the attention block) into an MLP and add the output of the MLP feedforward network back to tokens (i.e. tokens += MLP(tokens))
Then take this final representation of tokens, unembed it back to d_vocab space using an unembedding matrix, softmax to get a probability distribution, and sample from the distribution to get a next token. We’re now going to focus on the attention block and the MLP (ignoring LayerNorm and associated computations), as they make up the bulk of the transformer’s computation.
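Here is a minimal sketch of that loop in PyTorch, just to make the structure concrete. The names and hyperparameters are illustrative, and I’m leaning on nn.MultiheadAttention rather than writing attention out by hand; LayerNorm, positional embeddings, and batching are omitted to keep the skeleton visible.

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Decoder-only skeleton: embed -> [attention, MLP] x n_layers -> unembed."""

    def __init__(self, d_vocab=50257, d_model=768, n_heads=12, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(d_vocab, d_model)
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers)
        )
        self.mlp = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_layers)
        )
        self.unembed = nn.Linear(d_model, d_vocab, bias=False)

    def forward(self, token_ids):                        # token_ids: (seq_len,)
        resid = self.embed(token_ids).unsqueeze(0)       # residual stream: (1, seq_len, d_model)
        seq_len = token_ids.shape[0]
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        for attn, mlp in zip(self.attn, self.mlp):
            attn_out, _ = attn(resid, resid, resid, attn_mask=mask)
            resid = resid + attn_out                     # add (never overwrite) attention output
            resid = resid + mlp(resid)                   # add MLP output
        logits = self.unembed(resid)[0]                  # back to d_vocab space
        probs = torch.softmax(logits[-1], dim=-1)        # distribution over the next token
        return torch.multinomial(probs, num_samples=1)   # sample the next token
```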
The attention block consists of several independent attention heads. Each attention head performs two independent computations: the QK-circuit and the OV-circuit. The QK-circuit is a low-rank matrix that computes an attention pattern over the sequence, and the OV-circuit is another low-rank matrix that computes the attention-relevant information to be transported. The attention pattern is used to weight the importance of the attention-relevant information before it is added to the residual stream.
There are a couple of points worth noting about the computation of the attention block. The first is that the output of an attention head is low-rank, because the OV-circuit (which produces the information that is added to the residual stream after being scaled by the attention pattern) is itself low-rank. The second is that each attention head is its own independent computation. The third is that the output of the attention block is added to the residual stream. In combination, we can say the following: attention blocks are composed of attention heads that independently compute low-rank cross-token information updates that get added to the residual stream. We will discuss the implications of this in greater depth momentarily.
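To make the QK/OV framing concrete, here is a single head written in circuit form, with made-up dimensions and random weights (these are my own illustrative names, not code from the paper). The rank check at the end is the point: the head’s write into a 768-dimensional stream has rank at most 64.

```python
import torch

d_model, d_head, seq_len = 768, 64, 10
W_Q = torch.randn(d_model, d_head) / d_model**0.5
W_K = torch.randn(d_model, d_head) / d_model**0.5
W_V = torch.randn(d_model, d_head) / d_model**0.5
W_O = torch.randn(d_head, d_model) / d_head**0.5

resid = torch.randn(seq_len, d_model)                 # residual stream entering the head

# QK-circuit: a (d_model x d_model) matrix of rank <= d_head that decides where to attend.
qk_circuit = W_Q @ W_K.T
scores = resid @ qk_circuit @ resid.T / d_head**0.5
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
pattern = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)

# OV-circuit: another (d_model x d_model) matrix of rank <= d_head that decides what gets moved.
ov_circuit = W_V @ W_O
head_output = pattern @ resid @ ov_circuit            # what this head adds to the residual stream

print(torch.linalg.matrix_rank(ov_circuit))           # at most 64, i.e. low-rank in a 768-dim stream
```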
The MLP block operates on individual tokens (whereas the attention block moves information across tokens) in the following manner (assuming GELU for the activation function): MLP(x) = W2(GELU(W1x + b1)) + b2. Typically, W1 projects from d_model to 4 * d_model and W2 projects back down to d_model. The output of MLP(x) gets added back to the residual stream. The relevant points here are that the hidden dimension is larger than the dimension of the residual stream, that the non-linearity lives in the high-dimensional hidden layer, and once again that the output of the block is added to the residual stream (rather than overwriting it).
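The same formula in code, with GPT-2-small sizes (768 → 3072 → 768); again, this is purely an illustrative sketch rather than anything pulled from a real model.

```python
import torch
import torch.nn as nn

d_model = 768
mlp = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # W1, b1: project up to the 3072-dim hidden space
    nn.GELU(),                         # the non-linearity lives in the hidden space
    nn.Linear(4 * d_model, d_model),   # W2, b2: project back down to d_model
)

resid = torch.randn(10, d_model)       # one residual stream vector per token position
resid = resid + mlp(resid)             # output is added back to the residual stream
```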
What This Means for Transformer Interpretability
One point I emphasized several times in this high-level description of transformers is the addition back to the residual stream. The initial token embeddings are never overwritten, only added to and subtracted from. This is critical: it means the residual stream preserves information from previous layers for future layers to read and add to. One could imagine an architecture in which the output of each layer partially or entirely overwrites the content of the residual stream; in that case, in order to preserve information about previous computations, the output of a given layer would need to contain both the information it computed and the information computed by prior layers. In other words, the residual stream in the transformer architecture serves as a communication channel between layers.
The Residual Stream
This sets up the residual stream as a prime target for interpretability, except for the fact that it turns out to not be very interpretable. Why might that be? The short answer is that there’s just way too much going on in a transformer, and the residual stream is actually the bottleneck of computation.
A first noteworthy observation is that the MLP block does computation (including its nonlinearity) in a space with dimension four times greater than the residual stream. Thus, even though the MLP block adds its output to the residual stream, the computation that determines its output is not easily interpretable purely within the dimension of the residual stream.
Another point is that the residual stream dimension is not actually that large. It is usually significantly smaller than the vocabulary size (768 vs over 50,000 in GPT-2), so from the get-go, the residual stream is forced to efficiently represent the set of possible tokens in a tiny space. As attention and MLP blocks add to the residual stream, it must become even more economical, accommodating increasingly nuanced concepts within the already compact dimensionality it started with. As such, the residual stream is an incredibly dense well of meaning, and moreover, its basis is not a privileged one.
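To put rough numbers on the bottleneck (GPT-2-small sizes, random weights purely for illustration): the unembedding maps a 768-dimensional stream to logits over roughly 50k tokens, so the logits can only ever occupy a subspace of rank at most 768.

```python
import torch

d_model, d_vocab = 768, 50257
W_U = torch.randn(d_model, d_vocab) / d_model**0.5   # stand-in unembedding matrix

resid_final = torch.randn(5, d_model)                # final residual stream for 5 positions
logits = resid_final @ W_U                           # (5, 50257), but rank of W_U is at most 768
print(logits.shape)                                  # the whole vocabulary, squeezed through 768 dims
```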
So if not the residual stream, where else can we look?
Attention
Since attention heads are independent, they likely learn to specialize in their own set of abilities, and since their outputs are restricted to a specific subspace of the residual stream, it seems very reasonable to examine the subspaces that attention heads read from and write to for interpretability purposes. There is some research along these lines (that I have not yet read), but it seems that attention heads within a layer are simultaneously multi-functional and incomplete units of computation (hence Anthropic’s exploration of cross-layer transcoders, which I’m excited to eventually read). I haven’t read much literature on this, but I feel like it should be a more sizable entry point for interpretability than I’ve been hearing about, so it is decently high on my preliminary list of topics to explore in mechinterp.
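As a starting point, the write subspace of a head is just the row space of its W_O (at most d_head of the d_model dimensions), so one simple thing to measure is how much of the residual stream’s content lives in that subspace. A sketch with random placeholder weights (in practice you would pull W_O and the activations from a trained model):

```python
import torch

d_model, d_head, seq_len = 768, 64, 10
W_O = torch.randn(d_head, d_model) / d_head**0.5     # stand-in output weights for one head
resid = torch.randn(seq_len, d_model)                # stand-in residual stream activations

# Orthonormal basis for the head's write subspace (the row space of W_O).
basis, _ = torch.linalg.qr(W_O.T)                    # (d_model, d_head), orthonormal columns
projected = resid @ basis @ basis.T                  # component of the stream inside that subspace

fraction = torch.linalg.norm(projected, dim=-1) / torch.linalg.norm(resid, dim=-1)
print(fraction)   # per-position fraction of residual stream norm in this head's write subspace
```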
We have also seen that attention heads will read from and write to subspaces used by earlier attention heads, and in combination with the paper’s idea of “virtual weights”, it seems that analyzing combinations of attention heads may also prove fruitful.
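A rough sketch of what I mean: because every component reads from and writes to the shared residual stream, you can multiply a later head’s QK-circuit directly against an earlier head’s OV-circuit and ask how strongly the two compose. The weights below are random placeholders and the normalization only loosely follows the paper’s composition scores, so treat this as a sketch of the idea rather than a faithful reimplementation.

```python
import torch

d_model, d_head = 768, 64

def random_head():
    W_Q = torch.randn(d_model, d_head) / d_model**0.5
    W_K = torch.randn(d_model, d_head) / d_model**0.5
    W_V = torch.randn(d_model, d_head) / d_model**0.5
    W_O = torch.randn(d_head, d_model) / d_head**0.5
    return W_Q, W_K, W_V, W_O

W_Q1, W_K1, W_V1, W_O1 = random_head()   # head in an earlier layer
W_Q2, W_K2, W_V2, W_O2 = random_head()   # head in a later layer

ov1 = W_V1 @ W_O1        # what head 1 writes into the residual stream
qk2 = W_Q2 @ W_K2.T      # what head 2's attention pattern reads from the stream

# Virtual weight for a K-composition-style interaction: how much of head 1's
# output lands in the subspace that head 2's keys read from, relative to the
# sizes of the two circuits.
virtual = qk2 @ ov1.T
score = virtual.norm() / (qk2.norm() * ov1.norm())
print(score.item())
```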
MLPs
It seems that MLPs are quite difficult to interpret. Not only do they have a high-dimensional activation space, they are also highly non-linear, and their weight matrices are full-rank. Moreover, empirical research demonstrates that individual neurons exhibit polysemanticity and superposition. However, MLPs make up roughly two-thirds of the parameters of modern transformers, and recent work by a mentor of mine makes some progress on interpreting MLP activations.
Conclusions
I began writing this piece with some initial assumptions about how my views on transformer interpretability were going to be changed by the “A Mathematical Framework for Transformer Circuits” paper. I especially thought that attention head subspace interpretability and attention composition were going to be much more significant, although I’m still quite hopeful that I might find something interesting if I keep digging. I didn’t really touch on skip-trigrams and induction heads, which were pretty significant in the original paper, mostly because I don’t feel they have much significance in larger models beyond serving as a proof of concept for circuit-finding. Regardless, writing this blog post did help me think about what seems promising in transformer interpretability research, and I suspect I will return to this topic soon as I consider which topics I really care about exploring in mechanistic interpretability.