## Transformer Cookbook
This is a hyper-condensed glossary of formulas that define a transformer. The notation and mental framework I use are close to those in [*A Mathematical Framework for Transformer Circuits*](https://transformer-circuits.pub/2021/framework/index.html). This is meant for reference, so it won't read smoothly!
## Bread and Butter
### Residual Stream
Tokens are fed to the model in a matrix $X$; each row of this matrix is one token embedding. This is the initial state of the *residual stream*. The attention layers and the MLP layers of the transformer each take $X$, do something to it, and add the result back to the residual stream. The state of the residual stream at layer $l$ is $X_l$.
### Attention head
One single attention head performs
$
\begin{align}
h_n(X)&= \text{softmax}(Q_nK_n^\intercal)V_nW_n^O \\
&= \text{softmax}(XW_n^QW_n^{K\intercal}X^\intercal)XW^V_n W^O_n \\
&= \text{softmax}(XW_n^{QK}X^\intercal)XW^{VO}_n
\end{align}
$
Where
- $\text{shape}(X) = [T_{seq}, d_{model}]$
- $\text{shape}(W^Q_n) = \text{shape}(W^K_n) = \text{shape}(W^V_n) = [d_{model}, d_{model}/n_{heads}] = [d_{model}, d_k]$
- $\text{shape}(W_n^O) = [d_k, d_{model}]$
In the usual $\text{concat}(head_1, head_2,...)W^O$ definition, a single $[d_{model}, d_{model}]$ output matrix $W^O$ is applied to the concatenated head outputs. Equivalently, $W^O$ can be split into $n_{heads}$ blocks of $d_k$ rows each, and each block $W^O_n$ applied separately to its corresponding head, as in the formula above.
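Here is a minimal NumPy sketch of one head as written above. The function and variable names are mine, and the usual $1/\sqrt{d_k}$ scaling and causal mask are left out to match the formula:

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax; subtract the max for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_head(X, W_Q, W_K, W_V, W_O):
    # X: [T_seq, d_model]; W_Q, W_K, W_V: [d_model, d_k]; W_O: [d_k, d_model]
    Q = X @ W_Q              # [T_seq, d_k]
    K = X @ W_K              # [T_seq, d_k]
    V = X @ W_V              # [T_seq, d_k]
    A = softmax(Q @ K.T)     # [T_seq, T_seq] attention pattern
    return A @ V @ W_O       # [T_seq, d_model], ready to add to the residual stream
```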
### Attention layer
One attention layer employs $n_{heads}$ attention heads in parallel and adds their results back to the residual stream
$
\begin{align}
X_{l+1} = X_{l} + \sum_n h_n(X_l)
\end{align}
$
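Reusing the `attention_head` sketch from above, the layer is just a sum over heads added back into the stream (real implementations batch all heads into one matmul, but the math is the same):

```python
def attention_layer(X, heads):
    # heads: list of (W_Q, W_K, W_V, W_O) tuples, one per head.
    # Every head reads the same residual stream state X_l and writes its output back in.
    return X + sum(attention_head(X, W_Q, W_K, W_V, W_O)
                   for W_Q, W_K, W_V, W_O in heads)
```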
### MLP layer
One MLP layer does
$\text{max}(0, XW_{up} + b_{up} )W_{down} + b_{down}$
Where
- $\text{shape}(W_{up}) = [d_{model}, 4\times d_{model}]$ (could be another multiple, is usually 4)
- $\text{shape}(b_{up}) = [1, 4\times d_{model}]$
- $\text{shape}(W_{down}) = [4\times d_{model}, d_{model}]$
- $\text{shape}(b_{down}) = [1, d_{model}]$
The biases are optional, **and I usually ignore them**. Transformers actually work fine without them; [Mistral 7B](https://docs.mistral.ai), for example (cutting edge at the time of writing), [doesn't use biases in its MLP layers](https://github.com/mistralai/mistral-src/blob/147c4e68279b90eb61b19bdea44e16f5539d5a5d/mistral/model.py#L120).
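A sketch of the MLP layer as written above, with the optional biases included (names are mine):

```python
import numpy as np

def mlp_layer(X, W_up, b_up, W_down, b_down):
    # W_up: [d_model, 4*d_model], W_down: [4*d_model, d_model]
    # max(0, ...) is the ReLU nonlinearity from the formula above.
    hidden = np.maximum(0, X @ W_up + b_up)   # [T_seq, 4*d_model]
    return hidden @ W_down + b_down           # gets added back to the residual stream
```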
## Tokenizer
Tokenizing is the process of breaking up an input into distinct units that will be fed to the model. In NLP it's the first step in mapping an input sequence to the matrix $X$: the tokenizer turns text into a sequence of token IDs, and each ID is looked up in an embedding matrix to become one row of $X$.
**Byte pair encoding (BPE):** BPE works by iteratively tallying the occurrence of each adjacent pair of tokens (starting from single characters), merging the most frequent pair into a new symbol, and recording the merge in a dictionary so that it can be reversed. Repeating this many times results in a compressed version of the text, and a dictionary to expand encoded text back to its original form.
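A toy version of the training loop, operating on characters rather than the bytes and word-frequency tables a real tokenizer uses (function name is mine):

```python
from collections import Counter

def bpe_train(text, num_merges):
    # Start from single characters; repeatedly merge the most frequent
    # adjacent pair into one new symbol and record the merge.
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (a, b), _count = pair_counts.most_common(1)[0]
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges  # compressed text plus the dictionary of merges
```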