
Transformers

RNNs are natural models for sequential data, but the $\cO(T)$ sequential cost of evaluating the recurrence and backpropagating gradients through it is a severe limitation. In modern machine learning, one of the deciding factors is how many training epochs you can perform for a fixed computational budget. To that end, architectures that can process an entire sequence in parallel are advantageous. Transformers are one such architecture.

Transformers underlie large language models (LLMs) like OpenAI’s ChatGPT and Google’s Gemini. They are also widely used in computer vision and other domains of machine learning. This chapter walks through the basic building blocks of a transformer: self-attention, token-wise nonlinear transformations, layer norm, and positional encodings. We follow the presentation of Turner (2023) with some modifications for consistency with these notes.

Preliminaries

Tokenization

Before a transformer can process text, the raw string must be converted into a sequence of discrete tokens drawn from a finite vocabulary $\cV$. The choice of tokenization scheme involves a tradeoff: word-level tokenization yields short sequences but requires a very large vocabulary and cannot handle unknown words; character-level tokenization handles any input but produces very long sequences that stress the $\cO(T^2)$ attention cost.

Modern LLMs use subword tokenization, which decomposes text into vocabulary units that are between characters and words in granularity. The dominant algorithm is Byte Pair Encoding (BPE) Sennrich et al., 2016, which starts from a base vocabulary of characters (or raw bytes) and iteratively merges the most frequent adjacent pair of symbols until the vocabulary reaches a target size $|\cV|$ (typically 32k–128k tokens). Common words become single tokens; rare words are split into subword pieces. Because byte-level BPE can always fall back to individual bytes, it handles any Unicode text without unknown-token issues.
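To make the merge procedure concrete, here is a minimal, illustrative sketch of the BPE training loop on a toy word-frequency table. The helper names (`get_pair_counts`, `merge_pair`) and the tiny corpus are our own; production tokenizers operate on raw bytes, use far larger corpora, and perform tens of thousands of merges.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word starting as a tuple of characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
merges = []
for _ in range(10):                         # in practice: tens of thousands of merges
    counts = get_pair_counts(words)
    best = max(counts, key=counts.get)      # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)
print(merges)                               # learned merge rules, e.g. ('e', 'r') first
```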

Each discrete token index $z_t \in \{1, \ldots, |\cV|\}$ is then mapped to a continuous vector via a learned embedding matrix $\mbW_e \in \reals^{|\cV| \times D}$, giving $\mbx_t^{(0)} = (\mbW_e)_{z_t} \in \reals^D$.

Let $\mbX^{(0)} \in \reals^{T \times D}$ denote the matrix of token embeddings with $\mbx_t^{(0)}$ as its $t$-th row.
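As a quick illustration of the embedding lookup and the resulting shapes (with a randomly initialized matrix standing in for the learned $\mbW_e$):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, T = 50_000, 512, 8                   # vocabulary size, embedding dim, sequence length
W_e = rng.normal(scale=0.02, size=(V, D))  # embedding matrix (learned in practice, random here)
z = rng.integers(0, V, size=T)             # token indices z_1, ..., z_T from the tokenizer
X0 = W_e[z]                                # X^(0): row t is the embedding of token z_t
print(X0.shape)                            # (T, D)
```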

The output of the transformer will be another matrix of the same shape, $\mbX^{(L)} \in \reals^{T \times D}$. These output features can be used for downstream tasks like sentiment classification, machine translation, or autoregressive modeling.

The output results from a stack of transformer blocks,

$$\mbX^{(\ell)} = \texttt{transformer-block}(\mbX^{(\ell-1)}).$$

Each block consists of two stages: one that operates vertically, combining information across the sequence length; another that operates horizontally, combining information across feature dimensions.

Transformer Block

Attention

The first stage combines information across sequence length using a mechanism called attention. Mathematically, attention is a weighted average,

$$\mbY^{(\ell)} = \mbA^{(\ell)} \mbX^{(\ell-1)},$$

where $\mbA^{(\ell)} \in \reals_+^{T \times T}$ is a row-stochastic attention matrix: $\sum_{s} A_{t,s}^{(\ell)} = 1$ for all $t$. Intuitively, $A_{t,s}^{(\ell)}$ indicates how much output location $t$ attends to input location $s$.

When using transformers for autoregressive sequence modeling, we constrain the attention matrix to be causal by requiring $A_{t,s}^{(\ell)} = 0$ for all $s > t$; i.e., the matrix is lower triangular.
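The following minimal sketch builds an arbitrary causal, row-stochastic matrix and applies it as a weighted average; the particular weights are random and purely illustrative.

```python
import numpy as np

T, D = 4, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D))            # rows are x_1, ..., x_T

# Any nonnegative lower-triangular matrix with unit row sums is a valid causal
# attention matrix; here we build one by normalizing random weights.
W = np.tril(rng.random((T, T)))        # zero out entries with s > t (causal)
A = W / W.sum(axis=1, keepdims=True)   # row-stochastic: each row sums to 1
Y = A @ X                              # y_t is a weighted average of x_1, ..., x_t
```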


Self-Attention

Where does the attention matrix come from? In a transformer, the attention weights are determined by the pairwise similarity of tokens in the sequence. The simplest instantiation would be

$$A_{t,s} = \frac{\exp \{ \mbx_t^\top \mbx_s\}}{\sum_{s'=1}^T \exp\{\mbx_t^\top \mbx_{s'}\}}.$$

(We drop the superscript ${}^{(\ell)}$ for clarity in this section.)

In practice, different feature dimensions convey different kinds of information. Transformers use separate linear projections for the queries and keys:

$$A_{t,s} = \frac{\exp \{ (\mbW_q \mbx_t)^\top (\mbW_k \mbx_s) / \sqrt{K}\}}{\sum_{s'=1}^T \exp\{(\mbW_q \mbx_t)^\top (\mbW_k \mbx_{s'}) / \sqrt{K}\}},$$

where $\mbW_q, \mbW_k \in \reals^{K \times D}$ are learned projection matrices, $\mbW_q \mbx_t \in \reals^{K}$ are the queries, $\mbW_k \mbx_s \in \reals^{K}$ are the keys, and the $1/\sqrt{K}$ factor prevents the dot products from growing large in magnitude Vaswani et al., 2017.
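Here is a small sketch of this computation for a single head, including the causal mask from the previous subsection; the random inputs and weights are placeholders for learned parameters.

```python
import numpy as np

def self_attention(X, W_q, W_k, causal=True):
    """Attention matrix from query/key projections (single head)."""
    T = X.shape[0]
    Q, Ks = X @ W_q.T, X @ W_k.T                 # queries and keys, each (T, K)
    scores = Q @ Ks.T / np.sqrt(Q.shape[1])      # (T, T) scaled dot products
    if causal:
        scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    return A / A.sum(axis=1, keepdims=True)      # row-stochastic, lower triangular

T, D, K = 6, 8, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D))
W_q, W_k = rng.normal(size=(K, D)), rng.normal(size=(K, D))
A = self_attention(X, W_q, W_k)
Y = A @ X                                        # weighted average of token features
```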

[Figure: self-attention with queries and keys]

The KV Cache

During autoregressive generation, the model produces one token at a time. At step $t$, all keys $\mbW_k \mbx_s$ and values $\mbW_v \mbx_s$ for $s < t$ were already computed at previous steps. Rather than recomputing them, an efficient implementation stores them in a KV cache and retrieves them when generating each new token. This reduces the per-step cost of forming keys and values from $\cO(t D^2)$ to $\cO(D^2)$, but the cache grows linearly with the context length, making it a key memory bottleneck at inference time.
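A toy sketch of the idea (single head, unbatched, with names of our own choosing); real implementations preallocate the cache and fuse these operations.

```python
import numpy as np

class KVCache:
    """Toy single-head KV cache: each decoding step projects only the newest
    token and attends over the stored keys and values."""
    def __init__(self, W_q, W_k, W_v):
        self.W_q, self.W_k, self.W_v = W_q, W_k, W_v
        self.keys, self.values = [], []

    def step(self, x_t):
        q_t = self.W_q @ x_t
        self.keys.append(self.W_k @ x_t)            # cache k_t = W_k x_t
        self.values.append(self.W_v @ x_t)          # cache v_t = W_v x_t
        Ks, Vs = np.stack(self.keys), np.stack(self.values)   # (t, K) each
        scores = Ks @ q_t / np.sqrt(len(q_t))       # attend over all cached positions
        a = np.exp(scores - scores.max())
        a /= a.sum()
        return a @ Vs                               # attention output at position t

D, K = 8, 4
rng = np.random.default_rng(0)
cache = KVCache(*(rng.normal(size=(K, D)) for _ in range(3)))
for x_t in rng.normal(size=(5, D)):                 # process 5 tokens one at a time
    y_t = cache.step(x_t)
```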

Multi-Headed Self-Attention

Just as a CNN uses a bank of filters in parallel, a transformer block uses $H$ attention heads in parallel. Let

$$\mbY^{(\ell,h)} = \mbA^{(\ell,h)} \mbX^{(\ell-1)} \mbW_v^{(\ell,h)\top} \in \reals^{T \times K},$$

where

$$A_{t,s}^{(\ell,h)} = \frac{\exp \{ (\mbW_q^{(\ell,h)} \mbx_t^{(\ell-1)})^\top (\mbW_k^{(\ell,h)} \mbx_s^{(\ell-1)}) / \sqrt{K}\}}{\sum_{s'=1}^T \exp\{(\mbW_q^{(\ell,h)} \mbx_t^{(\ell-1)})^\top (\mbW_k^{(\ell,h)} \mbx_{s'}^{(\ell-1)}) / \sqrt{K}\}}$$

for $h = 1, \ldots, H$. The outputs are projected and summed:

$$\mbY^{(\ell)} = \sum_{h=1}^H \mbY^{(\ell,h)} \mbW_o^{(\ell,h)\top} \triangleq \texttt{mhsa}(\mbX^{(\ell-1)}),$$

where $\mbW_o^{(\ell,h)} \in \reals^{D \times K}$ maps each head’s output back to the token dimension.
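Putting the pieces together, here is a compact sketch of multi-headed self-attention with the shapes above; the causal mask is omitted for brevity, and the explicit loop over heads is for clarity rather than speed.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def mhsa(X, W_q, W_k, W_v, W_o):
    """Multi-headed self-attention. W_q, W_k, W_v have shape (H, K, D);
    W_o has shape (H, D, K)."""
    T, D = X.shape
    H, K, _ = W_q.shape
    Y = np.zeros((T, D))
    for h in range(H):                                        # loop over heads for clarity
        Q, Ks, V = X @ W_q[h].T, X @ W_k[h].T, X @ W_v[h].T   # each (T, K)
        A = softmax_rows(Q @ Ks.T / np.sqrt(K))               # (T, T) attention weights
        Y += (A @ V) @ W_o[h].T                               # project back to D dims and sum over heads
    return Y

T, D, H, K = 6, 16, 4, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D))
W_q, W_k, W_v = (rng.normal(size=(H, K, D)) for _ in range(3))
W_o = rng.normal(size=(H, D, K))
Y = mhsa(X, W_q, W_k, W_v, W_o)        # (T, D)
```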

[Figure: multi-headed self-attention]

The Residual Stream

The residual connections in a transformer — introduced as an optimization technique in He et al. (2016) — have a deeper architectural interpretation. Writing out the full forward pass,

$$\mbx_t^{(L)} = \mbx_t^{(0)} + \sum_{\ell=1}^{L} \left[ \texttt{mhsa}^{(\ell)}(\mbX^{(\ell-1)})_t + \texttt{mlp}^{(\ell)}(\mby_t^{(\ell)}) \right],$$

reveals that every component — every attention head and every MLP — adds its output directly onto a shared residual stream that begins as the token embedding and accumulates updates across all layers Elhage et al., 2021.

This perspective has several consequences.

Token-wise Nonlinearity

After the multi-headed self-attention step, the transformer applies a token-wise nonlinear transformation to mix feature dimensions. This is done with a feedforward network applied identically at each position,

$$\mbx_t^{(\ell)} = \texttt{mlp}(\mby_t^{(\ell)}).$$
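The text does not specify the form of the MLP; a common choice (assumed in this sketch) is a two-layer network with a hidden width of $4D$ and a ReLU or GELU nonlinearity, applied independently to every row.

```python
import numpy as np

def mlp(Y, W1, b1, W2, b2):
    """Token-wise feedforward network applied identically to every row of Y.
    Hidden width 4*D and ReLU are assumed choices, not mandated by the text."""
    H = np.maximum(Y @ W1.T + b1, 0.0)   # (T, 4D) hidden activations
    return H @ W2.T + b2                 # back to (T, D)

T, D = 6, 16
rng = np.random.default_rng(0)
Y = rng.normal(size=(T, D))
W1, b1 = rng.normal(size=(4 * D, D)), np.zeros(4 * D)
W2, b2 = rng.normal(size=(D, 4 * D)), np.zeros(D)
X_next = mlp(Y, W1, b1, W2, b2)          # same shape as Y: (T, D)
```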

Mixture of Experts

A mixture of experts (MoE) layer replaces the single MLP in each transformer block with $E$ independent “expert” MLPs and a learned router that selects a sparse subset of them for each token Shazeer et al., 2017. Given a token representation $\mby_t$, the router computes a probability distribution over experts,

$$g_t = \mathrm{softmax}(\mbW_g\, \mby_t) \in \reals^E,$$

and routes the token to the top-$k$ experts (typically $k = 1$ or $k = 2$):

$$\texttt{moe}(\mby_t) = \sum_{e \in \mathrm{top\text{-}k}(g_t)} g_{t,e}\; \texttt{mlp}_e(\mby_t).$$

Because only $k \ll E$ experts are active per token, the computation per token is comparable to that of a dense model with a single MLP, while the total number of parameters scales with $E$. This decoupling of parameter count from compute allows MoE models to reach much higher capacity for the same training FLOP budget. Fedus et al. (2022) showed that even $k = 1$ (routing each token to a single expert) works well at scale; modern deployments such as Mixtral 8×7B use $E = 8$ experts with $k = 2$. The main practical challenges are load balancing (ensuring tokens are spread roughly evenly across experts, typically enforced with an auxiliary loss) and the communication overhead of routing tokens to different devices in a distributed setting.
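A minimal sketch of the routing computation for a single token, with random linear maps standing in for the expert MLPs:

```python
import numpy as np

def moe(y_t, W_g, experts, k=2):
    """Route one token to its top-k experts and combine their outputs,
    weighted by the router probabilities (a sketch of the equations above)."""
    logits = W_g @ y_t
    g = np.exp(logits - logits.max())
    g /= g.sum()                             # router probabilities over E experts
    top = np.argsort(g)[-k:]                 # indices of the top-k experts
    return sum(g[e] * experts[e](y_t) for e in top)

D, E = 8, 4
rng = np.random.default_rng(0)
W_g = rng.normal(size=(E, D))
# Each expert would be a token-wise MLP; random linear maps keep the sketch short.
expert_weights = [rng.normal(size=(D, D)) for _ in range(E)]
experts = [lambda y, W=W: W @ y for W in expert_weights]
y_t = rng.normal(size=D)
out = moe(y_t, W_g, experts, k=2)
```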

Layer Norm

LayerNorm stabilizes training by z-scoring each token and applying a learned shift and scale:

$$\texttt{layer-norm}(\mbx_t) = \mbbeta + \mbgamma \odot \left( \frac{\mbx_t - \texttt{mean}(\mbx_t)}{\texttt{std}(\mbx_t)} \right),$$

where $\mbbeta, \mbgamma \in \reals^D$ are learned parameters. LayerNorm is applied before each sub-layer (Pre-LN), which yields more stable training than the original Post-LN design:

$$\begin{aligned} \mbY^{(\ell)} &= \mbX^{(\ell-1)} + \texttt{mhsa}(\texttt{layer-norm}(\mbX^{(\ell-1)})) \\ \mbX^{(\ell)} &= \mbY^{(\ell)} + \texttt{mlp}(\texttt{layer-norm}(\mbY^{(\ell)})). \end{aligned}$$

This defines one $\texttt{transformer-block}$. A transformer stacks $L$ such blocks to produce a deep sequence-to-sequence model.
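Combining the pieces, a sketch of one Pre-LN block might look as follows, reusing the `mhsa` and `mlp` functions from the earlier sketches (so this fragment is not self-contained on its own).

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """Normalize each row (token) and apply the learned scale and shift."""
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return beta + gamma * (X - mu) / (sigma + eps)

def pre_ln_block(X, p):
    """One Pre-LN transformer block, following the two equations above.
    p["attn"] = (W_q, W_k, W_v, W_o) and p["mlp"] = (W1, b1, W2, b2) are
    consumed by the mhsa and mlp sketches; p["ln1"], p["ln2"] are (gamma, beta) pairs."""
    Y = X + mhsa(layer_norm(X, *p["ln1"]), *p["attn"])
    return Y + mlp(layer_norm(Y, *p["ln2"]), *p["mlp"])
```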

Positional Encodings

Without explicit position information, a transformer treats its inputs as an unordered set of tokens — the architecture is permutation-equivariant (subject only to the causal mask). When the data have spatial or temporal structure, position must be injected explicitly.

Absolute Positional Encodings

The original transformer adds a fixed position vector to each token embedding:

$$\mbx_t^{(0)} \leftarrow \mbx_t^{(0)} + \mbp_t,$$

where $\mbp_t \in \reals^D$ encodes the position using sinusoidal basis functions Vaswani et al., 2017. Learned absolute position embeddings are also common.
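A small sketch of the sinusoidal construction (even dimensions carry sines, odd dimensions cosines, with geometrically spaced frequencies):

```python
import numpy as np

def sinusoidal_positions(T, D, base=10000.0):
    """Fixed sinusoidal position vectors p_1, ..., p_T as in Vaswani et al. (2017)."""
    t = np.arange(T)[:, None]                  # positions
    i = np.arange(D // 2)[None, :]             # frequency index
    angles = t / base ** (2 * i / D)           # (T, D/2)
    P = np.zeros((T, D))
    P[:, 0::2] = np.sin(angles)                # even dims: sine
    P[:, 1::2] = np.cos(angles)                # odd dims: cosine
    return P

P = sinusoidal_positions(T=8, D=512)           # add P to the token embeddings X^(0)
```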

Rotary Position Embeddings (RoPE)

Absolute encodings bake position into the token embedding before the attention computation, which limits the model’s ability to generalize to sequence lengths unseen during training. Rotary Position Embeddings (RoPE) Su et al., 2024 instead encode position directly into the query–key dot product by rotating query and key vectors before the inner product.

Concretely, partition the $K$-dimensional query and key into $K/2$ pairs of consecutive dimensions. For pair $i$, define the $2 \times 2$ rotation matrix,

$$\mbR_t^{(i)} = \begin{pmatrix} \cos(t\,\theta_i) & -\sin(t\,\theta_i) \\ \sin(t\,\theta_i) & \cos(t\,\theta_i) \end{pmatrix},$$

with base frequencies $\theta_i = b^{-2i/K}$ for a large base $b$ (commonly $b = 10000$). Apply these rotations to each pair of query and key dimensions before computing the attention weights:

$$A_{t,s} \propto \exp\!\left\{ (\mbR_t \mbq_t)^\top (\mbR_s \mbk_s) / \sqrt{K} \right\}.$$

Because $(\mbR_t \mbq_t)^\top (\mbR_s \mbk_s) = \mbq_t^\top \mbR_t^\top \mbR_s \mbk_s = \mbq_t^\top \mbR_{s-t} \mbk_s$, the dot product depends only on the relative position $t - s$. This relative-position property makes RoPE extrapolate more gracefully to longer sequences than absolute encodings. RoPE is now the dominant positional encoding in open-weight LLMs (LLaMA, Mistral, Qwen, Gemma).
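A minimal sketch of applying RoPE to a single query or key vector, followed by a numerical check that the rotated dot product depends only on the positional offset:

```python
import numpy as np

def rope(v, t, base=10000.0):
    """Rotate a query or key vector v (length K, K even) by position t."""
    K = v.shape[0]
    theta = base ** (-2.0 * np.arange(K // 2) / K)   # per-pair base frequencies
    cos, sin = np.cos(t * theta), np.sin(t * theta)
    v1, v2 = v[0::2], v[1::2]                        # the i-th pair of consecutive dims
    out = np.empty_like(v)
    out[0::2] = cos * v1 - sin * v2                  # 2x2 rotation applied to each pair
    out[1::2] = sin * v1 + cos * v2
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
d1 = rope(q, t=5) @ rope(k, t=2)                     # positions (5, 2)
d2 = rope(q, t=13) @ rope(k, t=10)                   # positions (13, 10): same offset of 3
print(np.allclose(d1, d2))                           # True
```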

Autoregressive Modeling

To use a transformer for autoregressive modeling, we read predictions off the final layer’s representations. To predict the next token $z_{t+1} \in \{1, \ldots, |\cV|\}$ given past tokens $z_{1:t}$:

$$z_{t+1} \sim \mathrm{Cat}(\mathrm{softmax}(\mbW_u \mbx_t^{(L)})),$$

where $\mbW_u \in \reals^{|\cV| \times D}$ is the unembedding matrix. Like hidden states in an RNN, the final-layer representations $\mbx_t^{(L)}$ aggregate information from all tokens up to index $t$.
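A short sketch of reading off next-token probabilities and sampling, with random placeholders standing in for the trained unembedding matrix and final-layer representation:

```python
import numpy as np

def sample_next_token(x_t_L, W_u, rng):
    """Sample z_{t+1} from Cat(softmax(W_u x_t^(L)))."""
    logits = W_u @ x_t_L                 # (|V|,) unembedding logits
    p = np.exp(logits - logits.max())
    p /= p.sum()                         # softmax probabilities
    return rng.choice(len(p), p=p)

V, D = 1000, 64
rng = np.random.default_rng(0)
W_u = rng.normal(size=(V, D))
x_t_L = rng.normal(size=D)               # final-layer representation at position t
z_next = sample_next_token(x_t_L, W_u, rng)
```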

Training

Standard practice is to use the AdamW optimizer with gradient clipping, a warmup-then-cosine learning rate schedule, and dropout. Treat these as hyperparameters; the optimal settings are model- and data-dependent.
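For concreteness, here is one way such a setup might look in PyTorch; every numerical value below (learning rate, betas, weight decay, warmup length, clipping norm) is a hypothetical placeholder rather than a recommendation, and the model and loss are stand-ins.

```python
import math
import torch

model = torch.nn.Linear(512, 512)          # stand-in for a transformer
max_steps, warmup_steps = 10_000, 500
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

def lr_scale(step):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_scale)

for step in range(max_steps):
    loss = model(torch.randn(8, 512)).pow(2).mean()   # placeholder loss
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    opt.step()
    sched.step()
```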

Scaling Laws

One of the most practically useful findings in LLM research is that validation loss follows power-law scaling in model size $N$ (number of parameters) and dataset size $D$ (number of training tokens) Kaplan et al., 2020. The loss decreases smoothly as either resource grows, with the two contributions approximately additive, meaning a bottleneck on one resource cannot be compensated for by adding more of the other.

An immediate consequence is that for a fixed compute budget of $C \approx 6ND$ FLOPs, there is an optimal allocation between model size and training tokens. Kaplan et al. (2020) initially argued that model size should be prioritized, leading to the practice of training large models on relatively few tokens. Hoffmann et al. (2022) revisited this with more careful experiments and found that model size and training tokens should scale in roughly equal proportion: the Chinchilla result is that the compute-optimal number of training tokens is approximately $D^* \approx 20N$. Concretely, a 7B-parameter model trained compute-optimally should see around 140B tokens, far more data than earlier models of comparable size used.
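Under these two approximations, the compute-optimal allocation has a closed form, sketched below for a hypothetical budget.

```python
def chinchilla_allocation(C):
    """Split a compute budget C (FLOPs) using C ≈ 6*N*D and D* ≈ 20*N."""
    N = (C / 120.0) ** 0.5          # solve C = 6 * N * (20 N) = 120 N^2
    return N, 20.0 * N              # (parameters, training tokens)

N, D = chinchilla_allocation(1e23)  # a hypothetical ~1e23 FLOP budget
print(f"params ≈ {N:.2e}, tokens ≈ {D:.2e}")   # ≈ 2.9e10 params, 5.8e11 tokens
```

At Chinchilla's own budget of roughly $5.8 \times 10^{23}$ FLOPs, this recovers its approximately 70B parameters and 1.4T training tokens.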

This has had an outsized practical impact: post-Chinchilla models (LLaMA and its successors) train smaller models on far more data than pre-Chinchilla norms, achieving better performance at inference time for the same training compute.

Open-Weight LLMs

The transformer landscape is evolving too rapidly for any static summary to remain useful for long. That said, students interested in working with LLMs directly have access to a growing ecosystem of high-quality open-weight models, i.e., models whose weights are publicly released even if training details are not always fully disclosed. Current families worth being aware of include Meta’s LLaMA series Touvron et al., 2023Grattafiori et al., 2024, Mistral AI’s Mistral and Mixtral models Jiang et al., 2023Jiang et al., 2024, Google’s Gemma family Gemma Team, 2024, and Alibaba’s Qwen series Qwen Team, 2025. These span a range of sizes (1B–405B parameters) and are available through Hugging Face, Ollama, and similar platforms. All of them implement core building blocks described in this chapter, such as Pre-LN and RoPE, together with refinements like grouped-query attention (GQA) Ainslie et al., 2023 and SwiGLU feedforward layers Shazeer, 2020; the main differences lie in scale, training data, and fine-tuning for instruction following.

Conclusion

Transformers achieve parallelism over a full sequence by replacing the sequential hidden-state recursion with multi-headed self-attention: each token directly attends to all previous tokens, enabling gradient information to flow in a single step rather than through $T$ chained Jacobians. Viewing the architecture through the lens of the residual stream clarifies that attention heads and MLPs all operate on a shared representational space, with each component contributing a low-rank additive update. The quadratic $\cO(T^2)$ cost of softmax attention is a practical bottleneck for long sequences and has motivated substantial recent work on more efficient alternatives.

References
  1. Turner, R. E. (2023). An Introduction to Transformers. arXiv Preprint arXiv:2304.10557.
  2. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1715–1725.
  3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. Advances in Neural Information Processing Systems, 30.
  4. Qwen Team. (2025). Qwen3 Technical Report. arXiv Preprint arXiv:2505.09388.
  5. Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., & Sanghai, S. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 4895–4901.
  6. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
  7. Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., & others. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
  8. Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv Preprint arXiv:2002.05202.
  9. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. International Conference on Learning Representations.
  10. Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research, 23(1), 5232–5270.
  11. Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., & Liu, Y. (2024). RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, 568, 127063.
  12. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv Preprint arXiv:2001.08361.
  13. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J., Clark, A., & others. (2022). Training Compute-Optimal Large Language Models. Advances in Neural Information Processing Systems, 35, 30016–30030.
  14. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., & others. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv Preprint arXiv:2307.09288.
  15. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Keskar, A., & others. (2024). The Llama 3 Herd of Models. arXiv Preprint arXiv:2407.21783.
  16. Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., & others. (2023). Mistral 7B. arXiv Preprint arXiv:2310.06825.
  17. Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., & others. (2024). Mixtral of Experts. arXiv Preprint arXiv:2401.04088.
  18. Gemma Team. (2024). Gemma: Open Models Based on Gemini Research and Technology. arXiv Preprint arXiv:2403.08295.