decoder-only architecture, kv cache, and mla
A lot has changed since the 2017 Transformer. That model lapped RNNs at sequence-to-sequence tasks (machine translation, summarization, speech-to-text) with its attention-based encoder-decoder architecture, but modern LLMs embrace a simpler design optimized for next-token prediction. Let's trace this architectural evolution and examine two optimizations in particular that make modern inference practical: KV caching and Multi-head Latent Attention (MLA).
the classical transformer decoder block (2017)
layer =
LayerNorm
├─ masked self-attention (causal)
├─ residual
LayerNorm
├─ cross-attention over encoder K,V
├─ residual
LayerNorm
├─ feed-forward network (FFN)
└─ residual
Cross-attention distinguishes decoder blocks from their encoder counterparts. Instead of computing keys (K) and values (V) from its own hidden states, the decoder uses the encoder's final hidden states as the source of K and V while computing queries (Q) from its own current state. This mechanism lets each decoder position selectively draw information from the source sequence, reminiscent of Bahdanau's original intent (Bahdanau et al., 2014).
The decoder also introduced causal masking in its self-attention layer, ensuring that each position's prediction depends only on earlier tokens. Blocking this leakage from future tokens during training is what makes autoregressive generation work at inference time.
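As a minimal sketch of both mechanisms (single head, no batch dimension, and the learned Q/K/V projection matrices omitted for brevity; the function and variable names here are illustrative, not from any library), the same scaled dot-product routine serves both roles: self-attention draws Q, K, and V from the decoder's own states and applies the causal mask, while cross-attention takes K and V from the encoder output and needs no mask.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product(q, k, v, causal=False):
    # q: (seq_q, d); k, v: (seq_kv, d). Single head, no batch, projections omitted.
    scores = q @ k.T / k.shape[-1] ** 0.5             # (seq_q, seq_kv)
    if causal:
        # each query position i may only attend to key positions j <= i
        future = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    return F.softmax(scores, dim=-1) @ v              # (seq_q, d)

d = 64
dec_hidden = torch.randn(5, d)    # decoder states for 5 target tokens so far
enc_hidden = torch.randn(9, d)    # encoder output for 9 source tokens

# masked self-attention: Q, K, V all come from the decoder's own states
self_out = scaled_dot_product(dec_hidden, dec_hidden, dec_hidden, causal=True)

# cross-attention: Q from the decoder, K and V from the encoder's output
cross_out = scaled_dot_product(dec_hidden, enc_hidden, enc_hidden)
```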
the modern decoder block
layer =
LayerNorm
├─ masked self-attention
├─ residual
LayerNorm
├─ feed-forward (Gated-SwiGLU, MoE, or multi-branch)
└─ residual
The modern architecture dispenses with cross-attention entirely—there's no encoder to reference—relying solely on masked self-attention. This simplification, combined with innovations in feed-forward networks like Gated-SwiGLU and Mixture of Experts (MoE), has proven to be remarkably powerful.
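For concreteness, here is a rough PyTorch sketch of such a block, assuming pre-norm residual connections, PyTorch's built-in multi-head attention, and a SwiGLU feed-forward; the module names and sizes are made up, and details like RMSNorm, rotary embeddings, grouped-query attention, and MoE routing are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward: (SiLU(x W_gate) * x W_up) W_down."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class DecoderBlock(nn.Module):
    """Norm -> masked self-attention -> residual, then norm -> FFN -> residual."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x):
        seq_len = x.shape[1]
        # causal mask: True entries (future positions) are blocked
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        a = self.norm1(x)
        h = x + self.attn(a, a, a, attn_mask=mask, need_weights=False)[0]
        return h + self.ffn(self.norm2(h))

x = torch.randn(2, 10, 512)          # (batch, seq, d_model)
print(DecoderBlock()(x).shape)       # torch.Size([2, 10, 512])
```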
modern optimizations
KV Cache
Every attention layer computes three key components during generation:
Q_i - what information do I want right now at token position i?
K_j - what information is stored at position j?
V_j - the 'payload' (content) stored at position j
The crucial observation: while Q must be computed fresh for each new token, the K and V vectors of past tokens never change. By caching them we avoid re-projecting the entire prefix at every step, dramatically reducing computation at the cost of memory, the trade-off that defines modern inference optimization.
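A toy sketch of the idea (single attention head, random matrices standing in for learned projection weights, all names hypothetical): each decoding step computes a fresh query but only appends one new row to the key and value caches, instead of re-projecting the entire prefix.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64
W_q, W_k, W_v = (torch.randn(d, d) / d**0.5 for _ in range(3))

def attend(q, K, V):
    scores = q @ K.T / d**0.5
    return F.softmax(scores, dim=-1) @ V

K_cache = torch.zeros(0, d)
V_cache = torch.zeros(0, d)

for step in range(8):                      # generate 8 tokens
    x = torch.randn(1, d)                  # hidden state of the newest token
    q = x @ W_q                            # Q is always computed fresh
    # K and V for this token never change again, so append them once...
    K_cache = torch.cat([K_cache, x @ W_k])
    V_cache = torch.cat([V_cache, x @ W_v])
    # ...and attend over the cached rows instead of recomputing
    # K = H @ W_k and V = H @ W_v over the full prefix H at every step.
    out = attend(q, K_cache, V_cache)      # (1, d)
```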
Multi-head Latent Attention (MLA)
DeepSeek-V2 introduced MLA as an elegant answer to the memory demands of KV caching. The innovation lies in projecting keys and values into a compressed latent space, caching only that latent, and reconstructing them when needed.
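The sketch below shows only the core compression idea under simplifying assumptions (made-up sizes, no decoupled rotary-embedding path, one token per step); it is not the paper's full formulation. Each token contributes one small latent vector to the cache, and full keys and values are reconstructed from the latents only when attention is computed.

```python
import torch
import torch.nn as nn

# illustrative sizes: cache one 128-dim latent per token instead of
# 2 * n_heads * d_head = 1024 numbers for full keys and values
d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64

down = nn.Linear(d_model, d_latent, bias=False)           # compress hidden state
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct all heads' K
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct all heads' V

latent_cache = []                    # the only thing stored between decode steps

for _ in range(4):                   # pretend we decode four tokens
    h = torch.randn(1, d_model)      # hidden state of the newest token
    latent_cache.append(down(h))     # cache d_latent numbers, not full K and V

C = torch.cat(latent_cache)              # (seq, d_latent) compressed KV cache
K = up_k(C).view(-1, n_heads, d_head)    # materialized only while attention runs
V = up_v(C).view(-1, n_heads, d_head)
```

The paper additionally notes that the up-projection matrices can be absorbed into the query and output projections at inference time, so the reconstructed K and V never need to be fully materialized; the sketch keeps them explicit for readability.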
DeepSeek-V2 reports a 93.3% reduction in KV cache memory footprint and 5.76× higher generation throughput (compared with DeepSeek 67B), without sacrificing benchmark performance.
Some Reflections
Attention itself began as a solution to RNN limitations, with Bahdanau's additive attention enabling adaptive focus on input sequences. This evolved into Luong's dot-product attention, simplifying computation through matrix multiplication. The Transformer architecture then revolutionized the field by eliminating recurrence entirely, introducing scaled dot-product attention and multi-head mechanisms.
GPT-1 marked another pivotal moment, demonstrating that decoder-only architectures could excel at next-token prediction. Performance optimizations followed naturally: KV caches emerged in Google's Tensor2Tensor and GPT-1, while attention variants like MQA, GQA, and MLA tackled the memory and bandwidth costs of the KV cache.
Each innovation builds upon and reimagines what came before—scrapping some ideas, modifying others, and occasionally returning to old concepts with fresh perspective. In this light, each milestone in attention's story isn't just a technical achievement, but a testament to how research advances: through the delicate art of knowing what foundations to build upon, what constraints to shed, and what new directions to explore.
References
- DeepSeek-AI. (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint arXiv:2405.04434.