A deep-dive into the evolution of attention mechanisms in large language models, building from first principles. Covers multi-head attention (MHA), KV caching, multi-query attention (MQA), grouped query attention (GQA), and multi-head latent attention (MLA) used in DeepSeek V3/R1. Explains how MLA achieves a 57x reduction in KV cache memory via low-rank compression. Also introduces DeepSeek Sparse Attention (DSA), which uses a 'lightning indexer' with 8-bit quantization and Hadamard transforms to select only the most relevant tokens, achieving 2-3x faster long-sequence processing with 30-40% less memory while maintaining model performance.
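To make the KV-cache savings concrete, here is a minimal sketch (not DeepSeek's actual code) contrasting what standard multi-head attention must cache per token with an MLA-style low-rank latent cache. All dimensions (`d_model`, `n_heads`, `d_latent`) are illustrative assumptions, not DeepSeek V3's real configuration, so the ratio printed is a toy number rather than the 57x figure cited above.

```python
# Minimal sketch: per-token KV cache of standard MHA vs. an MLA-style latent cache.
# Dimensions are illustrative assumptions, not DeepSeek V3's actual config.
import torch.nn as nn

d_model  = 4096            # hidden size (assumed)
n_heads  = 32              # attention heads (assumed)
d_head   = d_model // n_heads
d_latent = 512             # compressed KV latent dimension (assumed)

class MHACache(nn.Module):
    """Standard MHA: cache full per-head K and V for every past token."""
    def __init__(self):
        super().__init__()
        self.w_k = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.w_v = nn.Linear(d_model, n_heads * d_head, bias=False)

    def cached_floats_per_token(self) -> int:
        # K and V each store n_heads * d_head values per token.
        return 2 * n_heads * d_head

class MLACache(nn.Module):
    """MLA-style: cache only a small shared latent; K/V are re-expanded when needed."""
    def __init__(self):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # decompress to K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # decompress to V

    def cached_floats_per_token(self) -> int:
        # Only the latent vector is kept in the KV cache.
        return d_latent

if __name__ == "__main__":
    full = MHACache().cached_floats_per_token()
    latent = MLACache().cached_floats_per_token()
    print(f"MHA cache per token: {full} floats")
    print(f"MLA cache per token: {latent} floats")
    print(f"Compression ratio:   {full / latent:.1f}x")  # 16x with these toy numbers
```

The design point is that the cache stores the low-rank latent instead of the full keys and values, and the up-projections recreate per-head K/V on the fly; the actual compression ratio depends on the chosen latent dimension.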