A comprehensive visual guide covering the major attention variants used in modern open-weight LLMs. Starting from the fundamentals of Multi-Head Attention (MHA), it progresses through Grouped-Query Attention (GQA), Multi-Head Latent Attention (MLA), Sliding Window Attention (SWA), DeepSeek Sparse Attention (DSA), Gated Attention, and hybrid architectures combining linear/state-space modules with full attention. Each variant is explained with its motivation (primarily KV-cache efficiency and long-context scaling), trade-offs, and real-world adoption in models like Llama 3, Gemma 3, DeepSeek V3, Qwen3, and Kimi K2. The guide also discusses emerging hybrid designs (Qwen3-Next, Kimi Linear, Ling 2.5, Nemotron) that replace most attention layers with cheaper recurrent or linear mechanisms while retaining a few full-attention layers for retrieval.
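Since KV-cache efficiency is the recurring motivation, a back-of-the-envelope comparison helps frame the variants. The sketch below (not from the guide; all dimensions are illustrative assumptions) estimates per-token KV-cache size for MHA, GQA, and MLA under roughly Llama-3-8B-scale settings.

```python
# Rough KV-cache size per token for different attention variants.
# All dimensions are illustrative assumptions, not values from the guide.

def kv_bytes_per_token(n_kv_heads: int, head_dim: int, n_layers: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache stored per token: keys + values across all layers."""
    return 2 * n_kv_heads * head_dim * n_layers * dtype_bytes

# MHA: every query head keeps its own K/V head (e.g. 32 KV heads of dim 128, 32 layers).
mha = kv_bytes_per_token(n_kv_heads=32, head_dim=128, n_layers=32)

# GQA: query heads share a small set of K/V heads (e.g. 8 KV heads, as in Llama-3-8B-scale configs).
gqa = kv_bytes_per_token(n_kv_heads=8, head_dim=128, n_layers=32)

# MLA: caches one compressed latent vector per layer instead of full K and V
# (e.g. a 512-dim latent, fp16), so there is no separate K/V factor of 2.
mla = 512 * 32 * 2  # latent_dim * n_layers * dtype_bytes

for name, size in [("MHA", mha), ("GQA", gqa), ("MLA", mla)]:
    print(f"{name}: {size / 1024:.0f} KiB per token")
```

The ordering of the savings (MHA > GQA > MLA per-token cache) is what drives the progression of variants described above; the exact numbers depend on the model configuration.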
Table of contents
1. Multi-Head Attention (MHA)
2. Grouped-Query Attention (GQA)
3. Multi-Head Latent Attention (MLA)
4. Sliding Window Attention (SWA)
5. DeepSeek Sparse Attention (DSA)
6. Gated Attention
7. Hybrid Attention
Conclusion