A deep-dive into the evolution of attention mechanisms in large language models, building from first principles. Covers multi-head attention (MHA), KV caching, multi-query attention (MQA), grouped query attention (GQA), and multi-head latent attention (MLA) used in DeepSeek V3/R1. Explains how MLA achieves a 57x reduction in KV cache memory via low-rank compression. Also introduces DeepSeek Sparse Attention (DSA), which uses a 'lightning indexer' with 8-bit quantization and Hadamard transforms to select only the most relevant tokens, achieving 2-3x faster long-sequence processing with 30-40% less memory while maintaining model performance.
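To make the KV-cache savings concrete, here is a minimal sketch (not DeepSeek's actual code) contrasting what standard multi-head attention must cache per token with an MLA-style low-rank latent cache. All dimensions (`d_model`, `n_heads`, `d_latent`) are illustrative assumptions, not DeepSeek V3's real configuration, so the ratio printed is a toy number rather than the 57x figure cited above.

```python
# Minimal sketch: per-token KV cache of standard MHA vs. an MLA-style latent cache.
# Dimensions are illustrative assumptions, not DeepSeek V3's actual config.
import torch.nn as nn

d_model  = 4096            # hidden size (assumed)
n_heads  = 32              # attention heads (assumed)
d_head   = d_model // n_heads
d_latent = 512             # compressed KV latent dimension (assumed)

class MHACache(nn.Module):
    """Standard MHA: cache full per-head K and V for every past token."""
    def __init__(self):
        super().__init__()
        self.w_k = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.w_v = nn.Linear(d_model, n_heads * d_head, bias=False)

    def cached_floats_per_token(self) -> int:
        # K and V each store n_heads * d_head values per token.
        return 2 * n_heads * d_head

class MLACache(nn.Module):
    """MLA-style: cache only a small shared latent; K/V are re-expanded when needed."""
    def __init__(self):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # decompress to K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # decompress to V

    def cached_floats_per_token(self) -> int:
        # Only the latent vector is kept in the KV cache.
        return d_latent

if __name__ == "__main__":
    full = MHACache().cached_floats_per_token()
    latent = MLACache().cached_floats_per_token()
    print(f"MHA cache per token: {full} floats")
    print(f"MLA cache per token: {latent} floats")
    print(f"Compression ratio:   {full / latent:.1f}x")  # 16x with these toy numbers
```

The design point is that the cache stores the low-rank latent instead of the full keys and values, and the up-projections recreate per-head K/V on the fly; the actual compression ratio depends on the chosen latent dimension.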