DeepSeek-V3 introduces several architectural innovations, among them Multi-head Latent Attention (MLA), which speeds up inference by reducing memory usage while maintaining performance. The background concepts needed to understand it are the Key-Value cache, Multi-Query Attention, Grouped-Query Attention, and Rotary Positional Embeddings. MLA compresses attention inputs into low-dimensional latent vectors, so less data has to be held in memory for each token. Together, these techniques trade off memory use against modeling capacity without sacrificing language model performance.
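To make the memory argument concrete, here is a minimal sketch that counts how many values per token would be cached under each attention variant. The function names and all configuration numbers (layer count, head count, head size, latent size) are hypothetical choices for illustration, not figures taken from this article, and the latent-cache formula assumes one compressed vector per layer plus a small shared positional key.

```python
# Rough per-token cache sizes for different attention variants.
# All configuration values below are illustrative assumptions.

def kv_cache_elems_per_token(n_layers: int, n_kv_heads: int, head_dim: int) -> int:
    """Cached values per token when keys and values are stored per KV head."""
    # Factor 2: one key vector and one value vector per KV head, per layer.
    return 2 * n_layers * n_kv_heads * head_dim


def latent_cache_elems_per_token(n_layers: int, latent_dim: int, rope_dim: int = 0) -> int:
    """Cached values per token when only a compressed latent vector is stored.

    Assumes one latent vector per layer, plus an optional positional key of
    size rope_dim that is shared across heads.
    """
    return n_layers * (latent_dim + rope_dim)


if __name__ == "__main__":
    n_layers, n_heads, head_dim = 32, 32, 128  # hypothetical model shape

    sizes = {
        "MHA (32 KV heads)": kv_cache_elems_per_token(n_layers, n_heads, head_dim),
        "GQA (8 KV heads)": kv_cache_elems_per_token(n_layers, 8, head_dim),
        "MQA (1 KV head)": kv_cache_elems_per_token(n_layers, 1, head_dim),
        "Latent (512 + 64)": latent_cache_elems_per_token(n_layers, 512, 64),
    }
    baseline = sizes["MHA (32 KV heads)"]
    for name, elems in sizes.items():
        print(f"{name:20s} {elems:>8d} values/token  ({elems / baseline:.1%} of MHA)")
```

Under these assumed numbers the latent cache sits between MQA and GQA in size. Unlike those two, it does not reduce the number of key and value heads; it stores a compressed representation instead.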
Table of contents
MHA in Decoder-only Transformers
Key-Value Cache
Multi-Query Attention (MQA) vs Grouped-Query Attention (GQA)
RoPE (Rotary Positional Embeddings)