DeepSeek-V3 features several major architectural innovations, among them Multi-head Latent Attention (MLA), which speeds up inference by reducing memory usage while maintaining performance. The background concepts are the Key-Value (KV) cache, Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Rotary Positional Embeddings (RoPE). MLA compresses attention inputs into low-dimensional latent vectors, so less data has to be cached per token during generation. The goal is to keep language-model performance robust by balancing memory use against modeling capacity.
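To make the memory trade-off concrete, here is a minimal sketch that counts how many values each attention variant caches per token per layer. The function name `kv_cache_elems_per_token` and all dimensions used below (32 heads, head size 128, 8 KV groups, latent size 512, RoPE key size 64) are illustrative assumptions, not DeepSeek-V3's actual configuration. The MLA line assumes that only the compressed latent vector plus a small positional key is cached.

```python
def kv_cache_elems_per_token(scheme, n_heads, d_head,
                             n_kv_groups=1, d_latent=0, d_rope=0):
    """Return the number of cached values per token per layer.

    All parameter names are hypothetical and chosen for illustration.
    """
    if scheme == "mha":
        # Every head stores its own key and value vector.
        return 2 * n_heads * d_head
    if scheme == "gqa":
        # Heads within a group share one key and one value vector.
        return 2 * n_kv_groups * d_head
    if scheme == "mqa":
        # All heads share a single key and a single value vector.
        return 2 * d_head
    if scheme == "mla":
        # Assumption: only the latent vector and a small RoPE key are cached.
        return d_latent + d_rope
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    # Illustrative dimensions only, not taken from any real model config.
    for scheme in ("mha", "gqa", "mqa", "mla"):
        elems = kv_cache_elems_per_token(
            scheme, n_heads=32, d_head=128,
            n_kv_groups=8, d_latent=512, d_rope=64,
        )
        print(f"{scheme}: {elems} cached values per token per layer")
```

With these made-up dimensions, MHA caches 8192 values per token per layer and the MLA-style scheme caches 576, roughly 14 times fewer. MQA caches fewer still, but it does so by sharing one key and value across all heads, which is the capacity cost that MLA is meant to avoid.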

10m read time
From towardsdatascience.com
Table of contents
MHA in Decoder-only Transformers
Key-Value Cache
Multi-Query Attention (MQA) vs Grouped-Query Attention (GQA)
RoPE (Rotary Positional Embeddings)
