Multi-Head Latent Attention (MLA) in DeepSeek-V3 reduces KV-cache memory usage by over 80% compared to traditional Multi-Head Attention by storing compressed key-value representations instead of full per-head key and value matrices. The technique uses low-rank compression to project keys and values into a small shared latent vector, then reconstructs the full per-head keys and values at attention time, enabling 128,000-token context windows while maintaining quality comparable to standard Multi-Head Attention.
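To make the mechanism concrete, here is a minimal sketch of the low-rank compress-then-reconstruct idea in PyTorch. All dimensions (`d_model`, `d_latent`, `n_heads`, `d_head`) and the class name `LatentKVAttention` are illustrative, not DeepSeek-V3's actual values or API, and the sketch omits details of the real architecture such as RoPE and its decoupled positional path. It shows only the core trick: cache the small latent vector, not the full keys and values.

```python
# Illustrative sketch of MLA-style low-rank KV compression.
# Assumptions: toy dimensions, no RoPE, single attention layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    def __init__(self, d_model=512, d_latent=64, n_heads=8, d_head=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_q = nn.Linear(d_model, n_heads * d_head, bias=False)
        # Down-projection: compress the hidden state into a small latent
        # vector; this latent is what gets cached instead of full K/V.
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections: reconstruct per-head keys and values on the fly.
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_out = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        # x: (batch, seq, d_model); latent_cache: (batch, past, d_latent)
        b, t, _ = x.shape
        c_kv = self.w_down_kv(x)  # compressed KV state for the new tokens
        if latent_cache is not None:
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        s = c_kv.shape[1]
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Reconstruct full per-head keys/values from the cached latent.
        k = self.w_up_k(c_kv).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_up_v(c_kv).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(
            q, k, v, is_causal=latent_cache is None  # causal mask during prefill
        )
        out = attn.transpose(1, 2).reshape(b, t, self.n_heads * self.d_head)
        # Return the latent so the caller can reuse it as the KV cache.
        return self.w_out(out), c_kv
```

The memory saving falls out of the cache shape: per token, the cache holds `d_latent` values instead of `2 * n_heads * d_head` (both keys and values for every head). With the toy numbers above that is 64 floats versus 1,024, a roughly 94% reduction, which is consistent with the 80%+ savings cited for MLA.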