DeepSeek-V3 is a powerful Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated for each token. Its architecture employs Multi-head Latent Attention (MLA) and DeepSeekMoE, and it adopts an auxiliary-loss-free strategy for load balancing. The model is pre-trained on 14.8 trillion diverse, high-quality tokens.
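
To make the "37B of 671B parameters activated per token" idea concrete, here is a minimal sketch of generic top-k MoE routing: a router scores all experts for each token, only the top-k experts run, and their outputs are combined with the normalized router weights. This is an illustration of the general technique, not DeepSeek-V3's actual DeepSeekMoE or MLA implementation; the class name, layer sizes, and expert counts are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative top-k MoE layer (not DeepSeek's implementation):
    each token is routed to only k of num_experts experts, so only a
    small fraction of the layer's parameters is active per token."""
    def __init__(self, d_model=512, d_ff=1024, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                              # x: (num_tokens, d_model)
        scores = self.router(x)                        # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        gates = F.softmax(topk_scores, dim=-1)         # weights over selected experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # dispatch tokens to their chosen experts
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: 8 experts exist, but each token only runs through 2 of them.
layer = TopKMoELayer()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```

In a production MoE system the per-expert loop would be replaced by batched expert dispatch, and load balancing (which DeepSeek-V3 handles without an auxiliary loss) keeps tokens from piling onto a few experts.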
