DeepSeek-V3 is a Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated for each token. Its architecture combines Multi-head Latent Attention (MLA) with DeepSeekMoE and balances load across experts without an auxiliary loss. The model was pre-trained on 14.8 trillion diverse, high-quality tokens, then refined with fine-tuning and reinforcement learning. Training required 2.788M H800 GPU hours. DeepSeek-V3 outperforms other open-source models and achieves performance comparable to top closed-source models.
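To put the quoted figures in proportion, the short sketch below computes the share of parameters active per token and a rough average of training tokens per GPU hour. It uses only the numbers given in the description. The function and constant names are illustrative. The description does not say which training stages the GPU-hour total covers, so the second ratio should be read as a coarse average rather than a measured throughput.

```python
# Figures quoted in the description (not independently measured).
TOTAL_PARAMS = 671e9       # total parameters
ACTIVE_PARAMS = 37e9       # parameters activated per token
TRAINING_TOKENS = 14.8e12  # training tokens
GPU_HOURS = 2.788e6        # H800 GPU hours quoted for training


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active / total


def tokens_per_gpu_hour(tokens: float, gpu_hours: float) -> float:
    """Coarse average: training tokens divided by quoted GPU hours."""
    return tokens / gpu_hours


frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
rate = tokens_per_gpu_hour(TRAINING_TOKENS, GPU_HOURS)

print(f"Active share of parameters per token: {frac:.1%}")
print(f"Average tokens per H800 GPU hour: {rate:,.0f}")
```

Under these figures, about 5.5% of the parameters are active for any given token, and the average works out to roughly 5.3 million tokens per GPU hour.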