Abstract page for arXiv paper 2412.19437: DeepSeek-V3 Technical Report

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated for each token. Its architecture combines Multi-head Latent Attention (MLA) with DeepSeekMoE and balances expert load without an auxiliary loss. The model was pre-trained on 14.8 trillion diverse, high-quality tokens, then refined with fine-tuning and reinforcement learning. According to the report, DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models, while requiring 2.788M H800 GPU hours for training.
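To put the headline figures in proportion, here is a minimal back-of-the-envelope sketch in Python. It uses only the numbers quoted in the summary above. The function names and the cluster size passed to `wall_clock_days` are illustrative assumptions, not values taken from this page. Dividing the full GPU-hour total by the token count is also a simplification, since the quoted total may cover stages beyond pre-training.

```python
# Figures quoted in the summary above.
TOTAL_PARAMS = 671e9     # total parameters
ACTIVE_PARAMS = 37e9     # parameters activated per token
TRAIN_TOKENS = 14.8e12   # training tokens
GPU_HOURS = 2.788e6      # H800 GPU hours for training


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_params / total_params


def gpu_hours_per_trillion_tokens(gpu_hours: float, tokens: float) -> float:
    """Rough cost per trillion tokens.

    Simplification: attributes every GPU hour to the training tokens.
    """
    return gpu_hours / (tokens / 1e12)


def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Idealized wall-clock time on a cluster of num_gpus GPUs.

    Assumes perfect utilization and no downtime; num_gpus is a
    hypothetical input chosen by the caller.
    """
    return gpu_hours / num_gpus / 24


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    per_trillion = gpu_hours_per_trillion_tokens(GPU_HOURS, TRAIN_TOKENS)
    print(f"Active share per token: {frac:.1%}")
    print(f"GPU hours per trillion tokens: {per_trillion:,.0f}")
    # 2048 GPUs is an assumed cluster size for illustration only.
    print(f"Days on 2048 GPUs: {wall_clock_days(GPU_HOURS, 2048):.1f}")
```

The active share works out to roughly one eighteenth of the total. This is the sense in which an MoE model keeps the compute per token well below what its total parameter count would suggest.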
