Multi-Token Prediction (MTP) in DeepSeek-V3 addresses the limitations of standard next-token autoregressive training by adding auxiliary prediction heads that forecast multiple future tokens simultaneously. Each MTP head combines the hidden state at position t with the embedding of the intermediate token, processes them through a mini-Transformer (using MLA attention and MoE), and projects to vocabulary logits. During training, all predictions are computed in parallel using ground-truth tokens, providing richer gradient signals that encourage forward-looking representations. At inference, only the main head is used, keeping deployment cost identical to a standard LM. The post covers the theoretical motivation, gradient analysis, loss weighting strategies (exponential decay or uniform 0.3 weighting), step-by-step PyTorch implementation, and integration with the broader DeepSeek-V3 architecture including RoPE, MLA, and MoE.
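To make the head structure concrete, here is a minimal PyTorch sketch of one MTP head as just described. It is illustrative only: the name `MTPHead`, the use of `nn.LayerNorm` (DeepSeek-V3 uses RMSNorm), and a stock `nn.TransformerEncoderLayer` standing in for the MLA + MoE block are simplifying assumptions, not the actual DeepSeek-V3 code.

```python
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """One auxiliary MTP head (simplified sketch).

    Fuses the backbone hidden state at each position with the embedding of the
    intermediate ground-truth token, runs the result through a small Transformer
    block, and projects to vocabulary logits. A stock TransformerEncoderLayer
    stands in for DeepSeek-V3's MLA + MoE block.
    """

    def __init__(self, d_model: int, vocab_size: int, n_heads: int = 8):
        super().__init__()
        self.norm_hidden = nn.LayerNorm(d_model)      # RMSNorm in the real model
        self.norm_embed = nn.LayerNorm(d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)   # merge hidden state + token embedding
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.to_logits = nn.Linear(d_model, vocab_size)

    def forward(self, hidden: torch.Tensor, tok_embed: torch.Tensor) -> torch.Tensor:
        # hidden, tok_embed: (batch, seq_len, d_model)
        fused = self.fuse(torch.cat(
            [self.norm_hidden(hidden), self.norm_embed(tok_embed)], dim=-1))
        seq_len = fused.size(1)
        causal_mask = torch.full(
            (seq_len, seq_len), float("-inf"), device=fused.device).triu(1)
        out = self.block(fused, src_mask=causal_mask)
        return self.to_logits(out)                    # (batch, seq_len, vocab_size)


# Training-time usage: the head that looks one extra token ahead is fed the
# embeddings of tokens shifted by one and scored against targets shifted by two.
if __name__ == "__main__":
    B, L, D, V = 2, 16, 64, 1000
    embed = nn.Embedding(V, D)
    head = MTPHead(D, V)
    tokens = torch.randint(0, V, (B, L))
    hidden = torch.randn(B, L, D)                            # backbone output for positions 0..L-1
    logits = head(hidden[:, :-2], embed(tokens[:, 1:-1]))    # position t predicts token t+2
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, V), tokens[:, 2:].reshape(-1))
```

During training this auxiliary loss is simply added to the main next-token loss with a small weight; at inference the head is dropped, so the deployed model is unchanged.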
Table of contents
Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3
Why Next-Token Prediction Limits DeepSeek-V3
Multi-Token Prediction in DeepSeek-V3: Predicting Multiple Tokens Ahead
DeepSeek-V3 Architecture: Multi-Token Prediction Heads Explained
Gradient Insights for Multi-Token Prediction in DeepSeek-V3
DeepSeek-V3 Training vs. Inference: How MTP Changes Both
Multi-Token Prediction Loss Weighting and Decay for DeepSeek-V3
Step-by-Step Implementation of Multi-Token Prediction Heads in DeepSeek-V3
Integrating Multi-Token Prediction with DeepSeek-V3’s Core Transformer
Theoretical Foundations: MTP, Curriculum Learning, and Auxiliary Tasks
Multi-Token Prediction Benefits: Coherence, Planning, and Faster Convergence
Summary