Multi-Token Prediction (MTP) in DeepSeek-V3 addresses the limitations of standard next-token autoregressive training by adding auxiliary prediction heads that forecast multiple future tokens simultaneously. Each MTP head combines the hidden state at position t with the embedding of the intermediate token, processes them through a mini-Transformer (using MLA attention and MoE), and projects to vocabulary logits. During training, all predictions are computed in parallel using ground-truth tokens, providing richer gradient signals that encourage forward-looking representations. At inference, only the main head is used, keeping deployment cost identical to a standard LM. The post covers the theoretical motivation, gradient analysis, loss weighting strategies (exponential decay or uniform 0.3 weighting), step-by-step PyTorch implementation, and integration with the broader DeepSeek-V3 architecture including RoPE, MLA, and MoE.
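To make the head structure concrete, here is a minimal PyTorch sketch of one MTP head as just described. It is illustrative only: the name `MTPHead`, the use of `nn.LayerNorm` (DeepSeek-V3 uses RMSNorm), and a stock `nn.TransformerEncoderLayer` standing in for the MLA + MoE block are simplifying assumptions, not the actual DeepSeek-V3 code.

```python
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """One auxiliary MTP head (simplified sketch).

    Fuses the backbone hidden state at each position with the embedding of the
    intermediate ground-truth token, runs the result through a small Transformer
    block, and projects to vocabulary logits. A stock TransformerEncoderLayer
    stands in for DeepSeek-V3's MLA + MoE block.
    """

    def __init__(self, d_model: int, vocab_size: int, n_heads: int = 8):
        super().__init__()
        self.norm_hidden = nn.LayerNorm(d_model)      # RMSNorm in the real model
        self.norm_embed = nn.LayerNorm(d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)   # merge hidden state + token embedding
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.to_logits = nn.Linear(d_model, vocab_size)

    def forward(self, hidden: torch.Tensor, tok_embed: torch.Tensor) -> torch.Tensor:
        # hidden, tok_embed: (batch, seq_len, d_model)
        fused = self.fuse(torch.cat(
            [self.norm_hidden(hidden), self.norm_embed(tok_embed)], dim=-1))
        seq_len = fused.size(1)
        causal_mask = torch.full(
            (seq_len, seq_len), float("-inf"), device=fused.device).triu(1)
        out = self.block(fused, src_mask=causal_mask)
        return self.to_logits(out)                    # (batch, seq_len, vocab_size)


# Training-time usage: the head that looks one extra token ahead is fed the
# embeddings of tokens shifted by one and scored against targets shifted by two.
if __name__ == "__main__":
    B, L, D, V = 2, 16, 64, 1000
    embed = nn.Embedding(V, D)
    head = MTPHead(D, V)
    tokens = torch.randint(0, V, (B, L))
    hidden = torch.randn(B, L, D)                            # backbone output for positions 0..L-1
    logits = head(hidden[:, :-2], embed(tokens[:, 1:-1]))    # position t predicts token t+2
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, V), tokens[:, 2:].reshape(-1))
```

During training this auxiliary loss is simply added to the main next-token loss with a small weight; at inference the head is dropped, so the deployed model is unchanged.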
Table of contents
Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3
Why Next-Token Prediction Limits DeepSeek-V3
Multi-Token Prediction in DeepSeek-V3: Predicting Multiple Tokens Ahead
DeepSeek-V3 Architecture: Multi-Token Prediction Heads Explained
Gradient Insights for Multi-Token Prediction in DeepSeek-V3
DeepSeek-V3 Training vs. Inference: How MTP Changes Both
Multi-Token Prediction Loss Weighting and Decay for DeepSeek-V3
Step-by-Step Implementation of Multi-Token Prediction Heads in DeepSeek-V3
Integrating Multi-Token Prediction with DeepSeek-V3’s Core Transformer
Theoretical Foundations: MTP, Curriculum Learning, and Auxiliary Tasks
Multi-Token Prediction Benefits: Coherence, Planning, and Faster Convergence
Summary