Why We’ve Been Optimizing the Wrong Thing in LLMs for Years
Multi-Token Prediction (MTP) challenges the standard next-token prediction approach in LLMs by training models to predict several future tokens at once. Research shows that transformers already encode future text trajectories in their hidden states; MTP turns this into an explicit training objective. The architecture uses a shared trunk with independent prediction heads, achieving up to 3x inference speedup through self-speculation and 17% better performance on coding benchmarks for larger models. While MTP excels at reasoning tasks, it underperforms on knowledge retrieval benchmarks. DeepSeek-V3 successfully deployed MTP in production, validating its practical benefits for reasoning and inference efficiency.
Table of contents
Introduction
Motivation For MTP
The MTP Architecture: Parallelizing Prediction
Overcoming the Memory Bottleneck
Critical Design Choices
Experimental Results: The Scale of Improvement
The Price of Foresight: Shortcomings and Trade-offs
Conclusion
References