A technique called multi-token prediction has been proposed for language models. Instead of predicting only the next token, the model is trained to predict several future tokens simultaneously, which improves performance on complex tasks. The technique reduces GPU memory usage during training and has shown promising results on both coding and natural-language benchmarks. It mitigates the discrepancy between teacher-forced training and autoregressive inference, implicitly assigns higher weight to critical decision points, and captures longer-term dependencies more effectively. Proposed future work includes automatically determining the optimal number of future tokens to predict and exploring alternative prediction losses.
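The training objective can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes a shared trunk representation feeding n independent linear output heads, where head i predicts the token i+1 steps ahead, and it averages a standard cross-entropy loss over the heads. The function and parameter names are invented for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_token_loss(hidden, heads, targets):
    """Cross-entropy averaged over n prediction heads (illustrative sketch).

    hidden:  (batch, d_model) shared-trunk representation at position t
    heads:   list of n (d_model, vocab) matrices; head i predicts token t+i+1
    targets: (batch, n) gold token ids for positions t+1 .. t+n
    """
    total = 0.0
    for i, W in enumerate(heads):
        probs = softmax(hidden @ W)                          # (batch, vocab)
        gold = probs[np.arange(targets.shape[0]), targets[:, i]]
        total += -np.log(gold + 1e-12).mean()                # per-head CE
    return total / len(heads)

# Toy usage: 4 sequences, model width 8, vocab of 16, predict 3 tokens ahead.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 8))
heads = [rng.standard_normal((8, 16)) * 0.1 for _ in range(3)]
targets = rng.integers(0, 16, size=(4, 3))
loss = multi_token_loss(hidden, heads, targets)
```

With near-uniform head outputs the loss sits close to log(vocab); during training it would be minimized jointly over all heads, so a single backward pass supervises several future positions at once.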