A new technique called multi-token prediction has been proposed for language models. It trains the model to predict several future tokens simultaneously rather than only the next one, which improves performance on complex tasks. The technique reduces GPU memory usage during training and has shown promising results on coding and natural-language tasks.
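As a rough illustration of the core idea, the sketch below computes a multi-token prediction loss: a shared trunk of hidden states feeds k independent output heads, where head i predicts the token i+1 positions ahead. All names, shapes, and sizes here are illustrative assumptions, not the exact architecture or hyperparameters of any published implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, k = 50, 8, 4  # toy vocabulary, hidden size, and horizon (assumptions)

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_token_loss(hidden, heads, targets):
    """Average cross-entropy over k future-token prediction heads.

    hidden : (T, d_model) trunk hidden states for T positions
    heads  : list of k (d_model, vocab) projection matrices, one per offset
    targets: (T, k) token ids; targets[t, i] is the token at position t + 1 + i
    """
    total, count = 0.0, 0
    for i, W in enumerate(heads):
        probs = softmax(hidden @ W)  # (T, vocab) distribution from head i
        # Cross-entropy of each position's prediction against its offset-i target.
        ce = -np.log(probs[np.arange(len(targets)), targets[:, i]] + 1e-12)
        total += ce.sum()
        count += len(targets)
    return total / count

# Toy forward pass with random states and targets.
T = 6
hidden = rng.normal(size=(T, d_model))
heads = [rng.normal(scale=0.1, size=(d_model, vocab)) for _ in range(k)]
targets = rng.integers(0, vocab, size=(T, k))
loss = multi_token_loss(hidden, heads, targets)
print(float(loss))
```

Because the heads share one trunk, the extra cost over ordinary next-token training is only the k small projection layers; in practice the per-head losses can also be computed sequentially so that only one head's logits are materialized at a time, which is one way such a scheme can keep GPU memory in check.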