A new research paper introduces LLaDA (Large Language Diffusion with Masking), a diffusion-based alternative to autoregressive language models. Instead of generating tokens one at a time from left to right, LLaDA uses a masking-based diffusion process: during training, tokens are randomly masked and a Transformer-based mask predictor learns to restore them. At inference, the model starts from a fully masked response and iteratively unmasks it through the reverse diffusion process, using a remasking strategy based either on prediction confidence or on semi-autoregressive block processing. Pre-trained on 2.3 trillion tokens and fine-tuned on 4.5 million supervised samples, the 8B-parameter LLaDA model is competitive with LLaMA 3 on several benchmarks, scales well on math tasks (GSM8K), and notably outperforms GPT-4o and Qwen 2.5 on reversal poem completion, a task where autoregressive models inherently struggle because of their left-to-right constraint.
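
The masking and unmasking loop described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `MASK` id, the linear unmasking schedule, and the function names are assumptions made for illustration, and `toy_predictor` is a random stand-in for the Transformer mask predictor. It also simplifies confidence-based remasking by keeping already-unmasked tokens fixed and committing only the most confident new predictions at each step.

```python
import random

MASK = -1  # hypothetical mask-token id, chosen for this sketch


def forward_mask(tokens, t, rng):
    """Training-time corruption: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def toy_predictor(seq, rng):
    """Random stand-in for the mask predictor (assumption, not a real model).

    Returns one (predicted_token, confidence) pair per position.
    """
    return [(rng.randrange(100), rng.random()) for _ in seq]


def unmask(length, steps, predictor, rng):
    """Start fully masked and unmask iteratively, leaving low-confidence positions masked."""
    seq = [MASK] * length
    for step in range(1, steps + 1):
        preds = predictor(seq, rng)
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        # Positions that should still be masked after this step (linear schedule).
        keep_masked = length * (steps - step) // steps
        # Commit the most confident predictions; the rest stay masked for later steps.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            seq[i] = preds[i][0]
    return seq


if __name__ == "__main__":
    rng = random.Random(0)
    print(forward_mask([10, 11, 12, 13], 0.5, rng))  # some tokens replaced by MASK
    print(unmask(8, 4, toy_predictor, rng))          # fully unmasked after 4 steps
```

With 8 positions and 4 steps, this schedule commits two tokens per step, so no position is left masked after the final step. Swapping `toy_predictor` for a real model would require it to return a predicted token and a confidence score per position, which is again an assumption of this sketch rather than a documented interface.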
