Diffusion LLMs represent a major architectural shift from autoregressive generation. Instead of generating tokens one at a time, which is memory-bandwidth bound, a diffusion LLM starts from a fully masked sequence and unmasks tokens in parallel over several denoising steps using bidirectional attention. This makes inference compute-bound and better suited to modern GPUs. The post covers the math behind masked diffusion, the ELBO training objective, the forward and reverse processes, unmasking strategies, block diffusion for KV cache compatibility, and engineering comparisons. Recent models such as LLaDA 8B match LLaMA 3 on MMLU, and Dream 7B is already in production, suggesting diffusion LLMs are becoming competitive with autoregressive approaches.
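
The unmasking loop described above can be illustrated with a toy sketch. This is not the post's code or any model's actual sampler: `toy_predict` is a hypothetical stand-in for a bidirectional denoiser (it simply reads the target and attaches a random confidence), and the "reveal the most confident positions first" rule is assumed here as one common unmasking strategy.

```python
import random

MASK = "<mask>"

def toy_predict(seq, target, rng):
    """Hypothetical stand-in for a bidirectional denoiser.

    Returns {position: (token, confidence)} for every masked position.
    A real model would predict tokens from context; this toy reads the
    target and assigns a random confidence.
    """
    return {i: (target[i], rng.random())
            for i, tok in enumerate(seq) if tok == MASK}

def unmask(target, steps, seed=0):
    """Start fully masked, then reveal tokens in parallel over `steps` rounds."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    history = [list(seq)]  # snapshot of the sequence after each round
    for step in range(steps):
        preds = toy_predict(seq, target, rng)
        if not preds:
            break  # nothing left to unmask
        # Spread the remaining masked positions evenly over the remaining
        # rounds (ceiling division), so the last round reveals everything.
        k = -(-len(preds) // (steps - step))
        # Assumed strategy: commit the k most confident predictions.
        chosen = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in chosen:
            seq[i] = preds[i][0]
        history.append(list(seq))
    return seq, history

if __name__ == "__main__":
    target = "diffusion models unmask many tokens at each step".split()
    final, history = unmask(target, steps=4)
    for snapshot in history:
        print(" ".join(snapshot))
```

Each round fills several positions at once, which is the contrast with autoregressive decoding, where every round would add exactly one token at the end of the sequence.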

2m read time | From blog.dailydoseofds.com
