Diffusion LLMs (dLLMs) offer an alternative to autoregressive text generation: they start with a fully masked sequence and unmask tokens in parallel using bidirectional attention, which shifts inference from memory-bandwidth-bound to compute-bound. Part 2 of this deep dive covers scaling training from 8B to 100B parameters; converting pre-trained autoregressive models like LLaMA into diffusion models via attention mask annealing; inference acceleration techniques (block-wise KV caching, confidence-aware parallel decoding, token editing); production serving with SGLang; and hands-on code for running Dream 7B and LLaDA 2.0. Benchmark results show LLaDA 8B matching LLaMA 3 on MMLU and exceeding it on TruthfulQA and HumanEval.
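To make the unmasking loop concrete, here is a toy sketch in plain Python. It is not the decoder used by LLaDA or Dream: `toy_model`, `parallel_unmask`, and the `threshold` parameter are hypothetical names, and the "model" simply returns the correct token with a random confidence. The sketch only illustrates the control flow of starting fully masked and committing, in parallel, every prediction whose confidence clears a threshold.

```python
import random

MASK = "<mask>"


def toy_model(tokens, target, rng):
    """Hypothetical stand-in for a dLLM forward pass.

    Returns {position: (predicted_token, confidence)} for every masked
    position. A real model would derive both from bidirectional attention
    over the whole sequence; here the confidence is just a random number.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def parallel_unmask(target, threshold=0.6, seed=0):
    """Iteratively unmask a sequence, several tokens per step."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)  # start from a fully masked sequence
    steps = 0
    while MASK in tokens:
        preds = toy_model(tokens, target, rng)
        # Commit every prediction whose confidence clears the threshold.
        accepted = {i: t for i, (t, c) in preds.items() if c >= threshold}
        if not accepted:
            # Always commit the single most confident position so that
            # decoding makes progress on every step.
            best = max(preds, key=lambda i: preds[i][1])
            accepted = {best: preds[best][0]}
        for i, t in accepted.items():
            tokens[i] = t
        steps += 1
    return tokens, steps


if __name__ == "__main__":
    target = "diffusion models unmask tokens in parallel".split()
    tokens, steps = parallel_unmask(target)
    print(" ".join(tokens), f"({steps} steps for {len(target)} tokens)")
```

Because each step commits at least one token, the loop never needs more steps than there are tokens, and it usually needs fewer. That gap is the source of the parallel-decoding speedup.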
Table of contents
Why care?