NVIDIA introduces Nemotron-Labs Diffusion, a family of diffusion language models (3B, 8B, and 14B text models plus an 8B VLM) that generate text by producing multiple tokens in parallel and iteratively refining them, rather than emitting one token at a time. A single checkpoint supports three inference modes: standard autoregressive (AR), diffusion (FastDiffuser), and self-speculation (LinearSpec). Self-speculation reaches ~865 tok/s on B200 hardware, roughly 4× the AR baseline, while keeping output lossless at temperature 0. The models are trained by converting pretrained AR models through continued pretraining with a joint AR+diffusion objective. Deployment is supported via SGLang, and all models, training code, and a technical report are publicly available.
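To make the "lossless at temperature 0" claim concrete, here is a minimal sketch of greedy draft-and-verify decoding on a toy model. It is not NVIDIA's LinearSpec implementation: `toy_next_token`, `toy_draft`, and `self_speculative_greedy` are hypothetical names invented for illustration, and the toy "model" is just an arithmetic rule. The idea it shows is that a block of drafted tokens is kept only where it matches what greedy AR decoding would have produced, so the final text is identical to the AR output while fewer verification rounds are needed.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]         # greedy AR step (assumed interface)
DraftFn = Callable[[List[Token], int], List[Token]]  # parallel drafter (assumed interface)


def toy_next_token(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for a greedy (temperature 0) AR forward pass."""
    return (3 * ctx[-1] + len(ctx)) % 11


def toy_draft(ctx: List[Token], k: int) -> List[Token]:
    """Hypothetical stand-in for a diffusion-style drafter proposing k tokens at once.
    It is deliberately wrong at the third position so that rejection is exercised."""
    tmp, proposal = list(ctx), []
    for i in range(k):
        tok = toy_next_token(tmp)
        if i == 2:
            tok = (tok + 1) % 11  # injected drafting error
        proposal.append(tok)
        tmp.append(tok)
    return proposal


def ar_greedy(next_token: NextTokenFn, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: generate n tokens one at a time."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def self_speculative_greedy(
    next_token: NextTokenFn, draft: DraftFn, prompt: List[Token], n: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Draft k tokens per round and keep only what greedy AR would have produced."""
    if k < 1:
        raise ValueError("k must be at least 1")
    seq, out, verify_rounds = list(prompt), [], 0
    while len(out) < n:
        proposal = draft(seq, k)
        # A real system would score all k positions in one batched forward pass;
        # this sketch calls next_token per position for clarity.
        verify_rounds += 1
        for tok in proposal:
            target = next_token(seq)  # the token greedy AR decoding would emit here
            seq.append(target)        # equals tok on a match, the correction otherwise
            out.append(target)
            if tok != target or len(out) == n:
                break                 # discard the rest of the draft after a mismatch
    return out, verify_rounds


prompt = [1, 2]
baseline = ar_greedy(toy_next_token, prompt, 12)
spec, rounds = self_speculative_greedy(toy_next_token, toy_draft, prompt, 12, k=4)
print(spec == baseline, rounds)
```

In this toy setup each round accepts two drafted tokens and one correction, so 12 tokens take 4 verification rounds instead of 12 sequential steps. Real speedups depend on how often the drafter agrees with the AR pass and on hardware, which this sketch does not model.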
Table of contents
Quick Links to the Models, Training Recipe and Technical Report
Three Generation Modes in One Model
Performance Highlights
How we trained Nemotron-Labs Diffusion
Deployment and inference through SGLang
Get Started Today