NVIDIA introduces Nemotron-Labs Diffusion, a family of diffusion language models (3B, 8B, 14B text and 8B VLM) that generate text by producing multiple tokens in parallel and iteratively refining them, rather than one token at a time. The models support three inference modes: standard autoregressive, diffusion (FastDiffuser), and self-speculation (LinearSpec), all from the same checkpoint. Self-speculation achieves ~865 tok/s on B200 hardware — roughly 4× the AR baseline — while maintaining lossless output at temperature 0. The models are trained by converting pretrained AR models via continued pretraining with a joint AR+diffusion objective. Deployment is supported via SGLang, and all models, training code, and a technical report are publicly available.
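The lossless claim at temperature 0 matches how draft-and-verify decoding behaves in general: a block of tokens is proposed in parallel, the autoregressive pass checks each one, and only tokens matching the greedy AR choice are kept. The sketch below illustrates that idea with toy stand-ins. It assumes a generic greedy verification scheme; the summary does not describe the actual LinearSpec procedure, and `toy_next`, `toy_draft`, and `speculative_decode` are hypothetical names, not part of the Nemotron-Labs Diffusion or SGLang APIs.

```python
from typing import Callable, List

Token = int


def toy_next(seq: List[Token]) -> Token:
    """Hypothetical deterministic 'AR model': greedy next token for a prefix."""
    return (seq[-1] * 3 + len(seq)) % 7


def toy_draft(seq: List[Token], k: int) -> List[Token]:
    """Hypothetical drafter: guesses k tokens, deliberately wrong at some positions."""
    tmp, out = list(seq), []
    for _ in range(k):
        tok = toy_next(tmp)
        if len(tmp) % 3 == 0:  # inject an error so verification has work to do
            tok = (tok + 1) % 7
        out.append(tok)
        tmp.append(tok)
    return out


def greedy_ar_decode(next_token: Callable[[List[Token]], Token],
                     prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one token per step."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def speculative_decode(next_token: Callable[[List[Token]], Token],
                       draft: Callable[[List[Token], int], List[Token]],
                       prompt: List[Token], n_new: int, block: int = 4) -> List[Token]:
    """Draft a block, keep the longest verified prefix, then correct the first miss."""
    seq = list(prompt)
    target_len = len(prompt) + n_new
    while len(seq) < target_len:
        proposal = draft(seq, block)
        if not proposal:  # guard against an empty draft
            seq.append(next_token(seq))
            continue
        for tok in proposal:
            # A real system would verify all drafted positions in one forward pass;
            # here the verifier is called per position for clarity.
            expected = next_token(seq)
            seq.append(expected)  # equals tok when accepted, else the correction
            if tok != expected or len(seq) >= target_len:
                break  # discard the rest of the block after a mismatch
    return seq[:target_len]


prompt = [1, 2]
baseline = greedy_ar_decode(toy_next, prompt, 20)
speculated = speculative_decode(toy_next, toy_draft, prompt, 20)
print(speculated == baseline)
```

Because every appended token is the verifier's own greedy choice, the output is identical to plain greedy decoding regardless of how good the drafter is; the drafter's quality only affects how many tokens are accepted per verification step, which is where any speedup would come from.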

6m read time, from huggingface.co
Table of contents
Quick Links to the Models, Training Recipe and Technical Report
Three Generation Modes in One Model
Performance Highlights
How we trained Nemotron-Labs Diffusion
Deployment and inference through SGLang
Get Started Today
