Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

NVIDIA introduces Nemotron-Labs Diffusion, a family of diffusion language models (3B, 8B, and 14B text models plus an 8B VLM) that generate text by producing multiple tokens in parallel and iteratively refining them, rather than emitting one token at a time. A single checkpoint supports three inference modes: standard autoregressive (AR), diffusion (FastDiffuser), and self-speculation (LinearSpec). Self-speculation reaches ~865 tok/s on B200 hardware, roughly 4× the AR baseline, while keeping output lossless at temperature 0. The models are trained by converting pretrained AR models through continued pretraining with a joint AR+diffusion objective. Deployment is supported via SGLang, and all models, training code, and a technical report are publicly available.
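To make the "lossless at temperature 0" claim concrete, here is a minimal sketch of generic greedy draft-and-verify decoding. It is not the LinearSpec implementation, which this summary does not describe. It assumes only the general shape of speculation: a fast drafter proposes a block of tokens, and the AR path keeps the longest prefix that matches its own greedy choices. All names (`verify_greedy`, `toy_target_next`, `toy_draft_block`) and both toy models are hypothetical.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]          # greedy next-token rule
DraftFn = Callable[[List[Token]], List[Token]]   # proposes a block of tokens


def verify_greedy(prefix: List[Token], draft: List[Token],
                  target_next: NextFn) -> Tuple[List[Token], int]:
    """Keep draft tokens while they match the target's greedy choice.

    At the first mismatch, the target's own token is emitted instead.
    A real system would score all draft positions in one parallel pass;
    this loop is sequential only for readability.
    """
    out = list(prefix)
    accepted = 0
    for tok in draft:
        expected = target_next(out)
        if tok != expected:
            out.append(expected)      # correct the drafter and stop
            return out, accepted
        out.append(tok)
        accepted += 1
    out.append(target_next(out))      # whole draft accepted: one bonus token
    return out, accepted


def speculative_generate(prompt: List[Token], draft_block: DraftFn,
                         target_next: NextFn, max_new_tokens: int) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out, _ = verify_greedy(out, draft_block(out), target_next)
    return out[:len(prompt) + max_new_tokens]


def greedy_generate(prompt: List[Token], target_next: NextFn,
                    max_new_tokens: int) -> List[Token]:
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


# Hypothetical toy models, for illustration only.
def toy_target_next(seq: List[Token]) -> Token:
    return (sum(seq) * 31 + len(seq)) % 97


def toy_draft_block(seq: List[Token], block: int = 4) -> List[Token]:
    cur, draft = list(seq), []
    for i in range(block):
        tok = toy_target_next(cur)
        if i == 2:
            tok = (tok + 1) % 97      # deliberate drafting error
        draft.append(tok)
        cur.append(tok)
    return draft


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec = speculative_generate(prompt, toy_draft_block, toy_target_next, 10)
    print(spec == greedy_generate(prompt, toy_target_next, 10))
```

Because every emitted token is either a draft token that equals the target's greedy choice or the target's own token, the result matches plain greedy decoding exactly. Any speedup comes only from how many draft tokens are accepted per verification step.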

#llm

May 23 • 6m read time • From huggingface.co

Table of contents

Quick Links to the Models, Training Recipe and Technical Report
Three Generation Modes in One Model
Performance Highlights
How we trained Nemotron-Labs Diffusion
Deployment and inference through SGLang
Get Started Today
