NVIDIA has released Nemotron 3 Nano Omni, a 30B-parameter omni-modal model supporting text, image, video, and audio understanding. Built on a hybrid Mamba-Transformer Mixture-of-Experts backbone with a C-RADIOv4-H vision encoder and a Parakeet-TDT-0.6B-v2 audio encoder, it targets five workloads: long-document analysis (100+ pages), automatic speech recognition, long audio-video understanding, agentic computer use (GUI automation), and general multimodal reasoning. Key architectural innovations include dynamic-resolution processing (up to 13,312 visual patches per image), Conv3D tubelet embedding for video token compression, and Efficient Video Sampling (EVS) to prune redundant frames at inference time. Training used staged multimodal alignment, preference optimization, and multimodal reinforcement learning across H100/B200 clusters. The model leads several benchmarks, including MMLongBench-Doc (57.5), VoiceBench (89.4), and WorldSense (55.4), while delivering up to 9.2x higher system efficiency than alternatives on video workloads. BF16, FP8, and NVFP4 checkpoints are available on Hugging Face.
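To make the video-token compression concrete, below is a minimal PyTorch sketch of a Conv3D tubelet embedding. The embedding dimension and tubelet size here are illustrative assumptions, not Nemotron's actual configuration: a non-overlapping 3D convolution turns each space-time block of pixels into a single token, so temporal depth directly divides the token count.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Compress a video clip into tokens with a single 3D convolution.

    Each kernel application covers one (t, h, w) "tubelet" of pixels, so a
    clip of shape (C, T, H, W) yields (T/t) * (H/h) * (W/w) tokens instead
    of one token per patch per frame.
    """
    def __init__(self, in_channels=3, embed_dim=1024, tubelet=(2, 16, 16)):
        super().__init__()
        # Stride equals kernel size, so tubelets do not overlap.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, video):  # video: (B, C, T, H, W)
        x = self.proj(video)                 # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D) token sequence

# Example: 16 frames at 224x224 with a temporal tubelet of 2 give
# 8 * 14 * 14 = 1568 tokens, half of what per-frame patching would produce.
tokens = TubeletEmbedding()(torch.randn(1, 3, 16, 224, 224))
print(tokens.shape)  # torch.Size([1, 1568, 1024])
```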
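EVS prunes redundant visual content before the language backbone sees it. The sketch below is a simplified, frame-level illustration of that idea, not the actual EVS algorithm: it keeps only frames whose features differ enough from the last kept frame, with a threshold chosen purely for demonstration.

```python
import torch
import torch.nn.functional as F

def prune_redundant_frames(frames, threshold=0.95):
    """Drop frames whose features are near-duplicates of the last kept frame.

    frames: (T, D) per-frame feature vectors (e.g. pooled vision-encoder
    outputs). Returns the indices of the frames to keep.
    """
    feats = F.normalize(frames, dim=-1)
    keep = [0]  # always keep the first frame
    for t in range(1, feats.shape[0]):
        # Cosine similarity against the most recently kept frame.
        if feats[t] @ feats[keep[-1]] < threshold:
            keep.append(t)
    return torch.tensor(keep)

# Usage: a static (duplicated) frame is pruned, the rest survive.
feats = torch.randn(8, 512)
feats[3] = feats[2]                   # frame 3 duplicates frame 2
print(prune_redundant_frames(feats))  # tensor([0, 1, 2, 4, 5, 6, 7])
```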

Table of contents

- What Nemotron 3 Nano Omni is designed for
- Model architecture and key innovations
- Training data, infrastructure and systems story
- Example workflows
- Getting started with Nemotron 3 Nano Omni
- References
