NVIDIA has released Nemotron 3 Nano Omni, a 30B-parameter omni-modal model supporting text, image, video, and audio understanding. Built on a hybrid Mamba-Transformer Mixture-of-Experts backbone with a C-RADIOv4-H vision encoder and a Parakeet-TDT-0.6B-v2 audio encoder, it targets five workloads: long-document analysis (100+ pages), automatic speech recognition, long audio-video understanding, agentic computer use (GUI automation), and general multimodal reasoning. Key architectural innovations include dynamic-resolution processing (up to 13,312 visual patches per image), Conv3D tubelet embedding for video token compression, and Efficient Video Sampling (EVS) to prune redundant frames at inference time. Training used staged multimodal alignment, preference optimization, and multimodal reinforcement learning across H100/B200 clusters. The model leads several benchmarks, including MMLongBench-Doc (57.5), VoiceBench (89.4), and WorldSense (55.4), while delivering up to 9.2x higher system efficiency than alternatives on video workloads. BF16, FP8, and NVFP4 checkpoints are available on Hugging Face.
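To make the video-token compression concrete, below is a minimal PyTorch sketch of a Conv3D tubelet embedding. The embedding dimension and tubelet size here are illustrative assumptions, not Nemotron's actual configuration: a non-overlapping 3D convolution turns each space-time block of pixels into a single token, so temporal depth directly divides the token count.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Compress a video clip into tokens with a single 3D convolution.

    Each kernel application covers one (t, h, w) "tubelet" of pixels, so a
    clip of shape (C, T, H, W) yields (T/t) * (H/h) * (W/w) tokens instead
    of one token per patch per frame.
    """
    def __init__(self, in_channels=3, embed_dim=1024, tubelet=(2, 16, 16)):
        super().__init__()
        # Stride equals kernel size, so tubelets do not overlap.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, video):  # video: (B, C, T, H, W)
        x = self.proj(video)                 # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D) token sequence

# Example: 16 frames at 224x224 with a temporal tubelet of 2 give
# 8 * 14 * 14 = 1568 tokens, half of what per-frame patching would produce.
tokens = TubeletEmbedding()(torch.randn(1, 3, 16, 224, 224))
print(tokens.shape)  # torch.Size([1, 1568, 1024])
```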
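EVS prunes redundant visual content before the language backbone sees it. The sketch below is a simplified, frame-level illustration of that idea, not the actual EVS algorithm: it keeps only frames whose features differ enough from the last kept frame, with a threshold chosen purely for demonstration.

```python
import torch
import torch.nn.functional as F

def prune_redundant_frames(frames, threshold=0.95):
    """Drop frames whose features are near-duplicates of the last kept frame.

    frames: (T, D) per-frame feature vectors (e.g. pooled vision-encoder
    outputs). Returns the indices of the frames to keep.
    """
    feats = F.normalize(frames, dim=-1)
    keep = [0]  # always keep the first frame
    for t in range(1, feats.shape[0]):
        # Cosine similarity against the most recently kept frame.
        if feats[t] @ feats[keep[-1]] < threshold:
            keep.append(t)
    return torch.tensor(keep)

# Usage: a static (duplicated) frame is pruned, the rest survive.
feats = torch.randn(8, 512)
feats[3] = feats[2]                   # frame 3 duplicates frame 2
print(prune_redundant_frames(feats))  # tensor([0, 1, 2, 4, 5, 6, 7])
```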

Table of contents

- What Nemotron 3 Nano Omni is designed for
- Model architecture and key innovations
- Training data, infrastructure and systems story
- Example workflows
- Getting started with Nemotron 3 Nano Omni
- References
