NVIDIA has released Nemotron 3 Nano Omni, an open 30B-parameter hybrid Mixture-of-Experts model with roughly 3B active parameters (30B-A3B), designed to unify multimodal reasoning across text, image, video, and audio in a single model. It replaces fragmented vision-language-audio stacks in agentic systems, reducing both orchestration complexity and inference cost. The model achieves top scores on document intelligence benchmarks (MMLongBench-Doc, OCRBench v2) and video and audio understanding benchmarks (WorldSense, DailyOmni, VoiceBench), while delivering up to 9.2× higher system throughput for video reasoning than alternative open omni models. Built on a hybrid Mamba-transformer MoE architecture with FP8/NVFP4 quantization support, it is fully open, with weights, datasets, and training recipes available on Hugging Face. Deployment cookbooks are provided for vLLM, SGLang, TensorRT-LLM, and Dynamo, along with fine-tuning recipes using LoRA SFT and GRPO/DAPO via NeMo RL.
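As a quick illustration of the vLLM deployment path, the sketch below loads the model for offline text generation. This is a minimal sketch, not the official cookbook: the Hugging Face repo ID, sampling settings, and prompt are assumptions, and the published cookbooks cover the multimodal inputs and FP8/NVFP4 serving options that this snippet omits.

```python
# Minimal sketch: offline generation with vLLM.
# NOTE: the repo ID below is an assumption -- check the official model card
# on Hugging Face for the exact name and any required serving flags.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Nemotron-3-Nano-Omni",  # assumed repo ID
    trust_remote_code=True,               # custom hybrid Mamba-transformer MoE architecture
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["Summarize the key findings of the attached earnings report."],
    params,
)
print(outputs[0].outputs[0].text)
```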
Table of contents
- Best-in-class efficiency and accuracy
- What’s under the hood of Nemotron 3 Nano Omni?
- Open by design: Weights, data, and recipes
- Claws powered by Nemotron 3 Nano Omni
- Get started with Nemotron 3 Nano Omni