NVIDIA has released Nemotron 3 Nano Omni, an open 30B-parameter hybrid Mixture-of-Experts model with roughly 3B active parameters (30B-A3B), designed to unify multimodal reasoning across text, image, video, and audio in a single model. It replaces fragmented vision-language-audio stacks in agentic systems, reducing both orchestration complexity and inference cost. The model achieves top scores on document intelligence benchmarks (MMLongBench-Doc, OCRBench v2) and video and audio understanding benchmarks (WorldSense, DailyOmni, VoiceBench), while delivering up to 9.2× higher system throughput for video reasoning than alternative open omni models. Built on a hybrid Mamba-transformer MoE architecture with FP8/NVFP4 quantization support, it is fully open, with weights, datasets, and training recipes available on Hugging Face. Deployment cookbooks are provided for vLLM, SGLang, TensorRT-LLM, and Dynamo, along with fine-tuning recipes using LoRA SFT and GRPO/DAPO via NeMo RL.
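As a quick illustration of the vLLM deployment path, the sketch below loads the model for offline text generation. This is a minimal sketch, not the official cookbook: the Hugging Face repo ID, sampling settings, and prompt are assumptions, and the published cookbooks cover the multimodal inputs and FP8/NVFP4 serving options that this snippet omits.

```python
# Minimal sketch: offline generation with vLLM.
# NOTE: the repo ID below is an assumption -- check the official model card
# on Hugging Face for the exact name and any required serving flags.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Nemotron-3-Nano-Omni",  # assumed repo ID
    trust_remote_code=True,               # custom hybrid Mamba-transformer MoE architecture
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["Summarize the key findings of the attached earnings report."],
    params,
)
print(outputs[0].outputs[0].text)
```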
Table of contents
- Best-in-class efficiency and accuracy
- What’s under the hood of Nemotron 3 Nano Omni?
- Open by design: Weights, data, and recipes
- Claws powered by Nemotron 3 Nano Omni
- Get started with Nemotron 3 Nano Omni