NVIDIA's Nemotron 3 Nano Omni is a new open multimodal model with a hybrid MoE Transformer-Mamba architecture featuring 30B total parameters but only 3B active per forward pass. It unifies vision, audio, and language perception in a single model, eliminating the need for fragmented multimodal stacks in agentic workflows. The model achieves up to 9.2x higher throughput than comparable open omni models, supports FP8 and NVFP4 quantization, and handles 256K context. vLLM now supports serving it via an OpenAI-compatible API, with cookbooks and Brev launchables available for quick deployment. It tops six multimodal leaderboards covering document intelligence, video, OCR, and audio benchmarks.

6m read timeFrom vllm.ai
Post cover image
Table of contents
TL;DR: About Nemotron 3 Nano OmniRun Optimized Multimodal Inference with vLLMHighest Efficiency with Leading Accuracy for Multimodal Agentic ApplicationsGet StartedAcknowledgement

Sort: