We are excited to support the newly released NVIDIA Nemotron 3 Nano Omni model on vLLM.

vLLM

NVIDIA's Nemotron 3 Nano Omni is a new open multimodal model with a hybrid MoE Transformer-Mamba architecture featuring 30B total parameters but only 3B active per forward pass. It unifies vision, audio, and language perception in a single model, eliminating the need for fragmented multimodal stacks in agentic workflows. The model achieves up to 9.2x higher throughput than comparable open omni models, supports FP8 and NVFP4 quantization, and handles 256K context. vLLM now supports serving it via an OpenAI-compatible API, with cookbooks and Brev launchables available for quick deployment. It tops six multimodal leaderboards covering document intelligence, video, OCR, and audio benchmarks.

Run Highly Efficient Multimodal Agentic AI with NVIDIA Nemotron 3 Nano Omni Using vLLM

Run Optimized Multimodal Inference with vLLM

Highest Efficiency with Leading Accuracy for Multimodal Agentic Applications