We are excited to support the newly released NVIDIA Nemotron 3 Super model on vLLM.

vLLM

NVIDIA Nemotron 3 Super, a 120B parameter hybrid MoE model with only 12B active parameters at inference, is now supported on vLLM. Designed for multi-agent AI applications, it features a 1 million token context window to address context explosion and a hybrid Transformer-Mamba architecture delivering up to 4x higher throughput to reduce reasoning costs. NVFP4 precision on Blackwell GPUs achieves 4x higher throughput vs FP8 on H100. Model weights are available on Hugging Face in BF16, FP8, and NVFP4 formats, and can be served via vLLM's OpenAI-compatible API.

Run Highly Efficient and Accurate Multi-Agent AI with NVIDIA Nemotron 3 Super Using vLLM

Highest efficiency with leading accuracy for multi-agent applications