NVIDIA Nemotron 3 Super, a 120B parameter hybrid MoE model with only 12B active parameters at inference, is now supported on vLLM. Designed for multi-agent AI applications, it features a 1 million token context window to address context explosion and a hybrid Transformer-Mamba architecture delivering up to 4x higher throughput to reduce reasoning costs. NVFP4 precision on Blackwell GPUs achieves 4x higher throughput vs FP8 on H100. Model weights are available on Hugging Face in BF16, FP8, and NVFP4 formats, and can be served via vLLM's OpenAI-compatible API.
Table of contents
About Nemotron 3 SuperRun optimized inference with vLLMHighest efficiency with leading accuracy for multi-agent applicationsGet startedAcknowledgementSort: