NVIDIA released Nemotron 3 Nano, a 30B-parameter MoE model that runs on 24GB of VRAM with only 3.6B parameters active during inference. The model features a 1M-token context window, built-in reasoning via special tokens, and native tool calling. Setup involves cloning llama.cpp, building it with CUDA, and pulling the GGUF from Hugging Face.
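A minimal sketch of those setup steps using llama.cpp's standard CMake build and its `-hf` flag for fetching GGUFs; the Hugging Face repo ID and quantization tag are placeholders, since the article does not specify them:

```shell
# Clone and build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Pull the GGUF directly from Hugging Face and start an interactive chat.
# Replace <org>/<repo>:<quant> with the actual model repo and quant tag.
./build/bin/llama-cli -hf <org>/<repo>:<quant>
```

With 3.6B active parameters, even a Q4-class quantization of the full 30B weights should fit comfortably in 24GB of VRAM alongside KV cache for moderate contexts.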
