NVIDIA released Nemotron 3 Nano, a 30B-parameter MoE model that runs on 24GB of VRAM with only 3.6B parameters active during inference. The model features a 1M-token context window, built-in reasoning with special tokens, and native tool calling. Setup involves cloning llama.cpp, building it with CUDA support, and pulling the GGUF from Hugging Face.
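The setup steps above can be sketched roughly as follows. This is a minimal, hedged outline: the llama.cpp clone URL and build flags are standard for that project, but the exact Hugging Face repository and GGUF filename for Nemotron 3 Nano are placeholders, not confirmed names.

```shell
# Clone and build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Pull the GGUF and serve it; the repo/file below is a placeholder,
# substitute the actual Nemotron 3 Nano GGUF repository on Hugging Face.
./build/bin/llama-server -hf <org>/<nemotron-3-nano-gguf-repo> -c 32768
```

The `-c` flag caps the context length; even though the model supports 1M tokens, the KV cache for very long contexts can exceed 24GB, so starting smaller is a reasonable default.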