NVIDIA released Nemotron 3 Nano, a 30B-parameter MoE model that runs on 24GB of VRAM with only 3.6B parameters active during inference. The model features a 1M-token context window, built-in reasoning via special tokens, and native tool calling. Setup involves cloning llama.cpp, building it with CUDA, and pulling the GGUF from Hugging Face.
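A minimal sketch of those setup steps using llama.cpp's standard CMake build and its `-hf` flag for fetching GGUFs; the Hugging Face repo ID and quantization tag are placeholders, since the article does not specify them:

```shell
# Clone and build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Pull the GGUF directly from Hugging Face and start an interactive chat.
# Replace <org>/<repo>:<quant> with the actual model repo and quant tag.
./build/bin/llama-cli -hf <org>/<repo>:<quant>
```

With 3.6B active parameters, even a Q4-class quantization of the full 30B weights should fit comfortably in 24GB of VRAM alongside KV cache for moderate contexts.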
