A data-backed exploration of running LLMs locally on NVIDIA's DGX Spark (GB10 Grace Blackwell superchip) with 128GB of unified memory. Using vLLM with various quantized models from 1.5B to 14B parameters, benchmarks show the 1.5B model reaching 61.73 tokens/sec, while the 14B NVFP4 (4-bit floating-point quantization) model still reaches 20.19 tokens/sec, with a time-to-first-token 3.4x faster than that of the unquantized 14B base model. Key insight: memory capacity ≠ memory bandwidth, and quantization format matters as much as hardware choice. The setup uses Docker-isolated, reproducible benchmarking with automated warm-up runs and GPU metrics logging, enabling a local-first workflow that mirrors production data center environments.
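The two metrics quoted above, time-to-first-token and tokens/sec, along with the warm-up runs, can be illustrated with a minimal Python sketch. It is not the benchmark code from the source: the names `measure_stream` and `benchmark` are hypothetical, and it assumes the model is exposed as any callable that returns an iterator of streamed tokens. It also treats each streamed chunk as one token, which is only an approximation for real servers.

```python
import time
from statistics import mean
from typing import Callable, Iterable


def measure_stream(stream: Iterable[str], clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time one streamed generation (hypothetical helper, not from the source)."""
    start = clock()
    first_token_at = None
    tokens = 0
    for _ in stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # arrival time of the first token
        tokens += 1
    end = clock()
    total = end - start
    return {
        "tokens": tokens,
        # Time-to-first-token: delay between the request and the first token.
        "ttft_s": (first_token_at - start) if first_token_at is not None else float("nan"),
        # Throughput over the whole request, including the wait for the first token.
        "tokens_per_s": tokens / total if total > 0 else 0.0,
    }


def benchmark(make_stream: Callable[[], Iterable[str]], warmup: int = 2, runs: int = 5,
              clock: Callable[[], float] = time.perf_counter) -> dict:
    """Run `warmup` discarded generations, then average `runs` measured ones."""
    for _ in range(warmup):
        for _ in make_stream():  # warm-up: consume the stream, record nothing
            pass
    results = [measure_stream(make_stream(), clock) for _ in range(runs)]
    return {
        "runs": len(results),
        "ttft_s": mean(r["ttft_s"] for r in results),
        "tokens_per_s": mean(r["tokens_per_s"] for r in results),
    }
```

To use it, pass a zero-argument function that starts a fresh streamed request each time it is called, for example a wrapper around a streaming client for a locally served model. Discarding the warm-up runs keeps one-time costs such as kernel compilation and cache population out of the averages.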

10m watch time
