Moving LLM workloads from the cloud to local infrastructure requires a shift in engineering strategy. In this talk, I share my journey of serving and benchmarking open-source models (1.5B to 14B parameters) on an NVIDIA DGX Spark workstation. Using a reproducible methodology with vLLM, I analyze real-world trade-offs in throughput and latency, and the benefits of the 128GB Grace Blackwell unified memory architecture. You will leave with a clear framework for local model sizing, an understanding of how quantization formats such as NVFP4 perform, and a guide for when local compute is the right choice for your AI stack.
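As a rough illustration of what a sizing framework can look like, the sketch below estimates the weight footprint of a model and a bandwidth-bound ceiling on single-stream decode speed. It is a back-of-the-envelope model, not the methodology used in the talk. Assumptions not taken from the abstract: a memory bandwidth of 273 GB/s for the DGX Spark, roughly 4.5 effective bits per parameter for NVFP4 (4-bit values plus per-block scales), and the simplification that each generated token reads every weight once. KV cache, activations, and runtime overhead are ignored.

```python
# Back-of-the-envelope sizing sketch. All constants are assumptions for
# illustration, not figures from the talk.

ASSUMED_BANDWIDTH_GB_S = 273.0   # assumed DGX Spark memory bandwidth
BITS_BF16 = 16.0                 # unquantized 16-bit weights
BITS_NVFP4 = 4.5                 # assumed: 4-bit values plus per-block scales


def weight_gib(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GiB (weights only)."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30


def decode_ceiling_tok_s(params_billion: float, bits_per_param: float,
                         bandwidth_gb_s: float = ASSUMED_BANDWIDTH_GB_S) -> float:
    """Upper bound on tokens/sec if each token reads all weights once."""
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


if __name__ == "__main__":
    for name, params, bits in [
        ("1.5B BF16", 1.5, BITS_BF16),
        ("14B BF16", 14.0, BITS_BF16),
        ("14B NVFP4", 14.0, BITS_NVFP4),
    ]:
        print(f"{name:10s} weights ~{weight_gib(params, bits):6.2f} GiB, "
              f"decode ceiling ~{decode_ceiling_tok_s(params, bits):6.1f} tok/s")
```

Under these assumptions every model in the 1.5B to 14B range fits comfortably in 128GB, yet the decode ceiling drops sharply as the weights grow. Capacity decides what can be loaded; bandwidth decides how fast it runs.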

Speaker info:
- LinkedIn https://www.linkedin.com/in/mozhgankch/

AI Engineer

A data-backed exploration of running LLMs locally on NVIDIA's DGX Spark (GB10 Grace Blackwell superchip) with 128GB unified memory. With vLLM serving various quantized models from 1.5B to 14B parameters, the benchmarks show that the 1.5B model achieves 61.73 tokens/sec, while the 14B NVFP4 (4-bit floating-point quantization) model still reaches 20.19 tokens/sec, with a time-to-first-token 3.4x faster than the unquantized 14B base model. Key insight: memory capacity ≠ memory bandwidth, and quantization format is as important as hardware choice. The setup uses Docker-isolated, reproducible benchmarking with automated warm-up runs and GPU metrics logging, enabling a local-first workflow that mirrors production data center environments.
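To make the two headline metrics concrete, here is a minimal sketch of how time-to-first-token and tokens/sec can be derived from per-request timestamps while discarding warm-up runs. The `RunTiming` record, the `summarize` helper, and the choice of two warm-up runs are hypothetical names and values for illustration; the talk's actual harness and its exact metric definitions may differ (for example, throughput could be measured over the decode phase only).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunTiming:
    """Timestamps (seconds) for one request; a hypothetical record format."""
    start_s: float         # request sent
    first_token_s: float   # first output token received
    end_s: float           # last output token received
    output_tokens: int     # number of generated tokens


def summarize(runs: list[RunTiming], warmup: int = 2) -> dict[str, float]:
    """Average TTFT and end-to-end tokens/sec, skipping the first `warmup` runs."""
    measured = runs[warmup:]
    if not measured:
        raise ValueError("need at least one run after warm-up")
    ttft = [r.first_token_s - r.start_s for r in measured]
    tok_s = [r.output_tokens / (r.end_s - r.start_s) for r in measured]
    return {"ttft_s": mean(ttft), "tokens_per_s": mean(tok_s)}


if __name__ == "__main__":
    # Synthetic timings: the first run is slow, as a cold start would be.
    runs = [
        RunTiming(0.0, 5.0, 10.0, 10),
        RunTiming(0.0, 0.5, 2.0, 100),
        RunTiming(0.0, 0.5, 2.0, 100),
    ]
    print(summarize(runs, warmup=1))
```

Dropping the warm-up runs keeps one-time costs such as model loading and kernel compilation out of the reported averages, which is what makes repeated runs comparable.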

Running LLMs Locally: Practical LLM Performance on DGX Spark - Mozhgan Kabiri Chimeh, NVIDIA