An experimental exploration of running agentic swarms of small language models on a single NVIDIA DGX Spark (128 GB unified memory). Using Gemma 3 270m and 1B models, with a dedicated Ollama daemon per worker, the author benchmarks process-level concurrency from 1 to 64 workers. Results show that Gemma 3 270m sustains interactive latency at 64 workers (~27,400 aggregate tokens/sec) and Gemma 3 1B at 48 workers (~9,350 aggregate tokens/sec), with first-token p95 latency staying nearly flat. The post argues that the future of AI infrastructure may involve swarms of smaller cooperating models rather than single large models, and notes that Cloudflare is reportedly planning distributed inference mesh infrastructure.
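
To put the aggregate figures in per-worker terms, here is a minimal sketch that divides each reported aggregate throughput by its worker count. It assumes load is spread evenly across workers, which the summary does not state; the `per_worker_rate` helper and the dictionary layout are illustrative names, not taken from the post.

```python
# Approximate aggregate figures quoted in the summary above.
reported = {
    "Gemma 3 270m": {"workers": 64, "aggregate_tok_s": 27_400},
    "Gemma 3 1B": {"workers": 48, "aggregate_tok_s": 9_350},
}


def per_worker_rate(aggregate_tok_s: float, workers: int) -> float:
    """Mean tokens/sec per worker, ASSUMING an even split across workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return aggregate_tok_s / workers


if __name__ == "__main__":
    for model, r in reported.items():
        rate = per_worker_rate(r["aggregate_tok_s"], r["workers"])
        print(f"{model}: ~{rate:.0f} tokens/sec per worker at {r['workers']} workers")
```

Under that even-split assumption, this works out to roughly 428 tokens/sec per worker for the 270m model and roughly 195 tokens/sec per worker for the 1B model.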

6m read time
From callstack.com
Table of contents
How many small Gemmas can a single DGX Spark serve?
Checking the results
One caveat: memory reporting on DGX Spark
What this proves
