A hands-on look at agentic swarms, distributed inference, and what small-model benchmarks suggest about future AI infrastructure.

Callstack Blog

An experimental exploration of running agentic swarms of small language models on a single NVIDIA DGX Spark (128 GB unified memory). Using Gemma 3 270M and Gemma 3 1B, with a dedicated Ollama daemon per worker, the author benchmarks process-level concurrency from 1 to 64 workers. Gemma 3 270M sustains interactive latency at 64 workers (~27,400 aggregate tokens/sec) and Gemma 3 1B at 48 workers (~9,350 tokens/sec), while p95 first-token latency stays nearly flat. The post argues that the future of AI infrastructure may involve swarms of smaller cooperative models rather than single large models, and notes that Cloudflare is reportedly planning distributed inference mesh infrastructure.

Agencies or Swarms? What Small Model Cooperation Means for AI Engineering

How many small Gemmas can a single DGX Spark serve?
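
To put the headline numbers in per-worker terms, here is a minimal Python sketch of the arithmetic. It assumes both reported throughput figures are aggregates that split evenly across workers, which real runs will only approximate. The helper names (`per_worker_tps`, `p95`) and the nearest-rank percentile method are illustrative choices, not taken from the benchmark harness.

```python
# Approximate figures as reported in the summary; treating the 1B number
# as an aggregate is an assumption.
REPORTED = {
    "Gemma 3 270M": {"workers": 64, "aggregate_tps": 27_400},
    "Gemma 3 1B": {"workers": 48, "aggregate_tps": 9_350},
}


def per_worker_tps(workers: int, aggregate_tps: float) -> float:
    """Average tokens/sec seen by one worker, assuming an even split."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return aggregate_tps / workers


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile (one common definition among several)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    for name, row in REPORTED.items():
        rate = per_worker_tps(row["workers"], row["aggregate_tps"])
        print(f"{name}: ~{rate:.0f} tokens/sec per worker")
```

Under the even-split assumption this works out to roughly 428 tokens/sec per worker for the 270M model and roughly 195 for the 1B model. The `p95` helper can be applied to a list of measured first-token latencies to check whether the tail stays flat as the worker count grows.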

One caveat: memory reporting on DGX Spark