
NVIDIA's Multi-Process Service (MPS) can significantly improve GPU throughput for small language models by letting multiple inference processes share a single GPU and overlap their operations. Rigorous testing shows that MPS delivers throughput gains of 50-100% or more for models of ≤3B parameters with short contexts (<2k tokens) and prefill-heavy workloads. These gains come primarily from kernel overlap during attention operations, when an individual engine underutilizes compute or memory bandwidth. The benefits diminish rapidly for larger models (7B+) or longer contexts (>8k tokens), where a single engine already saturates the GPU. MPS also helps recover GPU time lost to CPU bottlenecks such as scheduler overhead. However, it introduces operational complexity: daemon management, harder debugging, and fragile failure modes in which one misbehaving process can affect all co-located engines.
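To make the co-location setup concrete, here is a minimal dry-run sketch of how several engines might be placed behind one MPS daemon. It only builds the environment and command lines and launches nothing. The `nvidia-cuda-mps-control -d` command and the `CUDA_MPS_*` environment variables are standard MPS controls. The engine command `python serve_engine.py`, its `--port` flag, and the directory paths are hypothetical placeholders, not part of the original setup.

```python
import os
import shlex


def mps_env(pipe_dir, log_dir, active_thread_pct=None):
    """Build the environment shared by the MPS daemon and its client engines.

    Clients attach to the daemon whose pipe directory matches their own
    CUDA_MPS_PIPE_DIRECTORY, so every process must see the same value.
    """
    env = dict(os.environ)
    env["CUDA_MPS_PIPE_DIRECTORY"] = pipe_dir
    env["CUDA_MPS_LOG_DIRECTORY"] = log_dir
    if active_thread_pct is not None:
        # Optional cap on the share of SMs each client may use.
        env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(active_thread_pct)
    return env


def launch_plan(num_engines, engine_cmd, base_port=8000):
    """Return the commands to run, in order: the daemon first, then each engine.

    engine_cmd and the --port flag are hypothetical; substitute the real
    launch command of whatever inference server is in use.
    """
    plan = [["nvidia-cuda-mps-control", "-d"]]  # start the MPS control daemon
    for i in range(num_engines):
        plan.append(shlex.split(engine_cmd) + ["--port", str(base_port + i)])
    return plan


if __name__ == "__main__":
    env = mps_env("/tmp/nvidia-mps", "/tmp/nvidia-mps-log", active_thread_pct=50)
    for cmd in launch_plan(2, "python serve_engine.py"):
        # Dry run: print instead of subprocess.Popen(cmd, env=env).
        print(" ".join(cmd))
```

Replacing the `print` with `subprocess.Popen(cmd, env=env)` would turn the plan into real launches. The daemon is normally shut down afterwards by piping `quit` to `nvidia-cuda-mps-control`. Because all engines depend on the same daemon, this shared environment is also where the operational fragility described above originates.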

Scaling Small LLMs with NVIDIA MPS

The Scaling Landscape: When Does MPS Help?

Dissecting the Gains: Where Do MPS Benefits Really Come From?