Shepherd Model Gateway (SMG) is an open-source, Rust-based LLM serving gateway that disaggregates CPU workloads from GPU inference engines. The core insight is that Python's GIL bottlenecks tokenization, detokenization, and other CPU-bound tasks, forcing expensive GPUs to sit idle. SMG moves all of that CPU work into a native Rust gRPC layer: tokenization with a two-level cache, multimodal preprocessing, tool orchestration, reasoning parsing, and chat history management. Benchmarks on H100 GPUs show up to 3.5x throughput improvement for long-context workloads (7800-token inputs) and roughly 8% gains at high concurrency. The gateway supports SGLang, vLLM, TensorRT-LLM, and MLX backends, plus external providers such as OpenAI and Anthropic. It also implements five agentic APIs natively, along with WASM middleware plugins, MCP tool orchestration, and enterprise features such as mTLS, JWT/OIDC, and distributed rate limiting. SMG is in production at Google Cloud, Oracle Cloud, Alibaba Cloud, and TogetherAI.
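The post does not show the internals of the two-level tokenization cache, so the following is only a rough sketch of the general pattern in Rust: a small per-worker "hot" L1 map in front of a larger shared L2 map, with the (expensive) tokenizer as the miss path. All names, the eviction policy, and the stand-in tokenizer are hypothetical, not SMG's actual API.

```rust
use std::collections::HashMap;

// Hypothetical two-level tokenization cache: L1 is a small hot cache,
// L2 is a larger shared cache, and a miss in both runs the tokenizer.
struct TwoLevelCache {
    l1: HashMap<String, Vec<u32>>, // small, per-worker hot cache
    l2: HashMap<String, Vec<u32>>, // larger cache shared across requests
    l1_cap: usize,
}

impl TwoLevelCache {
    fn new(l1_cap: usize) -> Self {
        Self { l1: HashMap::new(), l2: HashMap::new(), l1_cap }
    }

    // `tokenize` stands in for a real tokenizer call (the costly path).
    fn encode(&mut self, text: &str, tokenize: impl Fn(&str) -> Vec<u32>) -> Vec<u32> {
        if let Some(ids) = self.l1.get(text) {
            return ids.clone(); // L1 hit: no tokenizer work at all
        }
        if let Some(ids) = self.l2.get(text).cloned() {
            self.promote(text, ids.clone()); // L2 hit: promote to L1
            return ids;
        }
        let ids = tokenize(text); // miss in both levels: tokenize for real
        self.l2.insert(text.to_string(), ids.clone());
        self.promote(text, ids.clone());
        ids
    }

    fn promote(&mut self, text: &str, ids: Vec<u32>) {
        if self.l1.len() >= self.l1_cap {
            self.l1.clear(); // crude eviction; a real cache would use LRU
        }
        self.l1.insert(text.to_string(), ids);
    }
}

fn main() {
    let mut cache = TwoLevelCache::new(128);
    // Stand-in "tokenizer": one token (the word length) per word.
    let tok = |s: &str| s.split_whitespace().map(|w| w.len() as u32).collect::<Vec<u32>>();
    let first = cache.encode("shared system prompt", &tok);
    let second = cache.encode("shared system prompt", &tok); // served from L1
    assert_eq!(first, second);
    println!("{}", first.len()); // prints 3
}
```

The point of the pattern is that repeated prompt material (shared system prompts, few-shot prefixes) never re-enters the tokenizer, which is exactly the CPU work SMG pulls off the GPU serving path.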

12 min read. From pytorch.org
Table of contents

- How It Started: Hitting the GIL Wall at Scale
- The gRPC Re-Architecture: Making It Real
- What SMG Delivers Today
- Proving the Thesis: gRPC Gateway Benchmarks
- The Landscape
- Production Adoption
- What's Next
- Acknowledgement
