Shepherd Model Gateway (SMG) is an open-source, Rust-based LLM serving gateway that disaggregates CPU workloads from GPU inference engines. The core insight is that Python's GIL bottlenecks tokenization, detokenization, and other CPU-bound tasks, forcing expensive GPUs to sit idle. SMG moves all of that CPU work into a native Rust gRPC layer: tokenization with a two-level cache, multimodal preprocessing, tool orchestration, reasoning parsing, and chat history management. Benchmarks on H100 GPUs show up to 3.5x throughput improvement for long-context workloads (7800-token inputs) and roughly 8% gains at high concurrency. The gateway supports SGLang, vLLM, TensorRT-LLM, and MLX backends, plus external providers such as OpenAI and Anthropic. It also implements five agentic APIs natively, along with WASM middleware plugins, MCP tool orchestration, and enterprise features such as mTLS, JWT/OIDC, and distributed rate limiting. SMG is in production at Google Cloud, Oracle Cloud, Alibaba Cloud, and TogetherAI.
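The post does not show the internals of the two-level tokenization cache, so the following is only a rough sketch of the general pattern in Rust: a small per-worker "hot" L1 map in front of a larger shared L2 map, with the (expensive) tokenizer as the miss path. All names, the eviction policy, and the stand-in tokenizer are hypothetical, not SMG's actual API.

```rust
use std::collections::HashMap;

// Hypothetical two-level tokenization cache: L1 is a small hot cache,
// L2 is a larger shared cache, and a miss in both runs the tokenizer.
struct TwoLevelCache {
    l1: HashMap<String, Vec<u32>>, // small, per-worker hot cache
    l2: HashMap<String, Vec<u32>>, // larger cache shared across requests
    l1_cap: usize,
}

impl TwoLevelCache {
    fn new(l1_cap: usize) -> Self {
        Self { l1: HashMap::new(), l2: HashMap::new(), l1_cap }
    }

    // `tokenize` stands in for a real tokenizer call (the costly path).
    fn encode(&mut self, text: &str, tokenize: impl Fn(&str) -> Vec<u32>) -> Vec<u32> {
        if let Some(ids) = self.l1.get(text) {
            return ids.clone(); // L1 hit: no tokenizer work at all
        }
        if let Some(ids) = self.l2.get(text).cloned() {
            self.promote(text, ids.clone()); // L2 hit: promote to L1
            return ids;
        }
        let ids = tokenize(text); // miss in both levels: tokenize for real
        self.l2.insert(text.to_string(), ids.clone());
        self.promote(text, ids.clone());
        ids
    }

    fn promote(&mut self, text: &str, ids: Vec<u32>) {
        if self.l1.len() >= self.l1_cap {
            self.l1.clear(); // crude eviction; a real cache would use LRU
        }
        self.l1.insert(text.to_string(), ids);
    }
}

fn main() {
    let mut cache = TwoLevelCache::new(128);
    // Stand-in "tokenizer": one token (the word length) per word.
    let tok = |s: &str| s.split_whitespace().map(|w| w.len() as u32).collect::<Vec<u32>>();
    let first = cache.encode("shared system prompt", &tok);
    let second = cache.encode("shared system prompt", &tok); // served from L1
    assert_eq!(first, second);
    println!("{}", first.len()); // prints 3
}
```

The point of the pattern is that repeated prompt material (shared system prompts, few-shot prefixes) never re-enters the tokenizer, which is exactly the CPU work SMG pulls off the GPU serving path.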

12 min read. From pytorch.org
Table of contents

- How It Started: Hitting the GIL Wall at Scale
- The gRPC Re-Architecture: Making It Real
- What SMG Delivers Today
- Proving the Thesis: gRPC Gateway Benchmarks
- The Landscape
- Production Adoption
- What's Next
- Acknowledgement
