Load balancing for LLMs differs fundamentally from traditional services because of prompt/KV caching. Naive round-robin routing across N replicas cuts the cache hit probability to 1/N: a request whose prefix is cached on exactly one replica reaches that replica only one time in N, eroding the 50-90% cost savings and up to 80% TTFT reduction that caching provides. The post covers routing strategies from round-robin and consistent hashing to cache-aware routing (using radix trees with LRU eviction) and precise prefix cache-aware routing, which consumes real-time KV cache events from engines to make accurate routing decisions. It also covers disaggregated prefill/decode serving, hardware selection based on arithmetic intensity, KV cache transfer technologies (NCCL, NIXL, Mooncake), and the future direction of shared cross-replica cache layers.
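To make the cache-aware idea concrete, here is a minimal sketch of prefix-affinity routing. The replica names, the flat prefix-to-owner dict, and the least-loaded fallback are illustrative assumptions, not any particular router's API; a production router of the kind described in the post would use a radix tree with LRU eviction rather than an unbounded dict.

```python
class PrefixRouter:
    """Toy cache-aware router: remembers which replica last served each
    token prefix and routes a new request to the replica holding the
    longest matching prefix, so its KV cache can be reused."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.load = {r: 0 for r in replicas}
        # Flat map: token-tuple prefix -> replica assumed to hold its KV
        # cache. A real router would use a radix tree with LRU eviction
        # so memory stays bounded.
        self.prefix_owner = {}

    def route(self, tokens):
        # Longest-prefix match: scan from the full request down to a
        # single token and stop at the first cached prefix.
        target = None
        for i in range(len(tokens), 0, -1):
            target = self.prefix_owner.get(tuple(tokens[:i]))
            if target is not None:
                break
        # Cache miss everywhere: fall back to the least-loaded replica.
        if target is None:
            target = min(self.replicas, key=self.load.get)
        # Record every prefix of this request as now cached on `target`.
        for i in range(1, len(tokens) + 1):
            self.prefix_owner[tuple(tokens[:i])] = target
        self.load[target] += 1
        return target


router = PrefixRouter(["replica-0", "replica-1", "replica-2"])
system_prompt = [101, 102, 103, 104]            # shared prefix tokens
print(router.route(system_prompt + [7, 8]))     # cold start: least-loaded
print(router.route(system_prompt + [9, 10]))    # shared prefix: same replica
```

The second request lands on the same replica as the first because they share the system-prompt prefix, which is exactly the affinity that round-robin destroys.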
Table of contents
- Inferencing engines
- Routing in homogeneous instances
- Dis-aggregated serving for large sequence lengths