Load balancing for LLMs differs fundamentally from load balancing for traditional services because of prompt/KV caching. Naive round-robin routing across N replicas drops the probability that a request lands on the replica holding its cached prefix to roughly 1/N, eroding the 50-90% cost savings and up to 80% TTFT reduction that caching provides. This post covers inferencing engines, routing in homogeneous instances, and dis-aggregated serving for large sequence lengths.
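To make the 1/N intuition concrete, here is a minimal sketch (not from the post) that simulates two routing policies over a toy prefix cache: round-robin versus pinning each session to one replica by hashing its id. The model and all names (`simulate`, the session/turn traffic shape, the hit-counting rule) are illustrative assumptions, not a real serving stack.

```python
import random

def simulate(num_replicas: int, num_sessions: int,
             turns_per_session: int, strategy: str) -> float:
    """Return the prefix-cache hit rate under a routing strategy.

    Simplified model: a replica scores a hit if it has already served
    any earlier turn of the same session (its prefix is then cached).
    """
    caches = [set() for _ in range(num_replicas)]  # session ids cached per replica
    rr_counter = 0
    hits, total = 0, 0

    # Interleave turns from all sessions to mimic concurrent traffic.
    events = [s for s in range(num_sessions) for _ in range(turns_per_session)]
    random.shuffle(events)

    for session in events:
        if strategy == "round_robin":
            replica = rr_counter % num_replicas
            rr_counter += 1
        else:  # "affinity": hash the session id to a fixed replica
            replica = hash(session) % num_replicas
        if session in caches[replica]:
            hits += 1
        else:
            caches[replica].add(session)  # prefix now cached on this replica
        total += 1
    return hits / total

if __name__ == "__main__":
    random.seed(0)
    for strategy in ("round_robin", "affinity"):
        rate = simulate(num_replicas=8, num_sessions=1000,
                        turns_per_session=4, strategy=strategy)
        print(f"{strategy:12s} hit rate: {rate:.2f}")
```

Under these assumptions, the affinity policy hits on every turn after a session's first (about 0.75 for 4-turn sessions), while round-robin hits only when a follow-up turn happens to land on a replica that has already seen the session, collapsing toward the 1/N regime.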
Table of contents
Inferencing engines
Routing in homogeneous instances
Dis-aggregated serving for large sequence lengths