Prompt caching within a single LLM replica is well understood, but scaling it across many replicas introduces a cache hit rate problem: under round-robin load balancing across N replicas, the probability that a request lands on the replica holding its cached prefix drops to 1/N. This post covers architectural strategies to fix that: session affinity routing to pin user sessions to specific replicas, tiered prefix caching (shared system prompts vs. session-specific context), prefix-aware load balancing with consistent hashing for multi-task deployments, and a future-looking shared CPU DRAM cache layer. Latency tradeoffs for each approach are quantified (local VRAM ~0-2ms vs. cross-node ~40-120ms). Practical guidance covers prompt structure discipline (static tokens first), key metrics to monitor (cache hit rate, TTFT, per-replica utilization), and when shared caching becomes worthwhile based on prefix length and recomputation cost.
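To make the 1/N claim concrete, here is a minimal simulation sketch rather than anything taken from the post itself. It assumes a session's up-to-date prefix is cached only on the replica that served its previous turn, and that sessions arrive in random order each turn. The function names (`pick_by_affinity`, `follow_up_hit_rate`) and the hash-modulo pinning scheme are hypothetical, chosen for illustration only.

```python
import hashlib
import random
from itertools import cycle


def pick_by_affinity(session_id: str, n_replicas: int) -> int:
    """Pin a session to one replica by hashing its ID (illustrative scheme)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_replicas


def follow_up_hit_rate(policy: str, n_replicas: int = 4,
                       n_sessions: int = 1000, turns: int = 20,
                       seed: int = 0) -> float:
    """Fraction of follow-up requests routed to the replica that served
    the same session's previous turn (assumed to hold its cached prefix)."""
    rng = random.Random(seed)
    round_robin = cycle(range(n_replicas))
    sessions = [f"session-{i}" for i in range(n_sessions)]
    last_replica = {}  # session ID -> replica that served its latest turn
    hits = follow_ups = 0
    for _ in range(turns):
        rng.shuffle(sessions)  # assumption: random arrival order per turn
        for sid in sessions:
            if policy == "round_robin":
                replica = next(round_robin)
            else:
                replica = pick_by_affinity(sid, n_replicas)
            if sid in last_replica:  # first turns are cold and not counted
                follow_ups += 1
                hits += last_replica[sid] == replica
            last_replica[sid] = replica
    return hits / follow_ups


if __name__ == "__main__":
    print("round robin:", follow_up_hit_rate("round_robin"))
    print("affinity:   ", follow_up_hit_rate("affinity"))
```

Under these assumptions the round-robin rate lands near 1/N (about 0.25 for four replicas), while affinity routing sends every follow-up to the warm replica. Plain hash-modulo pinning remaps most sessions when the replica count changes, which is one reason consistent hashing is the usual choice for prefix-aware balancing.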

8m read time · From digitalocean.com
Table of contents
The Single-Replica Ceiling
Session Affinity
Tiered Prompt Caching for Multi-Task Deployments
The Ideal Prompt Caching Architecture
Notes on Prompt Structure Best Practices
Conclusion
