Advanced Prompt Caching at Scale

Prompt caching within a single LLM replica is well understood, but scaling it across many replicas introduces a cache hit rate problem: under round-robin load balancing across N replicas, the chance that a request lands on the replica already holding its cached prefix drops to roughly 1/N. This post covers architectural strategies to fix that: session affinity routing, which pins each user session to a specific replica; tiered prefix caching, which separates shared system prompts from session-specific context; prefix-aware load balancing with consistent hashing for multi-task deployments; and a forward-looking shared CPU DRAM cache layer. Latency tradeoffs for each approach are quantified (local VRAM ~0-2 ms vs. cross-node ~40-120 ms). Practical guidance covers prompt structure discipline (static tokens first), key metrics to monitor (cache hit rate, TTFT, per-replica utilization), and when shared caching becomes worthwhile, based on prefix length and recomputation cost.
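To make the routing idea concrete, here is a minimal sketch of a consistent-hash router that could back either session affinity (keyed on a session ID) or prefix-aware balancing (keyed on the static system prompt). It is not taken from the article: the `PrefixRouter` class, the replica names, the key formats, and the virtual-node count are all hypothetical choices made for illustration.

```python
import hashlib
from bisect import bisect_right


class PrefixRouter:
    """Hypothetical consistent-hash ring mapping a routing key to a replica."""

    def __init__(self, replicas, vnodes=64):
        # Each replica gets `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{name}#{i}"), name)
            for name in replicas
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text):
        # Stable across processes, unlike Python's built-in hash().
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def route(self, key):
        # Pick the first ring point after the key's hash, wrapping at the end.
        idx = bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


# Assumed replica names, for illustration only.
replicas = ["replica-0", "replica-1", "replica-2", "replica-3"]
router = PrefixRouter(replicas)

# Session affinity: every turn of the same session maps to the same replica.
session_target = router.route("session:1234")

# Prefix-aware routing: requests sharing a system prompt share a replica.
system_prompt = "You are a support assistant for ExampleCo."
prefix_target = router.route("prefix:" + system_prompt)
```

Because the same key always hashes to the same ring position, repeated requests reach the replica that already holds the cached prefix instead of hitting it only about 1/N of the time. With this ring construction, removing a replica remaps only the keys that replica owned, so the caches on the remaining replicas stay warm.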

#vllm

#ai-inference

Apr 07 • 8m read time • From digitalocean.com

Table of contents

The Single-Replica Ceiling Session Affinity Tiered Prompt Caching for Multi-Task Deployments The Ideal Prompt Caching Architecture Notes on Prompt Structure Best Practices Conclusion
