Continuous batching is an optimization technique for serving large language models that maximizes throughput by combining three strategies: KV caching, which avoids recomputing keys and values for past tokens; chunked prefill, which splits long prompts into pieces that fit within memory constraints; and ragged batching with dynamic scheduling, which eliminates padding waste. Ragged batching removes the traditional batch dimension by concatenating sequences and uses attention masks to control which tokens may interact. This lets prefill and decode phases share the same batch, so multiple concurrent requests with different sequence lengths are processed efficiently.
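To make the masking idea concrete, here is a minimal sketch of how one attention mask might let a prefill request and a decode request share a batch with no padding. It assumes each request is described by a hypothetical `(num_cached, num_new)` pair: the number of tokens already in its KV cache and the number of new tokens processed in this step. The function name `ragged_attention_mask` and the layout (queries are all new tokens concatenated; keys are each request's cached tokens followed by its new tokens) are illustrative choices, not the API of any particular serving library.

```python
from typing import List, Tuple


def ragged_attention_mask(requests: List[Tuple[int, int]]) -> List[List[bool]]:
    """Build a boolean mask for a ragged (padding-free) batch.

    requests: one (num_cached, num_new) pair per request (hypothetical layout).
    Returns mask[q][k], True where query token q may attend to key token k.
    """
    total_q = sum(new for _, new in requests)
    total_k = sum(cached + new for cached, new in requests)
    mask = [[False] * total_k for _ in range(total_q)]

    q_off = 0  # index of this request's first query row
    k_off = 0  # index of this request's first key column
    for cached, new in requests:
        for i in range(new):
            # Position of this query inside its own sequence.
            pos = cached + i
            # Causal rule: attend to own keys at positions 0..pos only.
            for j in range(pos + 1):
                mask[q_off + i][k_off + j] = True
        q_off += new
        k_off += cached + new
    return mask


# Request A: prefill of 3 prompt tokens, nothing cached yet.
# Request B: decode of 1 new token with 4 tokens already cached.
mask = ragged_attention_mask([(0, 3), (4, 1)])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The three prefill rows form a causal triangle over request A's own keys, while the single decode row sees only request B's five keys. Because the blocks never overlap, tokens from different requests cannot attend to each other even though they sit side by side in the same tensor.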