Synchronous continuous batching leaves the GPU idle for nearly 25% of the runtime while the CPU prepares the next batch. Four additions let CPU and GPU work overlap almost entirely: separate CUDA streams for host-to-device (H2D) transfers, compute, and device-to-host (D2H) transfers; CUDA events for cross-stream ordering; double-buffered input slots to avoid race conditions; and a carry-over mechanism that feeds each batch's output tokens in as inputs to the next batch. On an 8B model generating 8K tokens at batch size 32, this raises GPU utilization from 76% to 99.4% and cuts generation time by 22%, with no model or kernel changes. The implementation is available in the Hugging Face Transformers library.
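As a rough consistency check on these figures, the sketch below estimates the time saving implied by the utilization numbers alone. It assumes the total GPU-busy time is identical in the synchronous and asynchronous runs, which is a simplification made here and not something the measurements state; the helper name `time_reduction` is purely illustrative.

```python
def time_reduction(util_before: float, util_after: float) -> float:
    """Fractional drop in wall-clock time, ASSUMING GPU-busy time is unchanged.

    busy = utilization * total, so with a constant busy time:
        total_after / total_before = util_before / util_after
    """
    if not (0.0 < util_before <= util_after <= 1.0):
        raise ValueError("expected 0 < util_before <= util_after <= 1")
    return 1.0 - util_before / util_after


# Utilization figures quoted above: 76% synchronous, 99.4% asynchronous.
reduction = time_reduction(0.76, 0.994)
print(f"estimated time reduction: {reduction:.1%}")
```

Under that assumption the estimate comes out at roughly 23.5%, in the same range as the reported 22%. The small gap is expected, since the GPU-busy time need not be exactly equal across the two runs.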
Table of contents
Synchronous batching
Creating concurrency
Enforcing synchronization
Filling the vacuum
The full async loop
Does it actually work?
Conclusion