Paged Attention is a memory optimization technique for LLM inference that borrows the idea of virtual memory paging from operating systems. Traditional KV cache implementations pre-allocate one large contiguous memory block per request. Because of fragmentation and over-reservation, only 20-38% of that reserved memory ends up holding actual token data. Paged Attention instead divides the KV cache into small fixed-size blocks (typically 16 tokens each) that can sit anywhere in GPU memory, and a per-request block table maps each logical block to its physical location. Requests that share the same system prompt can point their block tables at the same physical blocks, so the prompt is stored only once. The approach is reported to deliver 2-4x higher throughput at equivalent latency with near-zero memory waste. vLLM implements Paged Attention as its core algorithm, and TensorRT-LLM and SGLang have adopted similar mechanisms.
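The block-table idea can be illustrated with a minimal Python sketch. This is not vLLM's code: `BlockAllocator`, `build_block_table`, and `locate` are hypothetical names invented for this example, physical blocks are plain integers rather than GPU buffers, and the shared prompt is assumed to fill whole blocks (handling a partially filled shared block is left out).

```python
BLOCK_SIZE = 16  # tokens per block, the typical value mentioned above


class BlockAllocator:
    """Toy pool of physical blocks with reference counts (hypothetical)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.refcount = {}                   # physical block id -> users

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # Another block table now points at the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


def blocks_needed(num_tokens):
    # Ceiling division: a partially filled last block still needs a block.
    return -(-num_tokens // BLOCK_SIZE)


def build_block_table(allocator, num_tokens, shared_prefix=()):
    """Return a list mapping logical block index -> physical block id."""
    table = [allocator.share(b) for b in shared_prefix]
    while len(table) < blocks_needed(num_tokens):
        table.append(allocator.allocate())
    return table


def locate(table, token_index):
    """Translate a token position into (physical block id, offset)."""
    return table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8)
    # A 32-token system prompt fills exactly two blocks.
    prompt = build_block_table(allocator, 32)
    # Two requests reuse the prompt blocks and add their own.
    req_a = build_block_table(allocator, 40, shared_prefix=prompt)
    req_b = build_block_table(allocator, 50, shared_prefix=prompt)
    print("request A block table:", req_a)
    print("request B block table:", req_b)
    print("token 35 of A lives at:", locate(req_a, 35))
```

Both requests list the same two physical block ids at the start of their tables, so the prompt's keys and values occupy memory once. Blocks are handed out on demand, which means a request wastes at most the unused tail of its last block.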

6 min read. From blog.dailydoseofds.com
Table of contents:
A 37-year-old paper is trending now in AI!
Paged Attention in LLMs
