Paged Attention is a memory optimization technique for LLM inference that borrows virtual paging from operating systems. Traditional KV cache implementations pre-allocate a large contiguous memory block per request, sized for the maximum possible sequence length, which leads to only 20-38% effective GPU memory utilization due to fragmentation and over-reservation. Paged Attention instead partitions each request's KV cache into small fixed-size blocks that can live anywhere in GPU memory; a per-request block table maps logical token positions to physical blocks, so memory is allocated on demand and fragmentation is largely eliminated.
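The bookkeeping side of this idea can be sketched in a few lines. The following is a minimal illustration, not the vLLM implementation: all names (`PagedKVCacheManager`, `append_token`, `BLOCK_SIZE`) are hypothetical, and the block size of 16 tokens is just a plausible choice. A free list of physical blocks plays the role of an OS page allocator, and each request's block table maps logical block indices to physical ones.

```python
# Minimal sketch of paged KV-cache bookkeeping (hypothetical names; this is
# an illustration of the idea, not the vLLM implementation).

BLOCK_SIZE = 16  # tokens stored per block (assumed value)

class PagedKVCacheManager:
    def __init__(self, num_physical_blocks: int):
        # Free list of physical block ids, like an OS page allocator.
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}  # request id -> block table
        self.seq_lens: dict[int, int] = {}            # request id -> tokens stored

    def append_token(self, request_id: int) -> tuple[int, int]:
        """Reserve a KV slot for one new token; allocate a block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        pos = self.seq_lens.get(request_id, 0)
        if pos % BLOCK_SIZE == 0:  # current block is full (or this is token 0)
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[request_id] = pos + 1
        # Physical location of this token's KV entry: (block id, in-block offset).
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, request_id: int) -> None:
        """Return all of a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.seq_lens.pop(request_id, None)
```

Because blocks are handed out one at a time, a request generating 17 tokens holds exactly 2 blocks (32 slots) instead of a worst-case contiguous reservation, and finished requests return their blocks immediately for reuse by others.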
From blog.dailydoseofds.com