A 37-year-old paper is trending now in AI! Paged Attention in LLMs

Paged Attention is a memory optimization technique for LLM inference that borrows virtual paging from operating systems. Traditional KV cache implementations pre-allocate large contiguous memory blocks per request, leading to only 20-38% effective GPU memory utilization due to fragmentation and over-reservation. Paged Attention instead partitions each request's KV cache into small fixed-size blocks that can live anywhere in GPU memory, and a per-request block table maps logical token positions to physical blocks, just as an OS page table maps virtual pages to physical frames.

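To make the analogy concrete, here is a minimal Python sketch of the block-table idea, not vLLM's actual implementation; the names (PagedKVCache, BLOCK_SIZE, append_token) and the block size of 16 tokens are illustrative assumptions:

```python
BLOCK_SIZE = 16  # tokens per block; small fixed "pages" instead of one big slab


class PagedKVCache:
    """Toy allocator: hands out fixed-size physical blocks on demand and
    keeps a per-request block table, mirroring an OS page table."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))  # global free list
        self.block_tables = {}  # request_id -> list of physical block ids
        self.lengths = {}       # request_id -> tokens cached so far

    def append_token(self, request_id: int):
        """Reserve a (block, offset) slot for a request's next token,
        grabbing a new physical block only when the last one fills up."""
        n = self.lengths.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if n % BLOCK_SIZE == 0:  # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = n + 1
        # Logical position n resolves to this physical block and offset.
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE

    def free_request(self, request_id: int):
        """Return a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_physical_blocks=1024)
for _ in range(33):  # 33 tokens occupy 3 blocks, allocated lazily
    block, offset = cache.append_token(request_id=0)
cache.free_request(0)  # blocks go back to the pool for other requests
```

Because memory is reserved one small block at a time and reclaimed into a shared pool, waste is bounded by at most one partially filled block per request rather than by a worst-case contiguous reservation.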