Daily Dose of DS offers a daily dose of inspiration, education, and motivation for data scientists and aspiring data professionals. Through bite-sized articles, tutorials, and curated resources, readers embark on a journey to master the art and science of data analysis, machine learning, and artificial intelligence. By staying updated with the latest trends, techniques, and tools in data science, readers can hone their skills and stay ahead in this rapidly evolving field.

Daily Dose of Data Science | Avi Chawla | Substack

Paged Attention is a memory optimization technique for LLM inference that borrows virtual memory paging from operating systems. Traditional KV cache implementations pre-allocate one large contiguous memory region per request, typically sized for the maximum sequence length. Because of fragmentation and over-reservation, only 20-38% of that reserved memory ends up holding actual token states. Paged Attention instead divides the KV cache into small fixed-size blocks (typically 16 tokens each) that can be placed anywhere in GPU memory, with a per-request block table mapping logical block positions to physical locations. Multiple requests that share the same system prompt can point their block tables at the same physical blocks, eliminating duplicate storage. This approach achieves 2-4x higher throughput at equivalent latency with near-zero memory waste. vLLM implements Paged Attention as its core algorithm, and similar mechanisms have been adopted by TensorRT-LLM and SGLang.
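To make the block-table idea concrete, here is a minimal Python sketch of the bookkeeping only (no tensors, no attention kernel). It is not vLLM's actual code: the names `BlockAllocator`, `Sequence`, `fork`, and `locate` are hypothetical, and sharing is simplified to full blocks with a reference count per physical block.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (the typical size mentioned above)


class BlockAllocator:
    """Hypothetical pool of physical KV blocks with reference counting."""

    def __init__(self, num_blocks: int) -> None:
        # Reversed so that pop() hands out block ids 0, 1, 2, ... in order.
        self.free = list(range(num_blocks - 1, -1, -1))
        self.ref_count = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second block table now points at the same physical block.
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free.append(block)


@dataclass
class Sequence:
    """One request: its block table maps logical block index -> physical block id."""

    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots per sequence sit unused.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def locate(self, pos: int) -> tuple[int, int]:
        """Translate a logical token position to (physical block, offset)."""
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def fork(self, allocator: BlockAllocator) -> Sequence:
        """Start a new request that reuses this sequence's full blocks.

        Simplification: a partially filled last block is not shared.
        """
        full = self.num_tokens // BLOCK_SIZE
        shared = [allocator.share(b) for b in self.block_table[:full]]
        return Sequence(block_table=shared, num_tokens=full * BLOCK_SIZE)


allocator = BlockAllocator(num_blocks=8)

# A 32-token system prompt fills exactly two blocks.
system_prompt = Sequence()
for _ in range(32):
    system_prompt.append_token(allocator)

# Two requests share the prompt's blocks, then grow independently.
req_a = system_prompt.fork(allocator)
req_b = system_prompt.fork(allocator)
for _ in range(5):
    req_a.append_token(allocator)
for _ in range(20):
    req_b.append_token(allocator)

print(req_a.block_table)   # [0, 1, 2]
print(req_b.block_table)   # [0, 1, 3, 4]
print(req_a.locate(35))    # (2, 3)
print(len(allocator.free)) # 3
```

Both requests reference physical blocks 0 and 1 for the shared prompt, so five blocks are in use instead of the seven that separate copies would need. A fuller implementation would also need copy-on-write for partially filled shared blocks, which this sketch leaves out.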

Paged Attention in LLMs

A 37-year-old paper is trending now in AI!