Part 13 of an LLMOps crash course focused on LLM inference and optimization. Covers the prefill and decode phases, KV caching and its optimizations (PagedAttention, prefix caching), attention-level techniques (FlashAttention, GQA), speculative decoding, model parallelism strategies, and hands-on comparisons between vLLM and
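To ground the hands-on portion the summary mentions, here is a minimal sketch of offline generation with vLLM, which implements PagedAttention and prefix caching under the hood. The model id and sampling settings below are assumptions for illustration; any Hugging Face model id supported by vLLM works.

```python
# Minimal vLLM offline-inference sketch (model choice is hypothetical).
from vllm import LLM, SamplingParams

# Loading the model builds the paged KV-cache manager automatically.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Decoding settings: temperature and output length are illustrative values.
params = SamplingParams(temperature=0.8, max_tokens=128)

# generate() runs prefill for the prompt, then iterative decode steps.
outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```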

From blog.dailydoseofds.com (2 min read)