Part 13 of an LLMOps crash course focused on LLM inference and optimization. Covers the prefill and decode phases, KV caching and its optimizations (PagedAttention, prefix caching), attention-level techniques (FlashAttention, GQA), speculative decoding, model parallelism strategies, and hands-on comparisons between vLLM and standard inference. Emphasizes that inference optimization is critical for production deployments where costs, latency, and memory constraints determine whether a model is actually usable at scale.
Table of contents
Why care?