vLLM inference performance depends on understanding the prefill phase (compute-bound) and the decode phase (memory-bandwidth bound). GPU selection requires calculating VRAM needs from the model weights (parameters × bytes per parameter) plus the KV cache, which grows with context length and concurrency. Quantization (FP16→INT4) reduces memory by 75%, enabling large models to run on smaller GPUs, while FP8 offers a strong speed-quality balance on modern hardware. Tensor Parallelism pools VRAM across GPUs but adds communication overhead. A 70B FP16 model needs 140GB for weights alone; INT4 quantization reduces this to 35GB. The KV cache for a 70B model with 32k context and 10 concurrent users requires ~112GB (FP16) or ~56GB (FP8). Always reserve 4-8GB of VRAM overhead for system operations.
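The sizing arithmetic above can be sketched as a small calculator. This is a minimal illustration, not vLLM's own memory planner; the attention-configuration values for the 70B model (80 layers, 8 GQA key-value heads, head dimension 128) are assumptions based on common 70B-class architectures, and the helper names are hypothetical:

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Static memory: parameters x bytes per parameter (FP16=2, FP8/INT8=1, INT4=0.5)."""
    return params_billions * bytes_per_param


def kv_cache_gb(num_layers: int, kv_heads: int, head_dim: int,
                context_len: int, concurrent_users: int,
                bytes_per_value: int) -> float:
    """Dynamic memory: 2 (K and V) x layers x kv_heads x head_dim bytes per token,
    scaled by the total resident tokens (context length x concurrent users)."""
    bytes_per_token = 2 * num_layers * kv_heads * head_dim * bytes_per_value
    total_tokens = context_len * concurrent_users
    return bytes_per_token * total_tokens / 1e9


# 70B model weights
print(weight_memory_gb(70, 2))    # FP16 -> 140.0 GB
print(weight_memory_gb(70, 0.5))  # INT4 -> 35.0 GB

# KV cache for a 70B-class model: 32k context, 10 concurrent users
# (80 layers, 8 KV heads, head dim 128 are assumed architecture values)
print(round(kv_cache_gb(80, 8, 128, 32_768, 10, 2), 1))  # FP16 -> 107.4 GB
print(round(kv_cache_gb(80, 8, 128, 32_768, 10, 1), 1))  # FP8  -> 53.7 GB
```

Under these assumed architecture values the KV cache comes out near the article's ~112GB/~56GB figures; the exact numbers depend on the model's layer count and attention configuration, and real deployments should add the 4-8GB system overhead on top.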
Table of contents
- Introduction
- Key Takeaways
- The Anatomy of vLLM Runtime Behavior: Prefill vs. Decode
- Linking Phases to Workloads & Hardware
- KV Cache at Runtime
- Sizing Fundamentals: How Models, Precision, and Hardware Determine Fit
- GPU Hardware Characteristics & Constraints
- Model Weight Footprint (Static Memory)
- KV Cache Requirements (Dynamic Memory)
- Putting the Numbers to the Test: Sizing Scenarios
- Quantization: The Art of “Squeezing” Models
- Putting It All Together: From Requirements to a Deployment Plan
- Frequently Asked Questions
- Practical Use-Cases of vLLM GPU Inference
- Conclusion