LLMs generate text in two phases: prefill, which processes the input prompt, and decoding, which produces output tokens one at a time. Key metrics for LLM serving are time to first token (TTFT), time per output token (TPOT), latency, and throughput. Techniques for optimizing LLM inference include operator fusion, quantization, compression, and parallelization. Batch size affects both latency and throughput.
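To make the relationship between these metrics concrete, here is a minimal sketch in Python. It assumes the common approximation that a request's latency is the time to first token plus the time per output token multiplied by the number of generated tokens, and that every request in a batch finishes together. The function names and sample numbers are hypothetical and serve only as illustration; they are not measurements.

```python
def total_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate end-to-end latency for one request, in seconds.

    Assumption: latency = time to first token + time per output token * tokens generated.
    """
    return ttft_s + tpot_s * output_tokens


def batch_throughput_tokens_per_s(
    batch_size: int, output_tokens: int, latency_s: float
) -> float:
    """Approximate output tokens per second across a batch.

    Assumption: all requests in the batch generate the same number of
    tokens and complete after the same latency.
    """
    return batch_size * output_tokens / latency_s


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the arithmetic.
    latency = total_latency_s(ttft_s=0.5, tpot_s=0.05, output_tokens=100)
    throughput = batch_throughput_tokens_per_s(
        batch_size=8, output_tokens=100, latency_s=latency
    )
    print(f"latency: {latency:.2f} s, throughput: {throughput:.1f} tokens/s")
```

Under these assumptions, a larger batch raises throughput as long as per-request latency does not grow proportionally, which is why batch size is a trade-off between the two metrics.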
Table of contents
Understanding LLM text generation
Important Metrics for LLM Serving
Challenges in LLM Inference
Memory Bandwidth is Key
Model Bandwidth Utilization (MBU)
Benchmarking Results
Optimization Case Study: Quantization
Conclusions and Key Results