LLM inference presents a three-way tension between throughput, latency, and cost that can't be resolved by simply adding more servers. Cost in LLM hosting has four dimensions: hardware acquisition, operational expenses (power/cooling), idle GPU waste, and engineering labor. Key optimization levers include model architecture choice (dense vs. MoE), quantization (FP8 is now the production baseline), parallelism strategy (tensor vs. data vs. expert parallelism), and batch size tuning. The article explains the roofline model, Little's Law for capacity planning, and provides a 7-step decision framework: characterize workload, select model/quantization, benchmark on hardware, find the performance knee, size deployment, calculate TCO per token, and plan autoscaling. Latency-sensitive workloads (chat, code completion) need moderate batch sizes and tensor parallelism, while throughput-heavy workloads (batch summarization, offline pipelines) should maximize batch size and use aggressive quantization.
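As a rough illustration of the Little's Law sizing step, here is a minimal sketch; the request rate, latency, and per-GPU concurrency figures are hypothetical placeholders, not numbers from the article.

```python
# Illustrative sketch (not from the article): using Little's Law to size a deployment.
# Little's Law: concurrency L = arrival rate (lambda) x average residence time W.
# All numbers below are hypothetical placeholders.

import math

requests_per_second = 40.0    # lambda: expected peak request rate
avg_latency_seconds = 2.5     # W: average end-to-end latency per request
concurrent_per_gpu = 16       # batch slots one replica sustains at the target latency

# Concurrent requests in flight at steady state (Little's Law)
in_flight = requests_per_second * avg_latency_seconds

# Replicas needed to hold that concurrency, with ~20% headroom for spikes
replicas = math.ceil(in_flight / concurrent_per_gpu * 1.2)
print(f"in-flight requests: {in_flight:.0f}, GPU replicas needed: {replicas}")
```

The same arithmetic works in reverse: given a fixed replica count, it bounds the arrival rate you can absorb before latency degrades.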
Table of contents
Classic case of "Trilemma"
What Does "Cost" Actually Mean in LLM Inference
The Levers That Dictate Cost
When to Optimize for Throughput vs. Latency
A Decision Framework
Build for Your Workload, Not the Benchmark
References