LLM inference presents a three-way tension between throughput, latency, and cost that can't be resolved by simply adding more servers. Cost in LLM hosting has four dimensions: hardware acquisition, operational expenses (power/cooling), idle GPU waste, and engineering labor. Key optimization levers include model architecture choice (dense vs. MoE), quantization (FP8 is now the production baseline), parallelism strategy (tensor vs. data vs. expert parallelism), and batch size tuning. The article explains the roofline model, Little's Law for capacity planning, and provides a 7-step decision framework: characterize workload, select model/quantization, benchmark on hardware, find the performance knee, size deployment, calculate TCO per token, and plan autoscaling. Latency-sensitive workloads (chat, code completion) need moderate batch sizes and tensor parallelism, while throughput-heavy workloads (batch summarization, offline pipelines) should maximize batch size and use aggressive quantization.
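As a rough illustration of the Little's Law sizing step, here is a minimal sketch; the request rate, latency, and per-GPU concurrency figures are hypothetical placeholders, not numbers from the article.

```python
# Illustrative sketch (not from the article): using Little's Law to size a deployment.
# Little's Law: concurrency L = arrival rate (lambda) x average residence time W.
# All numbers below are hypothetical placeholders.

import math

requests_per_second = 40.0    # lambda: expected peak request rate
avg_latency_seconds = 2.5     # W: average end-to-end latency per request
concurrent_per_gpu = 16       # batch slots one replica sustains at the target latency

# Concurrent requests in flight at steady state (Little's Law)
in_flight = requests_per_second * avg_latency_seconds

# Replicas needed to hold that concurrency, with ~20% headroom for spikes
replicas = math.ceil(in_flight / concurrent_per_gpu * 1.2)
print(f"in-flight requests: {in_flight:.0f}, GPU replicas needed: {replicas}")
```

The same arithmetic works in reverse: given a fixed replica count, it bounds the arrival rate you can absorb before latency degrades.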
Table of contents
Classic case of "Trilemma"
What Does "Cost" Actually Mean in LLM Inference
The Levers That Dictate Cost
When to Optimize for Throughput vs. Latency
A Decision Framework
Build for Your Workload, Not the Benchmark
References