AI inference at scale is far more complex than simply adding more GPUs. The economics hinge on tokens per second per dollar per watt (TPS/$/W) while meeting service-level targets (goodput). Key factors include software frameworks (vLLM, SGLang, TensorRT LLM), disaggregated serving architectures that split prefill and decode phases across different GPUs, rack-scale systems like Nvidia's NVL72 and AMD's Helios, and lower-precision formats like FP4 that boost throughput but risk accuracy loss. Benchmarks from SemiAnalysis InferenceX show a Pareto tradeoff between bulk low-latency tokens and high-throughput bulk tokens, with a 'goldilocks zone' in between. The inference market is commoditizing rapidly, pushing providers to differentiate through hardware specialization, model customization, or fine-tuning services.
Table of contents
Not all tokens are created equalSoftware mattersDisaggregated ComputeDriving the rack-scale transitionAn unrelenting rate of changeMore levers to pullA race to the bottomSort: