Unpacking the deceptively simple science of tokenomics

AI inference at scale is far more complex than simply adding more GPUs. The economics hinge on tokens per second per dollar per watt (TPS/$/W), counting only the throughput that still meets service-level targets (goodput). Key factors include software frameworks (vLLM, SGLang, TensorRT LLM), disaggregated serving architectures that split the prefill and decode phases across different GPUs, rack-scale systems like Nvidia's NVL72 and AMD's Helios, and lower-precision formats like FP4 that boost throughput but risk accuracy loss. Benchmarks from SemiAnalysis InferenceX show a Pareto tradeoff between low-latency interactive tokens and high-throughput bulk tokens, with a 'goldilocks zone' in between. The inference market is commoditizing rapidly, pushing providers to differentiate through hardware specialization, model customization, or fine-tuning services.
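The tradeoff described above can be made concrete with a little arithmetic. The following is a minimal sketch, not the InferenceX methodology: the operating points, the `OperatingPoint` type, both helper functions, and the $2 per GPU-hour price are all hypothetical values chosen for illustration. It filters out configurations that are beaten on both per-user speed and per-GPU throughput, then converts throughput into a cost per million tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    name: str
    interactivity: float  # tokens/s delivered to each user (hypothetical)
    throughput: float     # tokens/s produced per GPU (hypothetical)


def cost_per_million_tokens(gpu_hour_usd: float, throughput: float) -> float:
    """Dollars per million tokens for one GPU at a given tokens/s."""
    tokens_per_hour = throughput * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


def pareto_frontier(points):
    """Keep points that no other point beats on both axes at once."""
    frontier = []
    for p in points:
        dominated = any(
            q.interactivity >= p.interactivity
            and q.throughput >= p.throughput
            and (q.interactivity > p.interactivity or q.throughput > p.throughput)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.interactivity)


# Illustrative numbers only; they are not benchmark results.
points = [
    OperatingPoint("bulk", 20.0, 4000.0),
    OperatingPoint("middle", 60.0, 2000.0),
    OperatingPoint("interactive", 150.0, 500.0),
    OperatingPoint("misconfigured", 50.0, 1500.0),
]

frontier = pareto_frontier(points)
for p in frontier:
    # Assumed rental price of $2.00 per GPU-hour.
    print(p.name, round(cost_per_million_tokens(2.0, p.throughput), 3))
```

Under these made-up numbers the "misconfigured" point drops out because "middle" is better on both axes, and the remaining points show the pattern the summary describes: the faster each user's stream, the fewer tokens each GPU produces and the more each token costs.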

#llm

#nvidia

#gpu

#ai-inference

Mar 07 • 12m read time • From go.theregister.com

Table of contents

Not all tokens are created equal
Software matters
Disaggregated Compute
Driving the rack-scale transition
An unrelenting rate of change
More levers to pull
A race to the bottom
