Red Hat's team faced a costly challenge when benchmarking LLMs across 7,488 combinations of models, GPUs, and load levels. Over 50% of their 4,506 test runs were invalid because of oversaturation: the server cannot keep up with the incoming request load, so requests queue up and the measured metrics are skewed. The team uses vLLM for inference, GuideLLM for load simulation and measurement, and JBenchmark for orchestration. Traditional load-testing approaches fit LLM benchmarking poorly because requests are streamed, processing times are long, and GPU hardware is expensive. The proposed solution is to detect oversaturation early using LLM-specific metrics such as Time-to-First-Token (TTFT) and Inter-Token Latency (ITL), treating detection as an anomaly detection problem rather than applying static thresholds.
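
To make the contrast with static thresholds concrete, here is a minimal sketch of one possible anomaly-style check. It is not the team's actual detector: the function names, the robust z-score technique (median and median absolute deviation), and the parameters `baseline_n`, `window_n`, and `k` are all illustrative assumptions. It also assumes the run starts unsaturated, so the earliest requests can serve as the run's own baseline.

```python
from statistics import median


def mad(values, center):
    """Median absolute deviation of values around the given center."""
    return median(abs(v - center) for v in values)


def is_oversaturated(ttft_seconds, baseline_n=20, window_n=10, k=5.0):
    """Flag a run whose recent TTFT drifts far above its own early baseline.

    ttft_seconds: per-request Time-to-First-Token values, in completion order.
    baseline_n, window_n, k: hypothetical tuning knobs (assumptions, not
    values taken from the article).
    """
    if len(ttft_seconds) < baseline_n + window_n:
        return False  # not enough data to judge yet

    baseline = ttft_seconds[:baseline_n]
    recent = ttft_seconds[-window_n:]
    base_med = median(baseline)

    # Floor the spread so a perfectly flat baseline does not divide by zero.
    spread = max(mad(baseline, base_med), 1e-3)

    # Robust z-score of the recent window relative to the run's own baseline.
    return (median(recent) - base_med) / spread > k
```

Because the comparison is relative to each run's own baseline, the same check can apply to a small model on a fast GPU and a large model on a slow one, which a single fixed TTFT cutoff cannot do. A steadily growing TTFT, the signature of a growing queue, trips the check, while a stable run does not.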

Table of contents
The problem of oversaturation
Our stack: GuideLLM, vLLM, and JBenchmark
Doesn't oversaturation detection already have a solution?
Oversaturation detection is not trivial
Next steps