IBM Research's Research Inference & Tuning Service (RITS) platform uses vLLM as its core model-serving runtime, supporting over 1,300 active users and 100+ models. The platform leverages vLLM's PagedAttention, continuous batching, and quantization support for GPU efficiency. A hybrid autoscaling model combines serverless 0-to-1 (scale-from-zero) scaling with IBM's Turbonomic Application Resource Management (ARM) product, which drives 1-to-n scaling from vLLM's 'Requests Waiting' metric, a queue-depth signal the team found more effective than simple RPS-based scaling. The platform integrates with Red Hat OpenShift AI and KServe, exposes Prometheus metrics for monitoring, and plans to evolve toward distributed inference with llm-d and IBM Spyre accelerators.
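To make the queue-depth idea concrete, here is a minimal Python sketch of a 1-to-n scaling decision driven by vLLM's `vllm:num_requests_waiting` Prometheus gauge (the 'Requests Waiting' metric above, which vLLM really does export on its `/metrics` endpoint). The endpoint URL, the per-replica target, and the replica bounds are illustrative assumptions, not RITS's actual configuration.

```python
import re
import urllib.request

# Assumed values for illustration; RITS's real autoscaler settings are not public.
METRICS_URL = "http://localhost:8000/metrics"  # vLLM's Prometheus endpoint
WAITING_PER_REPLICA = 4                        # queued requests one replica should absorb


def requests_waiting(url: str = METRICS_URL) -> float:
    """Scrape vLLM's /metrics endpoint and return the
    vllm:num_requests_waiting gauge (requests queued, not yet scheduled)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = resp.read().decode()
    # Metric lines look like: vllm:num_requests_waiting{model_name="..."} 3.0
    match = re.search(
        r"^vllm:num_requests_waiting(?:\{[^}]*\})?\s+([0-9.eE+-]+)",
        body,
        re.MULTILINE,
    )
    return float(match.group(1)) if match else 0.0


def desired_replicas(current: int, waiting: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Scale on queue depth rather than RPS: a deep waiting queue signals
    saturation regardless of how cheap individual requests happen to be."""
    target = current + int(waiting // WAITING_PER_REPLICA)
    return max(min_replicas, min(max_replicas, target))


if __name__ == "__main__":
    waiting = requests_waiting()
    print(f"waiting={waiting:.0f} -> replicas={desired_replicas(current=2, waiting=waiting)}")
```

In production this decision would not live in a standalone script: per the post, an autoscaler (IBM Turbonomic ARM in RITS's case) consumes the same signal through Prometheus and adjusts replica counts on the cluster.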
Table of contents
- Introduction
- The Business Challenge
- How IBM Research Uses vLLM
- Solving AI Challenges with vLLM
- A Word from IBM
- The Benefits of Using vLLM
- Learn More