IBM Research's Research Inference & Tuning Service (RITS) platform uses vLLM as its core model-serving runtime, supporting over 1,300 active users and 100+ models. The platform leverages vLLM's PagedAttention, continuous batching, and quantization support for GPU efficiency. A hybrid autoscaling model combines serverless 0-to-1 scaling with IBM's Turbonomic Application Resource Management (ARM) product, which drives 1-to-n scaling from vLLM's 'Requests Waiting' metric; this proved more effective than simple requests-per-second (RPS) based scaling. The platform integrates with Red Hat OpenShift AI and KServe, exposes Prometheus metrics for monitoring, and plans to evolve toward distributed inference with llm-d and IBM Spyre accelerators.
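
The article does not include code, but as a rough illustration of queue-depth-based 1-to-n scaling, the sketch below polls vLLM's `vllm:num_requests_waiting` Prometheus gauge (the metric behind the 'Requests Waiting' signal) and derives a replica count. The Prometheus URL, model label, target queue depth per replica, and replica bounds are illustrative assumptions, not values from IBM's deployment.

```python
import math

import requests

# --- Illustrative assumptions (not from the article) ---
PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint
# vllm:num_requests_waiting is the gauge vLLM exposes for requests
# queued behind the scheduler; the model_name label is a placeholder.
QUERY = 'sum(vllm:num_requests_waiting{model_name="example-model"})'
TARGET_WAITING_PER_REPLICA = 4  # assumed queue depth each replica should absorb
MIN_REPLICAS, MAX_REPLICAS = 1, 8


def waiting_requests() -> float:
    """Query Prometheus for the current number of waiting requests."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=5
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # An empty vector means no samples yet; treat that as an empty queue.
    return float(result[0]["value"][1]) if result else 0.0


def desired_replicas(waiting: float) -> int:
    """Scale 1-to-n on queue depth rather than request rate."""
    raw = math.ceil(waiting / TARGET_WAITING_PER_REPLICA)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, raw))


if __name__ == "__main__":
    w = waiting_requests()
    print(f"waiting={w:.0f} -> desired replicas={desired_replicas(w)}")
```

The design point this mirrors: because generation time varies widely per request, RPS is a weak proxy for load on an LLM server, while queue depth directly measures unserved demand, which is why scaling on the waiting-requests gauge works better.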

7 min read · Source: pytorch.org
Table of contents
- Introduction
- The Business Challenge
- How IBM Research Uses vLLM
- Solving AI Challenges with vLLM
- A Word from IBM
- The Benefits of Using vLLM
- Learn More
