A step-by-step guide to deploying multiple LLMs behind a single OpenAI-compatible endpoint on OpenShift using a Model-as-a-Service (MaaS) pattern. The architecture combines llm-d, the Gateway API Inference Extension (GAIE), and agentgateway to route inference requests based on the `model` field in the request body. The guide covers installing the CRDs, configuring body-based routing with an AgentgatewayPolicy that uses CEL expressions, deploying InferencePools with intelligent endpoint picking (EPP) based on KV-cache utilization and queue depth, and deploying vLLM model servers via Helm. It also covers exposing the gateway externally on OpenShift and troubleshooting common issues.
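From a client's perspective, the routing is invisible: both requests in the sketch below hit the same endpoint, and only the `model` field in the JSON body determines which backend serves them. The gateway hostname and model names are placeholders, not values from this guide.

```bash
# Two requests to the same OpenAI-compatible endpoint. The gateway's
# body-based router inspects the "model" field in the JSON body and
# forwards each request to the matching InferencePool.
# Hostname and model names below are placeholders for your deployment.
GATEWAY=http://llm-gateway.example.com

curl -s "$GATEWAY/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-3B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'

curl -s "$GATEWAY/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'
```

Because the model name lives in the request body rather than the path or headers, the gateway must buffer and parse the JSON before it can pick a route; that is the job of the body-based routing configuration covered in the guide.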

Table of contents
The components
Understanding the LLM routing traffic flow
Before you begin
Deploying the stack
Verify the deployment
Add a new model
What's next
Alternative gateway providers
Learn more