A step-by-step guide to deploying multiple LLMs behind a single OpenAI-compatible endpoint on OpenShift using a Model-as-a-Service (MaaS) pattern. The architecture combines llm-d, the Gateway API Inference Extension (GAIE), and agentgateway to route inference requests based on the `model` field in the request body. The guide covers installing the CRDs, configuring body-based routing with an AgentgatewayPolicy that uses CEL expressions, deploying InferencePools with intelligent endpoint picking (EPP) based on KV-cache utilization and queue depth, and deploying vLLM model servers via Helm. It also covers exposing the gateway externally on OpenShift and troubleshooting common issues.
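From a client's perspective, the routing is invisible: both requests in the sketch below hit the same endpoint, and only the `model` field in the JSON body determines which backend serves them. The gateway hostname and model names are placeholders, not values from this guide.

```bash
# Two requests to the same OpenAI-compatible endpoint. The gateway's
# body-based router inspects the "model" field in the JSON body and
# forwards each request to the matching InferencePool.
# Hostname and model names below are placeholders for your deployment.
GATEWAY=http://llm-gateway.example.com

curl -s "$GATEWAY/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-3B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'

curl -s "$GATEWAY/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'
```

Because the model name lives in the request body rather than the path or headers, the gateway must buffer and parse the JSON before it can pick a route; that is the job of the body-based routing configuration covered in the guide.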

Table of contents
The components
Understanding the LLM routing traffic flow
Before you begin
Deploying the stack
Verify the deployment
Add a new model
What's next
Alternative gateway providers
Learn more