vLLM Semantic Router (vLLM-SR) v0.1 introduces a Mixture-of-Models (MoM) architecture for intelligent routing across multiple specialized LLMs. Unlike Mixture-of-Experts (MoE), which routes at the token level within a single model, MoM orchestrates independent models at the request level using configurable signals. The live demo on AMD MI300X/MI355X GPUs routes queries across 6 models using 8 signal types (keyword, embedding, domain, language, fact-check, user feedback, preference, latency) and 11 priority-based decision rules. The system matches queries to specialized models—routing math to Qwen3-235B, code to DeepSeek-V3.2, and simple QA to smaller models—while providing safety filtering, reasoning mode control, and semantic caching. Deployment guides for AMD ROCm GPUs are included.
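To make the routing idea concrete, here is a minimal Python sketch of priority-ordered, signal-based model selection. The signal extractors, rule schema, and fallback model name below are illustrative assumptions, not vLLM-SR's actual implementation or configuration format.

```python
from dataclasses import dataclass
from typing import Callable

# NOTE: hypothetical sketch for illustration only; vLLM-SR's real
# signal extractors, rule schema, and APIs may differ.

@dataclass
class Rule:
    priority: int                      # lower value = evaluated first
    condition: Callable[[dict], bool]  # predicate over extracted signals
    model: str                         # model to route the request to

def extract_signals(query: str) -> dict:
    """Toy stand-ins for vLLM-SR's keyword/embedding/domain signals."""
    return {
        "domain": "math" if any(t in query.lower() for t in ("integral", "solve")) else "general",
        "has_code": "```" in query or "def " in query,
    }

RULES = [
    Rule(0, lambda s: s["domain"] == "math", "Qwen3-235B"),
    Rule(1, lambda s: s["has_code"], "DeepSeek-V3.2"),
    Rule(10, lambda s: True, "small-qa-model"),  # hypothetical fallback for simple QA
]

def route(query: str) -> str:
    """Apply decision rules in priority order; first match wins."""
    signals = extract_signals(query)
    for rule in sorted(RULES, key=lambda r: r.priority):
        if rule.condition(signals):
            return rule.model
    raise RuntimeError("no matching rule")

print(route("Solve the integral of x^2"))  # -> Qwen3-235B
```

The key design point this sketch captures is that rules are evaluated in a fixed priority order over request-level signals, with a catch-all rule at the lowest priority so every request resolves to some model.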

From blog.vllm.ai
Table of contents

- Why System Intelligence for LLMs?
- Mixture-of-Models vs Mixture-of-Experts
- The MoM Design Philosophy
- Live Demo on AMD GPUs
- Signal-Based Routing
- How to run it on AMD GPU (MI300X/MI355X)
- What’s Next
- Resources
- Acknowledgements
- Join Us
