vLLM Semantic Router (vLLM-SR) v0.1 introduces a Mixture-of-Models (MoM) architecture for intelligent routing across multiple specialized LLMs. Unlike Mixture-of-Experts (MoE), which routes at the token level within a single model, MoM orchestrates independent models at the request level using configurable signals. The live demo on AMD MI300X/MI355X GPUs routes queries across 6 models using 8 signal types (keyword, embedding, domain, language, fact-check, user feedback, preference, latency) and 11 priority-based decision rules. The system matches queries to specialized models—routing math to Qwen3-235B, code to DeepSeek-V3.2, and simple QA to smaller models—while providing safety filtering, reasoning mode control, and semantic caching. Deployment guides for AMD ROCm GPUs are included.
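To make the routing idea concrete, here is a minimal Python sketch of priority-ordered, signal-based model selection. The signal extractors, rule schema, and fallback model name below are illustrative assumptions, not vLLM-SR's actual implementation or configuration format.

```python
from dataclasses import dataclass
from typing import Callable

# NOTE: hypothetical sketch for illustration only; vLLM-SR's real
# signal extractors, rule schema, and APIs may differ.

@dataclass
class Rule:
    priority: int                      # lower value = evaluated first
    condition: Callable[[dict], bool]  # predicate over extracted signals
    model: str                         # model to route the request to

def extract_signals(query: str) -> dict:
    """Toy stand-ins for vLLM-SR's keyword/embedding/domain signals."""
    return {
        "domain": "math" if any(t in query.lower() for t in ("integral", "solve")) else "general",
        "has_code": "```" in query or "def " in query,
    }

RULES = [
    Rule(0, lambda s: s["domain"] == "math", "Qwen3-235B"),
    Rule(1, lambda s: s["has_code"], "DeepSeek-V3.2"),
    Rule(10, lambda s: True, "small-qa-model"),  # hypothetical fallback for simple QA
]

def route(query: str) -> str:
    """Apply decision rules in priority order; first match wins."""
    signals = extract_signals(query)
    for rule in sorted(RULES, key=lambda r: r.priority):
        if rule.condition(signals):
            return rule.model
    raise RuntimeError("no matching rule")

print(route("Solve the integral of x^2"))  # -> Qwen3-235B
```

The key design point this sketch captures is that rules are evaluated in a fixed priority order over request-level signals, with a catch-all rule at the lowest priority so every request resolves to some model.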

From blog.vllm.ai
Table of contents

- Why System Intelligence for LLMs?
- Mixture-of-Models vs Mixture-of-Experts
- The MoM Design Philosophy
- Live Demo on AMD GPUs
- Signal-Based Routing
- How to run it on AMD GPU (MI300X/MI355X)
- What’s Next
- Resources
- Acknowledgements
- Join Us
