vLLM Semantic Router (Athena 0.2 release) is an open source LLM request router that sits between clients and model backends, routing each request to either a local model or a cloud model based on its complexity. The tutorial walks through setting up the router locally with a quantized Qwen3-Coder-Next 80B model on Apple …
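The routing idea described above can be sketched in a few lines. This is an illustrative toy, not the vLLM Semantic Router implementation: the backend URLs and the `complexity_score` heuristic are assumptions made up for the example; the real router uses semantic classification over configured signals and decisions.

```python
# Illustrative sketch of complexity-based routing (NOT the actual
# vLLM Semantic Router logic). Both endpoints are hypothetical
# OpenAI-compatible servers.
LOCAL_BACKEND = "http://localhost:8000/v1"    # e.g. a quantized local model
CLOUD_BACKEND = "https://api.example.com/v1"  # e.g. a hosted frontier model


def complexity_score(prompt: str) -> float:
    """Toy heuristic: longer prompts and code-like markers score higher."""
    score = len(prompt.split()) / 100.0
    for marker in ("def ", "class ", "SELECT ", "```"):
        if marker in prompt:
            score += 0.5
    return score


def route(prompt: str, threshold: float = 0.5) -> str:
    """Send simple requests to the local model, complex ones to the cloud."""
    return CLOUD_BACKEND if complexity_score(prompt) > threshold else LOCAL_BACKEND
```

A request like "What time is it?" scores low and stays local, while a prompt containing code markers crosses the threshold and is sent to the cloud backend.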

13m read time · From developers.redhat.com
Table of contents

- What is vLLM Semantic Router?
- Prerequisites
- Step 1: Set up your local model
- Step 2: Get your cloud API key
- Step 3: Write your router config
- Step 4: Define signals and decisions
- Step 5: Initialize and launch
- Step 6: Test your routes
- What we saw: Benchmarks
- Connecting OpenClaw
- Where to go from here
- Wrapping up
