DigitalOcean's Inference Router automatically routes LLM requests to the best-fit model based on task type, cost, and latency, eliminating the need for hardcoded routing logic in application code. Built on Plano, an open-source AI-native proxy, it uses purpose-built small language models (Arch-Router 1.5B and Plano-Orchestrator, up to 30B MoE) to classify intent from conversation context in ~200ms. The ranking engine uses live cost and latency data from DigitalOcean's pricing API and Prometheus to order candidate models dynamically. The architecture layers Envoy for connection handling, a Rust-based WASM filter for provider format translation, and a native Rust binary (Brightstaff) for routing logic. Key lessons: purpose-built routing models outperform frontier models on narrow tasks, the quality of task descriptions is critical for routing accuracy, and provider latency varies 2-3x throughout the day, which is why the ranking relies on live metrics rather than static numbers. The router is available as a managed service on DigitalOcean or self-hosted via the open-source Plano project.
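To make the ranking step in the summary concrete, here is a minimal sketch of ordering candidate models by cost and latency. It is an illustration under stated assumptions, not the router's actual formula: the `Candidate` type, the `rank_candidates` function, the `cost_weight` parameter, the min-max normalization, and all metric values are hypothetical. In the real system the numbers would come from the pricing API and Prometheus rather than being hardcoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate model with hypothetical live metrics."""
    name: str
    cost_per_mtok: float   # USD per million tokens (illustrative value)
    p50_latency_ms: float  # recent median latency (illustrative value)


def rank_candidates(candidates, cost_weight=0.5):
    """Order candidates by a weighted sum of normalized cost and latency.

    Lower score is better. The weighting scheme is an assumption made for
    this sketch, not the router's documented behavior.
    """
    if not candidates:
        return []
    if not 0.0 <= cost_weight <= 1.0:
        raise ValueError("cost_weight must be between 0 and 1")

    def normalize(values):
        # Min-max scale to [0, 1]; a constant column maps to all zeros.
        lo, hi = min(values), max(values)
        span = hi - lo
        return [0.0 if span == 0 else (v - lo) / span for v in values]

    costs = normalize([c.cost_per_mtok for c in candidates])
    latencies = normalize([c.p50_latency_ms for c in candidates])
    scored = [
        (cost_weight * cost + (1.0 - cost_weight) * latency, candidate)
        for cost, latency, candidate in zip(costs, latencies, candidates)
    ]
    # Sort on the score only, so Candidate objects are never compared.
    return [candidate for _, candidate in sorted(scored, key=lambda pair: pair[0])]


candidates = [
    Candidate("model-a", cost_per_mtok=0.20, p50_latency_ms=900.0),
    Candidate("model-b", cost_per_mtok=1.00, p50_latency_ms=300.0),
    Candidate("model-c", cost_per_mtok=0.50, p50_latency_ms=450.0),
]
cheap_first = [c.name for c in rank_candidates(candidates, cost_weight=0.9)]
fast_first = [c.name for c in rank_candidates(candidates, cost_weight=0.1)]
```

Shifting `cost_weight` toward 1 favors the cheapest model and shifting it toward 0 favors the fastest one. Re-running the ranking whenever fresh metrics arrive is what would let the order track latency that changes over the course of the day.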
Table of contents
DigitalOcean’s Inference Router
How It Works: Plano Under the Hood
The Routing Model
The Ranking Engine: Live Cost and Latency Data
Under the Hood: Envoy, WASM, and Async Rust
Getting Started
What We Learned
What We’re Exploring Next