The llama.cpp server now supports router mode, which lets it load, unload, and switch between multiple LLM models without restarting. It auto-discovers GGUF models from the cache or from custom directories, loads them on demand, and uses LRU (least recently used) eviction once the limit on concurrently loaded models is reached (default: 4). Each model runs in its own process for isolation. The server exposes OpenAI-compatible HTTP endpoints for chat completions and for model management.
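
The LRU eviction described above can be illustrated with a small sketch. This is not llama.cpp's implementation: the `ModelPool` class, its methods, and the placeholder "handles" are hypothetical names invented for illustration, and the real server manages separate processes rather than strings. The sketch only shows the policy itself: when the pool is full, the model that was used longest ago is dropped to make room.

```python
from collections import OrderedDict


class ModelPool:
    """Toy LRU pool (hypothetical); a 'loaded model' is just a placeholder string."""

    def __init__(self, max_models=4):
        self.max_models = max_models  # mirrors the default limit of 4 from the text
        self.loaded = OrderedDict()   # least recently used entry comes first
        self.evicted = []             # eviction history, kept only for inspection

    def get(self, name):
        """Return a handle for `name`, loading it on demand and evicting if full."""
        if name in self.loaded:
            # Already loaded: mark it as the most recently used.
            self.loaded.move_to_end(name)
            return self.loaded[name]
        if len(self.loaded) >= self.max_models:
            # Pool is full: drop the least recently used model.
            victim, _ = self.loaded.popitem(last=False)
            self.evicted.append(victim)
        self.loaded[name] = f"<handle for {name}>"
        return self.loaded[name]


pool = ModelPool(max_models=2)
pool.get("a.gguf")
pool.get("b.gguf")
pool.get("a.gguf")  # "a.gguf" becomes the most recently used
pool.get("c.gguf")  # pool is full, so "b.gguf" is evicted
print(list(pool.loaded))  # ['a.gguf', 'c.gguf']
print(pool.evicted)       # ['b.gguf']
```

An `OrderedDict` keeps the bookkeeping short: `move_to_end` records a use, and `popitem(last=False)` removes the oldest entry.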

3 min read · From huggingface.co