The llama.cpp server now supports router mode, which lets it load, unload, and switch between multiple models dynamically without a restart. It auto-discovers GGUF models from the cache or custom directories, loads them on demand, and evicts the least recently used (LRU) model when the concurrent-model limit (default: 4) is reached. Each model runs in its own process for isolation. The server exposes OpenAI-compatible HTTP endpoints for chat completions and model management.
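The on-demand loading and LRU eviction described above can be illustrated with a small sketch. This is not llama.cpp's implementation: the `ModelRouter` class, its `get` method, and the string "handles" are hypothetical stand-ins for whatever the server does when it spawns or stops a per-model process. Only the eviction policy and the default limit of 4 come from the description.

```python
from collections import OrderedDict


class ModelRouter:
    """Toy LRU registry (hypothetical): keeps at most `max_models` loaded."""

    def __init__(self, max_models=4):
        self.max_models = max_models
        self.loaded = OrderedDict()  # model name -> placeholder handle
        self.evicted = []            # names evicted so far, oldest first

    def get(self, name):
        # Already loaded: mark it as most recently used and return it.
        if name in self.loaded:
            self.loaded.move_to_end(name)
            return self.loaded[name]
        # At the limit: evict the least recently used model first.
        if len(self.loaded) >= self.max_models:
            victim, _ = self.loaded.popitem(last=False)
            self.evicted.append(victim)
        # "Load" the model on demand (a real server would start a process).
        self.loaded[name] = f"handle:{name}"
        return self.loaded[name]


router = ModelRouter(max_models=2)
router.get("a.gguf")
router.get("b.gguf")
router.get("a.gguf")  # "a.gguf" becomes the most recently used entry
router.get("c.gguf")  # limit reached, so "b.gguf" is evicted
```

With a limit of 2, requesting a third model evicts whichever of the first two was used least recently, which is why re-requesting `a.gguf` before `c.gguf` keeps it loaded.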