A systematic guide to running multiple LLMs simultaneously on a single GPU. Covers VRAM calculation formulas (weights + KV cache + overhead), the impact of quantization on memory footprint, and three orchestration strategies: concurrent loading with VRAM budgeting, hot-swapping via LRU eviction, and Docker container isolation. Includes Python VRAM estimation scripts, Ollama and llama.cpp configuration examples, a Docker Compose multi-model setup, GPU sharing options (MIG, MPS, time-slicing), and a decision framework that maps GPU VRAM size and model count to the appropriate strategy. Also addresses silent CPU fallback, KV cache OOM, and CUDA memory fragmentation.
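
The "weights + KV cache + overhead" formula mentioned above can be sketched roughly as follows. This is a minimal illustration, not the article's own script: the names `ModelSpec`, `weights_bytes`, `kv_cache_bytes`, `estimate_vram_bytes`, and `fits_budget` are hypothetical, the 0.5 GiB per-model overhead is an assumed placeholder, and real quantization formats usually cost somewhat more than their nominal bits per weight.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelSpec:
    """Hypothetical description of one model to be loaded."""
    name: str
    n_params: int            # total parameter count
    bits_per_weight: float   # e.g. 16 for fp16, about 4 for 4-bit quantization
    n_layers: int
    n_kv_heads: int
    head_dim: int
    context_len: int         # tokens of KV cache to reserve
    kv_bytes_per_elem: int = 2  # fp16 KV cache (assumption)


def weights_bytes(n_params: int, bits_per_weight: float) -> float:
    """Memory for the weights alone: parameters times bits, converted to bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache for one sequence: a K and a V tensor per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def estimate_vram_bytes(spec: ModelSpec, overhead_bytes: float = 0.5 * GIB) -> float:
    """weights + KV cache + overhead; the overhead default is an assumed value."""
    return (
        weights_bytes(spec.n_params, spec.bits_per_weight)
        + kv_cache_bytes(spec.n_layers, spec.n_kv_heads, spec.head_dim,
                         spec.context_len, spec.kv_bytes_per_elem)
        + overhead_bytes
    )


def fits_budget(specs: list[ModelSpec], budget_bytes: float) -> bool:
    """True if all models are estimated to fit in the budget at the same time."""
    return sum(estimate_vram_bytes(s) for s in specs) <= budget_bytes


if __name__ == "__main__":
    # Illustrative 7B-class shape; the numbers are assumptions, not measurements.
    model = ModelSpec("example-7b-q4", 7_000_000_000, 4, 32, 32, 128, 4096)
    print(f"{model.name}: {estimate_vram_bytes(model) / GIB:.2f} GiB estimated")
    print("two copies fit in 24 GiB:", fits_budget([model, model], 24 * GIB))
```

An estimate like this is only a starting point for budgeting; actual usage should be checked against what the runtime reports once the models are loaded.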

21m read time · From sitepoint.com
Table of contents
How to Run Multiple LLM Models on One GPU
Why Running Multiple Local Models Is Hard
Understanding LLM Memory Anatomy
Strategy 1: Concurrent Loading with VRAM Budgeting
Strategy 2: Hot-Swapping Models on Demand
Strategy 3: Containerized Model Isolation with Docker
Advanced Techniques: Squeezing More from Limited VRAM
Monitoring and Debugging Memory Issues
Putting It All Together: Decision Framework
Key Takeaways
