Docker Model Runner now integrates the vLLM inference engine and supports safetensors models, enabling high-throughput AI inference on NVIDIA GPUs. The integration automatically routes requests between llama.cpp and vLLM based on model format (GGUF vs. safetensors), allowing developers to prototype locally and scale to production.
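From the client side, that routing is transparent: Model Runner exposes an OpenAI-compatible API, so the same request works whether the model is served by llama.cpp or vLLM. The sketch below illustrates this under some assumptions: the host-side endpoint (http://localhost:12434/engines/v1) must be enabled in your setup, and the model name shown is illustrative, not a guaranteed Docker Hub tag.

```python
# Minimal sketch of querying Docker Model Runner's OpenAI-compatible API.
# The endpoint, port, and model name are assumptions about a typical local
# setup; adjust them to match your own configuration.
import requests

BASE_URL = "http://localhost:12434/engines/v1"  # assumed host-side TCP endpoint


def chat(model: str, prompt: str) -> str:
    """Send one chat-completion request. Model Runner picks the backend
    (llama.cpp for GGUF, vLLM for safetensors) from the model's format,
    so this client code never changes."""
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # "ai/smollm2" is a placeholder model reference; swapping a GGUF tag
    # for a safetensors tag changes the serving engine, not this code.
    print(chat("ai/smollm2", "Say hello in one sentence."))
```

The point of the sketch is the design choice it reflects: because both engines sit behind one OpenAI-compatible surface, switching model formats is a deployment decision rather than a client rewrite.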

Table of contents
Expanding Docker Model Runner’s Capabilities
Why vLLM?
How vLLM Works
Why Multiple Inference Engines?
Safetensors (vLLM) vs. GGUF (llama.cpp): Choosing the Right Format
vLLM-compatible models on Docker Hub
Available Now: x86_64 with Nvidia
What’s Next?
