Docker Model Runner now supports vllm-metal, a new backend enabling vLLM inference on macOS with Apple Silicon's Metal GPU. Developed collaboratively by Docker and the vLLM project, vllm-metal unifies MLX and PyTorch under a single compute pathway, leveraging Apple Silicon's unified memory for zero-copy tensor operations.
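As a rough illustration of the zero-copy idea (not vllm-metal's actual internals), the sketch below shows an MLX array and a PyTorch tensor sharing a single host-visible buffer in Apple Silicon's unified memory. It assumes mlx, numpy, and torch are installed and that the installed versions support copy-free buffer export.

```python
# Minimal sketch of the zero-copy idea on Apple Silicon unified memory.
# This is NOT vllm-metal's internal code path; it only illustrates how
# MLX and PyTorch can view the same buffer without copying.
import mlx.core as mx
import numpy as np
import torch

a = mx.arange(4.0)    # MLX array allocated in unified memory
mx.eval(a)            # force MLX's lazy evaluation before exporting the buffer

# NumPy view over the MLX buffer; copy=False asserts no copy is made
# (assumes the installed MLX/NumPy versions allow copy-free export).
np_view = np.array(a, copy=False)

# torch.from_numpy shares memory with the NumPy array, so all three
# objects reference the same underlying unified-memory buffer.
t = torch.from_numpy(np_view)

print(t)  # tensor([0., 1., 2., 3.])
```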

Table of contents
- What is vllm-metal?
- How vllm-metal works
- Which models work with vllm-metal?
- vLLM everywhere with Docker Model Runner
- Get started
- Giving Back: vllm-metal is Now Open Source
- How does vllm-metal compare to llama.cpp?
