IBM Research, Red Hat, and Google Cloud donated llm-d to the CNCF as a sandbox project at KubeCon Europe 2026. llm-d is an open-source, Kubernetes-native framework for running LLM inference as a distributed, production-grade workload. It introduces prefill/decode disaggregation, KV-cache-aware routing, and a traffic- and hardware-aware autoscaler, which together allow the inference phases to scale independently across heterogeneous accelerators (NVIDIA, AMD, Intel, Google TPUs). Early Google Cloud testing showed 2x improvements in time-to-first-token. Built on vLLM and integrating with the Kubernetes Gateway API Inference Extension and LeaderWorkerSet, llm-d aims to make distributed LLM inference a first-class cloud-native workload. Founding collaborators include NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI. Future work targets multi-modal workloads, multi-LoRA optimization, and deeper vLLM integration.

5m read time · From thenewstack.io