IBM, Red Hat, and Google just donated a Kubernetes blueprint for LLM inference to the CNCF

IBM Research, Red Hat, and Google Cloud donated llm-d to the CNCF as a sandbox project at KubeCon Europe 2026. llm-d is an open-source, Kubernetes-native framework for running LLM inference as a distributed, production-grade workload. It introduces prefill/decode disaggregation, KV-cache-aware routing, and a traffic- and hardware-aware autoscaler, which together let the inference phases scale independently across heterogeneous accelerators (NVIDIA, AMD, Intel, and Google TPUs). Early Google Cloud testing showed a 2x improvement in time-to-first-token. Built on vLLM and integrating with the Kubernetes Gateway API Inference Extension and LeaderWorkerSet, llm-d aims to make distributed LLM inference a first-class cloud-native workload. Founding collaborators include NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI. Future work targets multi-modal workloads, multi-LoRA optimization, and deeper vLLM integration.
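To give a rough feel for what KV-cache-aware routing means, here is a minimal Python sketch. It is not llm-d's implementation: the `Replica` class, the `pick_replica` function, the `BLOCK_SIZE` value, and the scoring rule (cached prefix blocks minus a load penalty) are all hypothetical, chosen only to illustrate the general idea of sending a request to the replica that already holds the longest matching prompt prefix in its cache.

```python
from dataclasses import dataclass, field
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per cache block (hypothetical; kept tiny for illustration)


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block of tokens, chained with the previous block's hash,
    so that a hash identifies the whole prefix up to and including that block."""
    hashes, prev = [], ""
    full_len = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
    for i in range(0, full_len, BLOCK_SIZE):
        chunk = ",".join(map(str, tokens[i:i + BLOCK_SIZE]))
        prev = sha256((prev + "|" + chunk).encode()).hexdigest()
        hashes.append(prev)
    return hashes


@dataclass
class Replica:
    """Hypothetical model-server replica with a set of cached prefix blocks."""
    name: str
    queue_depth: int = 0
    cached: set[str] = field(default_factory=set)

    def prefix_hits(self, hashes: list[str]) -> int:
        """Count leading blocks of the request that are already cached here."""
        hits = 0
        for h in hashes:
            if h not in self.cached:
                break
            hits += 1
        return hits


def pick_replica(replicas: list[Replica], tokens: list[int],
                 load_weight: float = 0.5) -> Replica:
    """Pick the replica with the best (cache hits - load penalty) score.
    The scoring rule is an assumption made for this sketch."""
    hashes = block_hashes(tokens)
    best = max(replicas,
               key=lambda r: r.prefix_hits(hashes) - load_weight * r.queue_depth)
    best.cached.update(hashes)  # the chosen replica now holds these blocks
    best.queue_depth += 1       # and has one more request in flight
    return best
```

In this sketch, a follow-up request that shares a long prompt prefix with an earlier one tends to land on the same replica, which can skip recomputing the shared prefix, while an unrelated request drifts toward a less loaded replica.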

#kubernetes #distributed-systems #cncf #vllm

Mar 24•5m read time•From thenewstack.io
