llm-d has been accepted as a CNCF Sandbox project, bringing Kubernetes-native distributed LLM inference infrastructure to the cloud native ecosystem. Originally launched in May 2025 by Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA, llm-d targets the gap between high-level control planes like KServe and low-level inference engines like vLLM. Key capabilities include inference-aware traffic routing via the Kubernetes Gateway API Inference Extension, prefill/decode disaggregation for independent scaling, hierarchical KV cache offloading, and hardware-agnostic serving across NVIDIA, AMD, and Google accelerators. Benchmarks from the v0.5 release show near-zero time-to-first-token (TTFT) and roughly 120k tokens/s of throughput on Qwen3-32B, whereas a baseline Kubernetes Service degrades rapidly. The project aims to establish open, reproducible inference benchmarks as a neutral industry standard.
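To make "inference-aware traffic routing" concrete, the sketch below shows how a pool of model-server pods might be exposed through the Gateway API Inference Extension, letting the gateway pick endpoints using inference signals rather than plain round-robin. The InferencePool and HTTPRoute kinds come from that extension, but the names (qwen3-32b-pool, qwen3-32b-epp, inference-gateway), API versions, and exact field names here are assumptions for illustration, not copied from the llm-d documentation.

```yaml
# Illustrative only: route traffic to an InferencePool so the gateway can make
# endpoint-picking decisions per request (e.g. KV cache locality, queue depth).
# API versions and field names are approximate and may differ between releases.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen3-32b-pool              # hypothetical pool of vLLM serving pods
spec:
  selector:
    app: vllm-qwen3-32b             # label on the model-server pods (assumed)
  targetPortNumber: 8000            # port the model servers listen on
  extensionRef:
    name: qwen3-32b-epp             # endpoint-picker service (assumed name)
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen3-32b-route
spec:
  parentRefs:
    - name: inference-gateway       # an existing Gateway in the cluster (assumed)
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1              # OpenAI-compatible inference endpoints
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen3-32b-pool
```

The key difference from a vanilla Kubernetes Service backend is that the HTTPRoute points at the InferencePool, so per-request scheduling is delegated to an inference-aware endpoint picker instead of kube-proxy load balancing.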
Table of contents
- What llm-d brings to the CNCF landscape
- SOTA inference performance on any accelerator
- Bridging cloud native and AI native ecosystems
- Get involved: Follow the “well-lit paths”