llm-d has been accepted as a CNCF Sandbox project, bringing Kubernetes-native distributed LLM inference infrastructure to the cloud native ecosystem. Originally launched in May 2025 by Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA, llm-d targets the gap between high-level control planes like KServe and low-level inference engines like vLLM. Key capabilities include inference-aware traffic routing via the Kubernetes Gateway API Inference Extension, prefill/decode disaggregation for independent scaling, hierarchical KV cache offloading, and hardware-agnostic serving across NVIDIA, AMD, and Google accelerators. Benchmarks from the v0.5 release show near-zero time-to-first-token (TTFT) and roughly 120k tokens/s of throughput on Qwen3-32B, whereas a baseline Kubernetes Service degrades rapidly. The project aims to establish open, reproducible inference benchmarks as a neutral industry standard.
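To make "inference-aware traffic routing" concrete, the sketch below shows how a pool of model-server pods might be exposed through the Gateway API Inference Extension, letting the gateway pick endpoints using inference signals rather than plain round-robin. The InferencePool and HTTPRoute kinds come from that extension, but the names (qwen3-32b-pool, qwen3-32b-epp, inference-gateway), API versions, and exact field names here are assumptions for illustration, not copied from the llm-d documentation.

```yaml
# Illustrative only: route traffic to an InferencePool so the gateway can make
# endpoint-picking decisions per request (e.g. KV cache locality, queue depth).
# API versions and field names are approximate and may differ between releases.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen3-32b-pool              # hypothetical pool of vLLM serving pods
spec:
  selector:
    app: vllm-qwen3-32b             # label on the model-server pods (assumed)
  targetPortNumber: 8000            # port the model servers listen on
  extensionRef:
    name: qwen3-32b-epp             # endpoint-picker service (assumed name)
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen3-32b-route
spec:
  parentRefs:
    - name: inference-gateway       # an existing Gateway in the cluster (assumed)
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1              # OpenAI-compatible inference endpoints
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen3-32b-pool
```

The key difference from a vanilla Kubernetes Service backend is that the HTTPRoute points at the InferencePool, so per-request scheduling is delegated to an inference-aware endpoint picker instead of kube-proxy load balancing.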
Table of contents
- What llm-d brings to the CNCF landscape
- SOTA inference performance on any accelerator
- Bridging cloud native and AI native ecosystems
- Get involved: Follow the “well-lit paths”