KServe and llm-d can be combined to build a production-grade generative AI inference platform on Kubernetes. KServe handles model lifecycle, autoscaling, and operational governance via its new LLMInferenceService (v0.16), while llm-d adds KV-cache-aware routing, disaggregated prefill/decode scheduling, and intelligent cross-pod orchestration. This separation of concerns lets the two layers be composed and evolve independently. Benchmark results show up to a 57x improvement in P90 Time to First Token, double the token throughput, and a roughly 50% reduction in tail latency compared to naive multi-replica deployments with random request routing.

Table of contents
KServe: Simplifying AI model deployment on Kubernetes
LLMInferenceService in KServe
When KServe alone is not enough: The engineer's reality
Integrating KServe and llm-d: Why separation wins
Conclusion