In this NDC London 2026 conference talk, two DigitalOcean solutions architects explain why LLM inference workloads cannot be treated like traditional REST API requests. The talk covers the shift from model training to inference and the challenges of LLM requests whose payload sizes vary widely, then introduces llm-d, an open-source Kubernetes-native inference framework. Key concepts include prefill/decode disaggregation, intelligent request scheduling, session affinity, and KV-cache-aware routing. A live demo deploys a full GPU Kubernetes cluster on DigitalOcean with llm-d plus Prometheus and Grafana monitoring in under 15 minutes.

42m watch time
