This talk was recorded at NDC London in London, England. #ndclondon #ndcconferences #developer #softwaredeveloper

Attend the next NDC conference near you: 
https://ndcconferences.com
https://ndclondon.com/

Subscribe to our YouTube channel and learn every day:
@NDC

Follow our Social Media!

https://www.facebook.com/ndcconferences
https://twitter.com/NDC_Conferences
https://www.instagram.com/ndc_conferences/

#ai #architecture #cloud 

Why do PoCs run smoothly while launch day implodes?

Because LLM traffic is a streaming, state-heavy beast that breaks every REST assumption: requests aren’t stateless, payloads snowball with context, and GPU memory melts under token floods. We’ll map the three checkpoints where most projects stall (context explosion, batch backfires, and cache chaos) and show how llm-d’s open-source sharding plus a hybrid NVIDIA/AMD node pool turns each choke point into a green light. You’ll see live before-and-after dashboards, get a YAML ladder you can drop into any cluster, and learn a back-of-the-napkin formula to keep cost per 1,000 tokens under control.
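
The talk's own formula is not reproduced in this description. As a rough sketch of what such a back-of-the-napkin estimate usually looks like, the snippet below divides a GPU's hourly price by the tokens it can serve in an hour. The function name, its parameters, and the example numbers are illustrative assumptions, not figures from the talk.

```python
def cost_per_1k_tokens(gpu_hourly_usd: float,
                       tokens_per_second: float,
                       utilization: float = 1.0) -> float:
    """Estimate the USD cost of serving 1,000 tokens on one GPU.

    Hypothetical helper (not from the talk):
      gpu_hourly_usd    - what one GPU costs per hour
      tokens_per_second - sustained throughput of that GPU
      utilization       - fraction of the hour the GPU does useful work (0..1]
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens actually served in one hour of wall-clock time.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1000


if __name__ == "__main__":
    # Made-up inputs purely for illustration.
    print(cost_per_1k_tokens(gpu_hourly_usd=3.60, tokens_per_second=1000))
```

Halving utilization doubles the estimate, which is why idle GPU time tends to dominate the bill more than the hourly price itself.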

NDC Conferences

A conference talk from NDC London 2026 by two DigitalOcean solutions architects explaining why LLM inference workloads cannot be treated like traditional REST API requests. The talk covers the shift from model training to inference and the unique challenges of variable-payload LLM requests, and it introduces llm-d, an open-source Kubernetes-native inference framework. Key concepts include prefill/decode disaggregation, intelligent request scheduling, session affinity, and KV cache routing. A live demo shows deploying a full GPU Kubernetes cluster with llm-d, Prometheus, and Grafana monitoring on DigitalOcean in under 15 minutes.
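
To make KV cache routing and session affinity concrete, here is a minimal toy sketch. It sends a request to the replica whose cached token prefix overlaps most with the incoming prompt, and falls back to a stable hash of the session ID when nothing is cached. This is not the actual scheduler of the framework presented in the talk; `pick_replica`, `shared_prefix_len`, and the data shapes are hypothetical names chosen for illustration.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(prompt_tokens: Sequence[int],
                 replica_caches: Dict[str, List[Tuple[int, ...]]],
                 session_id: str) -> str:
    """Choose a replica name for one request (toy model, assumed shapes).

    replica_caches maps a replica name to the token prefixes it is
    assumed to hold in its KV cache.
    """
    if not replica_caches:
        raise ValueError("no replicas available")

    best_name, best_len = None, 0
    # Sorted iteration keeps tie-breaking deterministic.
    for name in sorted(replica_caches):
        for prefix in replica_caches[name]:
            overlap = shared_prefix_len(prompt_tokens, prefix)
            if overlap > best_len:
                best_name, best_len = name, overlap
    if best_name is not None:
        return best_name  # cache hit: reuse the already-computed prefix

    # No overlap anywhere: fall back to session affinity via a stable hash,
    # so the same session keeps landing on the same replica.
    names = sorted(replica_caches)
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return names[int.from_bytes(digest[:8], "big") % len(names)]
```

A plain round-robin load balancer would ignore both signals, forcing replicas to recompute the prefill for context that another replica already holds.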

Stop Treating LLMs Like REST APIs - Jeff Fran & Jack Pearce - NDC London 2026