llm-d is an open source, cloud-native system for high-scale LLM deployments that tackles the latency and memory-bandwidth bottlenecks of inference by disaggregating the workload across dedicated hardware. It splits serving into four Kubernetes-managed components: an inference scheduler (an adaptive load balancer driven by Prometheus metrics), a KV cache manager, prefill workers for the compute-intensive prompt-processing phase, and decode workers for token-by-token generation. The post explains why organizations with data sovereignty requirements choose to self-host LLMs, highlights the availability of competitive open-weight models, and shares experimental Juju charms the author built to simplify llm-d deployments on Ubuntu without requiring deep Kubernetes expertise.
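
To make the prefill/decode split concrete, here is a minimal Python sketch of the idea: two worker pools for the two phases, plus a least-loaded scheduler in front. Every name in it (`KVCache`, `PrefillWorker`, `DecodeWorker`, `Scheduler`) is hypothetical and bears no relation to llm-d's real interfaces; in llm-d the load signal would come from Prometheus metrics rather than an in-process counter.

```python
# Toy model of disaggregated LLM inference. Illustrative only: all names
# here are hypothetical stand-ins, not llm-d's actual API.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    # Stand-in for the attention key/value tensors produced during prefill.
    prompt: str
    entries: list[str] = field(default_factory=list)


class PrefillWorker:
    # Compute-bound phase: process the full prompt once, emit a KV cache.
    def run(self, prompt: str) -> KVCache:
        return KVCache(prompt, [f"kv({tok})" for tok in prompt.split()])


class DecodeWorker:
    # Memory-bandwidth-bound phase: generate one token per step,
    # re-reading the growing KV cache each time.
    def run(self, cache: KVCache, max_tokens: int) -> list[str]:
        out = []
        for step in range(max_tokens):
            token = f"tok{step}"  # placeholder for real sampling
            cache.entries.append(f"kv({token})")
            out.append(token)
        return out


class Scheduler:
    # Stand-in for the inference scheduler: route each phase to the
    # least-loaded worker in its pool (load here is just a request count).
    def __init__(self, prefill_pool, decode_pool):
        self.prefill = {w: 0 for w in prefill_pool}
        self.decode = {w: 0 for w in decode_pool}

    def pick(self, pool):
        worker = min(pool, key=pool.get)
        pool[worker] += 1
        return worker

    def serve(self, prompt: str, max_tokens: int = 4) -> list[str]:
        cache = self.pick(self.prefill).run(prompt)            # phase 1: prefill
        return self.pick(self.decode).run(cache, max_tokens)   # phase 2: decode


if __name__ == "__main__":
    sched = Scheduler([PrefillWorker() for _ in range(2)],
                      [DecodeWorker() for _ in range(4)])
    print(sched.serve("why split prefill and decode?"))
```

The point of the split is that the two phases stress different resources: prefill saturates compute, while decode is bound by memory bandwidth as it re-reads the KV cache on every generated token, so running the pools separately lets each scale on hardware sized for its own bottleneck.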