Microsoft and NVIDIA have released Part 2 of their collaboration on running NVIDIA Dynamo for large language model inference on Azure Kubernetes Service (AKS). The first announcement aimed for a raw t

InfoQ

Microsoft and NVIDIA released Part 2 of their NVIDIA Dynamo collaboration, introducing automated resource planning and dynamic scaling for LLM inference on Azure Kubernetes Service (AKS). The release features two key components: the Dynamo Planner Profiler, which automates the configuration search that determines GPU allocation for prefill and decode operations, and the SLO-based Dynamo Planner, which handles runtime orchestration by monitoring cluster state and scaling workers to meet latency targets. Together they address the rate matching challenge in disaggregated serving, where inference workloads are split across different GPU pools whose capacities must stay in balance. The aim is to reduce manual configuration time while maintaining service level objectives (SLOs) as traffic fluctuates.
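To make the rate matching idea concrete, here is a minimal Python sketch of how worker counts for the two GPU pools might be derived from observed traffic. It is not the Dynamo Planner's actual algorithm or API. The names `WorkerProfile`, `Traffic`, and `rate_matched_workers` are hypothetical, and the per-worker throughput figures are assumed to come from some earlier profiling step and to represent the highest rate at which a worker still meets its latency target.

```python
import math
from dataclasses import dataclass


@dataclass
class WorkerProfile:
    """Hypothetical per-worker capacity, assumed to come from offline profiling."""
    prefill_tokens_per_s: float  # prompt tokens one prefill worker handles per second
    decode_tokens_per_s: float   # output tokens one decode worker generates per second


@dataclass
class Traffic:
    """Hypothetical snapshot of observed load."""
    requests_per_s: float
    avg_input_tokens: float
    avg_output_tokens: float


def rate_matched_workers(profile: WorkerProfile, traffic: Traffic) -> tuple[int, int]:
    """Return (prefill_workers, decode_workers) sized so neither pool is the bottleneck."""
    # Token demand that each pool has to absorb per second.
    prefill_demand = traffic.requests_per_s * traffic.avg_input_tokens
    decode_demand = traffic.requests_per_s * traffic.avg_output_tokens

    # Round up so capacity covers demand; keep at least one worker per pool.
    prefill_workers = max(1, math.ceil(prefill_demand / profile.prefill_tokens_per_s))
    decode_workers = max(1, math.ceil(decode_demand / profile.decode_tokens_per_s))
    return prefill_workers, decode_workers


# Illustrative numbers only, not measurements from the article.
profile = WorkerProfile(prefill_tokens_per_s=20_000, decode_tokens_per_s=2_000)
steady = Traffic(requests_per_s=10, avg_input_tokens=4_000, avg_output_tokens=500)
spike = Traffic(requests_per_s=30, avg_input_tokens=4_000, avg_output_tokens=500)

steady_plan = rate_matched_workers(profile, steady)
spike_plan = rate_matched_workers(profile, spike)
print(steady_plan, spike_plan)
```

In this sketch a traffic spike changes both counts at once, which is the point of rate matching: scaling only one pool would leave the other as the bottleneck. A runtime planner of the kind described above would presumably rerun a calculation like this as cluster metrics change, though the article summary does not specify how.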

NVIDIA Dynamo Planner Brings SLO-Driven Automation to Multi-Node LLM Inference