Workato's AI Research Lab partnered with DigitalOcean to optimize LLM inference for agentic workloads at scale. By deploying NVIDIA Dynamo with vLLM on DigitalOcean Kubernetes Service (DOKS) using NVIDIA H200 GPUs, the team achieved 67% higher throughput per GPU, 79% lower end-to-end latency, and 77% lower time-to-first-token.
Table of contents
- How LLMs Process Requests and Why It Gets Expensive at Scale
- How KV-Aware Routing Addresses the Problem
- NVIDIA Dynamo with DOKS: The Orchestration Brain for KV-Aware Routing
- Inference Stack Architecture
- The Two Configurations Tested
- Tuning
- Conclusion