NVIDIA Dynamo is being optimized for agentic inference workloads, addressing the write-once-read-many KV cache access patterns seen in tools like Claude Code and Codex. The post covers three layers of optimization:

1. A multi-protocol frontend supporting v1/responses, v1/messages, and v1/chat/completions, with a new "agent hints" API extension that lets harnesses pass scheduling signals such as priority, output-sequence-length estimates, and speculative prefill hints.
2. A KV-aware router with a Flash Indexer achieving 170M ops/s, priority scheduling via a binary heap, and extensible Python-based custom routing strategies (the NeMo Agent Toolkit achieved a 4x p50 TTFT reduction).
3. Advanced KV cache management, including a four-tier memory hierarchy (GPU → CPU → disk → shared storage), selective retention via priority/TTL/token-range directives, cross-worker block sharing via NIXL/RDMA, and agent lifecycle awareness that marks ephemeral blocks (reasoning tokens, terminated-subagent KV) for early eviction.

The goal is to bring managed-API-level cache-reuse performance to self-hosted open-source model deployments.
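To make the router's priority scheduling concrete, here is a minimal sketch of binary-heap scheduling driven by agent hints. This is an illustrative assumption, not Dynamo's actual API: the class names (`PriorityScheduler`, `QueuedRequest`) and hint fields (`priority`, `est_output_tokens`) are hypothetical, standing in for the scheduling signals the post says harnesses can pass.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical sketch: a request queue ordered by an agent-supplied
# priority hint, using Python's heapq as the binary heap.
# Lower priority value = served first (an assumption of this sketch).

@dataclass(order=True)
class QueuedRequest:
    priority: int                              # agent hint: scheduling priority
    seq: int                                   # tiebreaker: FIFO within a priority level
    request_id: str = field(compare=False)
    est_output_tokens: int = field(compare=False, default=0)  # agent hint: length estimate

class PriorityScheduler:
    def __init__(self):
        self._heap: list[QueuedRequest] = []
        self._counter = itertools.count()      # monotonically increasing tiebreaker

    def submit(self, request_id: str, priority: int = 5, est_output_tokens: int = 0):
        heapq.heappush(
            self._heap,
            QueuedRequest(priority, next(self._counter), request_id, est_output_tokens),
        )

    def next_request(self):
        # Pop the highest-priority (lowest value) request, or None if idle.
        return heapq.heappop(self._heap).request_id if self._heap else None

sched = PriorityScheduler()
sched.submit("background-summarize", priority=9)
sched.submit("interactive-turn", priority=1)
sched.submit("subagent-tool-call", priority=5)
print(sched.next_request())  # → interactive-turn
```

The same structure extends naturally to the custom routing strategies the post mentions: a Python strategy could reorder or re-prioritize the heap using the length-estimate hint, for example to avoid head-of-line blocking behind long generations.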

16m read time · From developer.nvidia.com
Table of contents
- Layer 1: The frontend
- Layer 2: The router
- Layer 3: KV cache management
- Closing the gap
