NVIDIA Dynamo is being optimized for agentic inference workloads, addressing the write-once-read-many KV cache access patterns seen in tools like Claude Code and Codex. The post covers three layers of optimization:

1. A multi-protocol frontend supporting v1/responses, v1/messages, and v1/chat/completions, with a new "agent hints" API extension that lets harnesses pass scheduling signals such as priority, output-sequence-length estimates, and speculative prefill hints.
2. A KV-aware router with a Flash Indexer achieving 170M ops/s, priority scheduling via a binary heap, and extensible Python-based custom routing strategies (the NeMo Agent Toolkit achieved a 4x p50 TTFT reduction).
3. Advanced KV cache management, including a four-tier memory hierarchy (GPU → CPU → disk → shared storage), selective retention via priority/TTL/token-range directives, cross-worker block sharing via NIXL/RDMA, and agent lifecycle awareness that marks ephemeral blocks (reasoning tokens, terminated-subagent KV) for early eviction.

The goal is to bring managed-API-level cache-reuse performance to self-hosted open-source model deployments.
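To make the router's priority scheduling concrete, here is a minimal sketch of binary-heap scheduling driven by agent hints. This is an illustrative assumption, not Dynamo's actual API: the class names (`PriorityScheduler`, `QueuedRequest`) and hint fields (`priority`, `est_output_tokens`) are hypothetical, standing in for the scheduling signals the post says harnesses can pass.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical sketch: a request queue ordered by an agent-supplied
# priority hint, using Python's heapq as the binary heap.
# Lower priority value = served first (an assumption of this sketch).

@dataclass(order=True)
class QueuedRequest:
    priority: int                              # agent hint: scheduling priority
    seq: int                                   # tiebreaker: FIFO within a priority level
    request_id: str = field(compare=False)
    est_output_tokens: int = field(compare=False, default=0)  # agent hint: length estimate

class PriorityScheduler:
    def __init__(self):
        self._heap: list[QueuedRequest] = []
        self._counter = itertools.count()      # monotonically increasing tiebreaker

    def submit(self, request_id: str, priority: int = 5, est_output_tokens: int = 0):
        heapq.heappush(
            self._heap,
            QueuedRequest(priority, next(self._counter), request_id, est_output_tokens),
        )

    def next_request(self):
        # Pop the highest-priority (lowest value) request, or None if idle.
        return heapq.heappop(self._heap).request_id if self._heap else None

sched = PriorityScheduler()
sched.submit("background-summarize", priority=9)
sched.submit("interactive-turn", priority=1)
sched.submit("subagent-tool-call", priority=5)
print(sched.next_request())  # → interactive-turn
```

The same structure extends naturally to the custom routing strategies the post mentions: a Python strategy could reorder or re-prioritize the heap using the length-estimate hint, for example to avoid head-of-line blocking behind long generations.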

16m read time · From developer.nvidia.com
Table of contents
- Layer 1: The frontend
- Layer 2: The router
- Layer 3: KV cache management
- Closing the gap
