NVIDIA Dynamo is a high-throughput, low-latency inference framework for serving generative AI and reasoning models in multi-node distributed environments. The framework, written in Rust and Python, supports dynamic GPU scheduling and LLM-aware request routing. It leverages advanced features like disaggregated prefill & decode

3m read timeFrom github.com
Post cover image
Table of contents
InstallationRunning and Interacting with an LLM LocallyLLM Serving

Sort: