NVIDIA Dynamo is a high-throughput, low-latency inference framework for serving generative AI and reasoning models in multi-node distributed environments. The framework, written in Rust and Python, supports dynamic GPU scheduling and LLM-aware request routing. It leverages advanced features like disaggregated prefill & decode
Sort: