Removing the Guesswork from Disaggregated Serving

AIConfigurator is an open-source tool that automates configuration optimization for LLM serving deployments. Instead of exhaustively testing every configuration on GPUs, it decomposes inference into primitive operations (GEMM, attention, MoE dispatch), benchmarks each in isolation, and recombines the measurements to estimate end-to-end performance across thousands of configurations in seconds. It supports both disaggregated and aggregated serving modes, visualizes the tradeoffs as a Pareto frontier, and generates ready-to-deploy Kubernetes artifacts. The tool now supports TensorRT-LLM, SGLang, and vLLM backends through a framework-agnostic abstraction layer, with community contributions from Mooncake and Alibaba. Alibaba's integration achieved 1.86x throughput on Qwen3-235B-FP8, and their HiSim simulator extends AIConfigurator's static analysis to dynamic traffic modeling with under 5% error. The roadmap includes deeper Dynamo platform integration, automated silicon data collection, and dynamic workload modeling.
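The decompose, benchmark, and recombine idea described above can be illustrated with a toy model. The sketch below is not AIConfigurator's API: every name (`OP_LATENCY_MS`, `estimate`, `pareto_frontier`), the latency numbers, the four-layer model, and the linear batch-scaling rule are assumptions made up for illustration. It sums hypothetical per-operation latencies into a per-step estimate for each (tensor-parallel size, batch size) pair, then keeps only the configurations on the latency/throughput Pareto frontier.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical isolated benchmark results in milliseconds, keyed by
# (operation, tensor_parallel_size). All numbers are invented.
OP_LATENCY_MS = {
    ("gemm", 1): 8.0, ("gemm", 2): 4.5, ("gemm", 4): 2.8,
    ("attention", 1): 4.0, ("attention", 2): 2.4, ("attention", 4): 1.6,
}
# Hypothetical per-layer communication cost for each tensor-parallel size.
ALLREDUCE_MS = {1: 0.0, 2: 0.6, 4: 1.0}
NUM_LAYERS = 4  # toy model depth (assumption)


@dataclass(frozen=True)
class Estimate:
    tp: int                      # tensor-parallel size (GPUs per replica)
    batch: int                   # concurrent requests per decode step
    ms_per_step: float           # estimated latency of one decode step
    tokens_per_s_per_gpu: float  # estimated throughput, normalized per GPU


def estimate(tp: int, batch: int) -> Estimate:
    """Recombine per-operation measurements into an end-to-end estimate."""
    per_layer = (
        OP_LATENCY_MS[("gemm", tp)]
        + OP_LATENCY_MS[("attention", tp)]
        + ALLREDUCE_MS[tp]
    )
    # Toy assumption: step latency grows 10% per extra request in the batch.
    ms_per_step = NUM_LAYERS * per_layer * (1 + 0.1 * (batch - 1))
    tokens_per_s_per_gpu = batch / (ms_per_step / 1000.0) / tp
    return Estimate(tp, batch, ms_per_step, tokens_per_s_per_gpu)


def dominates(a: Estimate, b: Estimate) -> bool:
    """True if a is at least as good as b on both axes and better on one."""
    no_worse = (a.ms_per_step <= b.ms_per_step
                and a.tokens_per_s_per_gpu >= b.tokens_per_s_per_gpu)
    better = (a.ms_per_step < b.ms_per_step
              or a.tokens_per_s_per_gpu > b.tokens_per_s_per_gpu)
    return no_worse and better


def pareto_frontier(points: list[Estimate]) -> list[Estimate]:
    """Keep the points that no other point dominates, sorted by latency."""
    kept = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(kept, key=lambda e: e.ms_per_step)


# Sweep the (small) configuration space without touching a GPU.
candidates = [estimate(tp, b) for tp, b in product([1, 2, 4], [1, 8, 32])]
frontier = pareto_frontier(candidates)

for e in frontier:
    print(f"tp={e.tp} batch={e.batch} "
          f"{e.ms_per_step:.1f} ms/step {e.tokens_per_s_per_gpu:.1f} tok/s/GPU")
```

Because each estimate is a few dictionary lookups and arithmetic operations, sweeping thousands of configurations this way costs far less than running each one on hardware, which is the tradeoff the summary describes. A real performance model would need measured operation data and a more faithful treatment of batching, scheduling, and communication.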

Mar 09 • 9m read time • From developer.nvidia.com

Table of contents

Using AIConfigurator to configure disaggregated serving
Extending support to multiple frameworks
WideEP inference for SGLang
How the SGLang community is contributing
What’s next for AIConfigurator
