Modern AI workloads have moved well beyond single-GPU inference serving. Model parallelism, which efficiently splits computation across many GPUs…

NVIDIA Developer

NVIDIA introduces Wide Expert Parallelism (Wide-EP) in TensorRT-LLM to scale large Mixture-of-Experts (MoE) models such as DeepSeek-R1 across GB200 NVL72 rack-scale systems. By distributing experts across eight or more GPUs, Wide-EP reduces per-GPU weight-loading overhead, improves GPU utilization through better load balancing, and uses the 130 TB/s of NVLink bandwidth to minimize communication bottlenecks. Testing shows up to 1.8x higher per-GPU throughput than smaller expert-parallel configurations, which improves inference economics and total cost of ownership for trillion-parameter model deployments.
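To make the weight-loading argument concrete, here is a minimal sketch of how the number of experts resident on each GPU shrinks as the expert-parallel (EP) group widens. It is an illustration only: the helper name `experts_per_gpu` is hypothetical and not part of TensorRT-LLM, the contiguous even split is an assumed placement policy, and the figure of 256 routed experts per MoE layer is an assumption chosen to resemble a DeepSeek-R1-sized model.

```python
def experts_per_gpu(num_experts: int, ep_size: int) -> list[int]:
    """Return how many experts each EP rank holds under an even split.

    Hypothetical helper for illustration; real placement in a serving
    stack may also replicate or rebalance hot experts.
    """
    if num_experts <= 0 or ep_size <= 0:
        raise ValueError("num_experts and ep_size must be positive")
    base, extra = divmod(num_experts, ep_size)
    # The first `extra` ranks take one additional expert when the
    # division is not exact.
    return [base + (1 if rank < extra else 0) for rank in range(ep_size)]


if __name__ == "__main__":
    NUM_ROUTED_EXPERTS = 256  # assumed value for illustration
    for ep_size in (8, 16, 32, 64):
        counts = experts_per_gpu(NUM_ROUTED_EXPERTS, ep_size)
        print(f"EP={ep_size:>2}: at most {max(counts)} experts per GPU")
```

With fewer experts resident per GPU, each GPU has fewer expert weights to load per layer, which is the effect the summary attributes to Wide-EP. The trade-off is more cross-GPU token routing, which is where the NVLink bandwidth matters.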

Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems

How to achieve large-scale expert parallelism

Wide-EP with TensorRT-LLM and NVIDIA Dynamo

What are the performance and workload economics?