NVIDIA introduces Wide Expert Parallelism (Wide-EP) in TensorRT-LLM to scale large Mixture-of-Experts (MoE) models such as DeepSeek-R1 efficiently across GB200 NVL72 rack systems. By distributing experts across eight or more GPUs, Wide-EP reduces per-GPU weight-loading overhead, improves GPU utilization through better load balancing, and uses the rack's 130 TB/s of NVLink bandwidth to keep communication from becoming the bottleneck. Testing shows up to 1.8x higher per-GPU throughput than smaller expert-parallel configurations, which improves inference economics and total cost of ownership for trillion-parameter model deployments.
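To make the weight-loading point concrete, here is a minimal sketch of how the number of experts each GPU must hold shrinks as the expert-parallel group widens. It assumes 256 routed experts per MoE layer (the DeepSeek-R1 figure) and an even split across GPUs; `experts_per_gpu` is a hypothetical helper for illustration, not a TensorRT-LLM API.

```python
def experts_per_gpu(num_experts: int, ep_size: int) -> int:
    """Experts hosted by the most loaded GPU when experts are split across ep_size GPUs."""
    if ep_size <= 0:
        raise ValueError("ep_size must be positive")
    # Ceiling division: if the split is uneven, some GPU holds one extra expert.
    return -(-num_experts // ep_size)


# Assumption: routed experts per MoE layer, as in DeepSeek-R1.
NUM_EXPERTS = 256

for ep_size in (8, 16, 32, 64):
    count = experts_per_gpu(NUM_EXPERTS, ep_size)
    print(f"EP={ep_size:>2}: {count} experts per GPU per layer")
```

Fewer resident experts per GPU means fewer expert weights to stream from memory per layer, which is the overhead the summary says Wide-EP reduces. The sketch ignores shared experts, replication for load balancing, and communication cost.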

9m read time. From developer.nvidia.com
Table of contents
How to achieve large-scale expert parallelism
Wide-EP with TensorRT-LLM and NVIDIA Dynamo
What are the performance and workload economics?
Summary
