A joint PyTorch and Nebius experiment demonstrates up to 41% faster pre-training of DeepSeek-V3 MoE models on a 256-GPU NVIDIA B200 cluster using TorchTitan. Two orthogonal optimizations were evaluated: MXFP8 mixed-precision training (via TorchAO) and DeepEP communication acceleration. For the 671B model, DeepEP alone yielded +32% throughput over the BF16 baseline by replacing standard all-to-all with GPU-initiated RDMA+NVLink kernels. Combining MXFP8 on grouped GEMMs with DeepEP pushed total throughput to 918 tokens/sec, a +41% gain. Convergence experiments on the 16B model confirmed that MXFP8 training matches BF16 loss convergence, with no degradation over 1,500 steps. Future work includes enabling MXFP8 on linear layers once torch.compile supports MXFP8 with tensor parallelism.
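To make the numeric side concrete, below is a minimal sketch of MXFP8 block quantization as defined by the OCP Microscaling (MX) format: FP8 E4M3 elements sharing one power-of-two scale per block of 32 values. This illustrates the format only, not the TorchAO implementation; the function names and the scale-rounding choice are assumptions, and production kernels differ in details.

```python
import torch

BLOCK = 32
E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

def mxfp8_quantize(x: torch.Tensor):
    """Quantize a 1-D tensor (length a multiple of 32) to MXFP8-style blocks.
    Hypothetical helper for illustration, not a TorchAO API."""
    blocks = x.reshape(-1, BLOCK)
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    # The shared scale is a power of two (E8M0-style) chosen so the block's
    # amax lands inside the E4M3 range; ceil avoids clipping the largest value.
    exp = torch.ceil(torch.log2(amax / E4M3_MAX))
    scale = torch.exp2(exp)
    q = (blocks / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

def mxfp8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float32 tensor from blocks and shared scales."""
    return (q.to(torch.float32) * scale).reshape(-1)

x = torch.randn(1024)
q, s = mxfp8_quantize(x)
x_hat = mxfp8_dequantize(q, s)
print((x - x_hat).abs().max())  # small block round-off, no gross error
```

On the communication side, here is a sketch of the expert-parallel token dispatch an MoE layer performs with standard host-initiated all-to-all, i.e. the path DeepEP replaces with GPU-initiated RDMA+NVLink kernels. It uses `torch.distributed.all_to_all_single`; names and shapes are illustrative, and it assumes an initialized NCCL process group. The `.tolist()` device-to-host sync it needs before the second collective is exactly the kind of host-side overhead GPU-initiated kernels avoid.

```python
import torch
import torch.distributed as dist

def dispatch_tokens(tokens: torch.Tensor, send_counts: list[int]) -> torch.Tensor:
    """tokens: (num_tokens, hidden), rows pre-sorted by destination rank.
    send_counts[r] = number of rows routed to rank r. Illustrative only."""
    counts = torch.tensor(send_counts, device=tokens.device)
    recv_counts = torch.empty_like(counts)
    # First all-to-all exchanges row counts so each rank can size its buffer.
    dist.all_to_all_single(recv_counts, counts)
    # .tolist() forces a GPU-to-CPU sync before the payload collective can
    # be launched from the host.
    out = tokens.new_empty((int(recv_counts.sum().item()), tokens.shape[1]))
    dist.all_to_all_single(out, tokens,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts)
    return out
```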

10 min read · From pytorch.org
Table of contents
- TL;DR
- Why This Experiment
- Background
- Hardware and Cluster Environment
- Experiment 1: DeepSeek-V3 671B
- Experiment 2: DeepSeek-V3 16B MoE Loss Convergence Validation
- Summary of Results
- Lessons Learned
- Future works
- Reproducibility
