A joint PyTorch and Nebius experiment demonstrates up to 41% faster pre-training of DeepSeek-V3 MoE models on a 256-GPU NVIDIA B200 cluster using TorchTitan. Two orthogonal optimizations were evaluated: MXFP8 mixed-precision training (via TorchAO) and DeepEP communication acceleration. For the 671B model, DeepEP alone yielded

From pytorch.org · 10 min read
Table of contents
TL;DR
Why This Experiment
Background
Hardware and Cluster Environment
Experiment 1: DeepSeek-V3 671B
Experiment 2: DeepSeek-V3 16B MoE Loss Convergence Validation
Summary of Results
Lessons Learned
Future works
Reproducibility
