A joint PyTorch and Nebius experiment demonstrates up to 41% faster pre-training of DeepSeek-V3 MoE models on a 256-GPU NVIDIA B200 cluster using TorchTitan. Two orthogonal optimizations were evaluated: MXFP8 mixed-precision training (via TorchAO) and DeepEP communication acceleration. For the 671B model, DeepEP alone yielded +32% throughput over the BF16 baseline by replacing standard all-to-all with GPU-initiated RDMA+NVLink kernels. Combining MXFP8 on grouped GEMMs with DeepEP pushed total throughput to 918 tokens/sec, a +41% gain. Convergence experiments on the 16B model confirmed that MXFP8 training matches BF16 loss convergence, with no degradation over 1,500 steps. Future work includes enabling MXFP8 on linear layers once torch.compile supports MXFP8 with tensor parallelism.
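To make the numeric side concrete, below is a minimal sketch of MXFP8 block quantization as defined by the OCP Microscaling (MX) format: FP8 E4M3 elements sharing one power-of-two scale per block of 32 values. This illustrates the format only, not the TorchAO implementation; the function names and the scale-rounding choice are assumptions, and production kernels differ in details.

```python
import torch

BLOCK = 32
E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

def mxfp8_quantize(x: torch.Tensor):
    """Quantize a 1-D tensor (length a multiple of 32) to MXFP8-style blocks.
    Hypothetical helper for illustration, not a TorchAO API."""
    blocks = x.reshape(-1, BLOCK)
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    # The shared scale is a power of two (E8M0-style) chosen so the block's
    # amax lands inside the E4M3 range; ceil avoids clipping the largest value.
    exp = torch.ceil(torch.log2(amax / E4M3_MAX))
    scale = torch.exp2(exp)
    q = (blocks / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

def mxfp8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float32 tensor from blocks and shared scales."""
    return (q.to(torch.float32) * scale).reshape(-1)

x = torch.randn(1024)
q, s = mxfp8_quantize(x)
x_hat = mxfp8_dequantize(q, s)
print((x - x_hat).abs().max())  # small block round-off, no gross error
```

On the communication side, here is a sketch of the expert-parallel token dispatch an MoE layer performs with standard host-initiated all-to-all, i.e. the path DeepEP replaces with GPU-initiated RDMA+NVLink kernels. It uses `torch.distributed.all_to_all_single`; names and shapes are illustrative, and it assumes an initialized NCCL process group. The `.tolist()` device-to-host sync it needs before the second collective is exactly the kind of host-side overhead GPU-initiated kernels avoid.

```python
import torch
import torch.distributed as dist

def dispatch_tokens(tokens: torch.Tensor, send_counts: list[int]) -> torch.Tensor:
    """tokens: (num_tokens, hidden), rows pre-sorted by destination rank.
    send_counts[r] = number of rows routed to rank r. Illustrative only."""
    counts = torch.tensor(send_counts, device=tokens.device)
    recv_counts = torch.empty_like(counts)
    # First all-to-all exchanges row counts so each rank can size its buffer.
    dist.all_to_all_single(recv_counts, counts)
    # .tolist() forces a GPU-to-CPU sync before the payload collective can
    # be launched from the host.
    out = tokens.new_empty((int(recv_counts.sum().item()), tokens.shape[1]))
    dist.all_to_all_single(out, tokens,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts)
    return out
```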

10 min read · From pytorch.org
Table of contents
- TL;DR
- Why This Experiment
- Background
- Hardware and Cluster Environment
- Experiment 1: DeepSeek-V3 671B
- Experiment 2: DeepSeek-V3 16B MoE Loss Convergence Validation
- Summary of Results
- Lessons Learned
- Future works
- Reproducibility
