PyTorch researchers achieved a 30.2% end-to-end training speedup for Llama 4 Scout (a Mixture-of-Experts model) by using MXFP8 precision instead of BF16, running on a 64-node/256-device GB200 cluster via TorchAO and TorchTitan. The post covers convergence results showing equivalent loss curves over 3k+ steps, performance benchmarks, the TorchTitan config for MXFP8 MoE training, the TorchAO MXFP8 MoE training APIs, and future work.
Table of contents
- Performance benchmarks
- TorchTitan Config for MXFP8 MoE training
- TorchAO MXFP8 MoE training APIs
- Future work
- Appendix
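For intuition before the benchmarks, here is a minimal sketch of what MXFP8 block scaling means: per the OCP Microscaling spec, FP8 (e4m3) elements share one power-of-two (E8M0-style) scale per 32-element block. The helper names below are hypothetical and this is emulation in plain PyTorch ops, not the TorchAO implementation, which uses hardware-accelerated kernels on GB200.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude in float8_e4m3fn
BLOCK_SIZE = 32       # OCP MX spec: one shared scale per 32 elements

def mxfp8_quantize(x: torch.Tensor):
    """Quantize a tensor (numel divisible by 32) into FP8 e4m3 elements
    plus one power-of-two scale per 32-element block."""
    blocks = x.to(torch.float32).reshape(-1, BLOCK_SIZE)
    # Per-block max magnitude determines the shared scale.
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=2**-126)
    # E8M0 scales store only an exponent, hence the floor(log2(...)).
    scale = torch.exp2(torch.floor(torch.log2(amax / FP8_E4M3_MAX)))
    # Saturate before casting: with a floored scale, the block max can
    # land up to 2x above FP8_E4M3_MAX.
    q = (blocks / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.to(torch.float8_e4m3fn), scale

def mxfp8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) * scale).reshape(-1)

x = torch.randn(4096)
q, s = mxfp8_quantize(x)
err = (x - mxfp8_dequantize(q, s)).abs().max()
print(f"max abs quantization error: {err.item():.4f}")
```

Because each 32-element block carries its own scale, MXFP8 tracks local dynamic range far better than a single per-tensor FP8 scale, which is what lets it match BF16 loss curves while halving element width.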