PyTorch researchers achieved a 30.2% end-to-end training speedup for Llama 4 Scout (a Mixture-of-Experts model) by using MXFP8 precision instead of BF16, running on a 64-node/256-device GB200 cluster via TorchAO and TorchTitan. The post covers convergence results showing equivalent loss curves over 3k+ steps, performance benchmarks, the TorchTitan config for MXFP8 MoE training, the TorchAO MXFP8 MoE training APIs, and future work.
Table of contents
- Performance benchmarks
- TorchTitan Config for MXFP8 MoE training
- TorchAO MXFP8 MoE training APIs
- Future work
- Appendix
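For intuition before the benchmarks, here is a minimal sketch of what MXFP8 block scaling means: per the OCP Microscaling spec, FP8 (e4m3) elements share one power-of-two (E8M0-style) scale per 32-element block. The helper names below are hypothetical and this is emulation in plain PyTorch ops, not the TorchAO implementation, which uses hardware-accelerated kernels on GB200.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude in float8_e4m3fn
BLOCK_SIZE = 32       # OCP MX spec: one shared scale per 32 elements

def mxfp8_quantize(x: torch.Tensor):
    """Quantize a tensor (numel divisible by 32) into FP8 e4m3 elements
    plus one power-of-two scale per 32-element block."""
    blocks = x.to(torch.float32).reshape(-1, BLOCK_SIZE)
    # Per-block max magnitude determines the shared scale.
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=2**-126)
    # E8M0 scales store only an exponent, hence the floor(log2(...)).
    scale = torch.exp2(torch.floor(torch.log2(amax / FP8_E4M3_MAX)))
    # Saturate before casting: with a floored scale, the block max can
    # land up to 2x above FP8_E4M3_MAX.
    q = (blocks / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.to(torch.float8_e4m3fn), scale

def mxfp8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) * scale).reshape(-1)

x = torch.randn(4096)
q, s = mxfp8_quantize(x)
err = (x - mxfp8_dequantize(q, s)).abs().max()
print(f"max abs quantization error: {err.item():.4f}")
```

Because each 32-element block carries its own scale, MXFP8 tracks local dynamic range far better than a single per-tensor FP8 scale, which is what lets it match BF16 loss curves while halving element width.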