Meta shares how they improved Effective Training Time (ETT%) — the percentage of wall time spent on actual training — from below 90% to over 90% for offline recommendation/ranking workloads. The post defines ETT% and its sub-metrics (Time to Start, Time to Recover, Number of Failures), then details 40+ optimizations across four areas: trainer initialization (communication and pipeline parallelism), PyTorch 2.0 compilation (dynamic shapes handling, MegaCache reducing compile time by ~40%, autotune pruning), checkpoint management (async checkpointing and PyTorch native staging to reduce GPU blocking), and shutdown time (decoupling model publishing from training, saving ~30 minutes per job). Several improvements have been open-sourced via TorchRec and PyTorch 2.
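The ETT% definition lends itself to a back-of-the-envelope calculation from the sub-metrics the post names. The sketch below is illustrative only: it assumes the sub-metrics compose additively (startup time plus per-failure recovery time subtracted from wall time), which is not spelled out here, and the function name and units are invented for the example.

```python
# Hedged sketch: ETT% as the share of wall-clock time spent actually training.
# The additive breakdown (startup + per-failure recovery) is an illustrative
# assumption, not Meta's exact accounting.
def effective_training_time_pct(
    wall_time_hrs: float,
    time_to_start_hrs: float,
    num_failures: int,
    time_to_recover_hrs: float,
) -> float:
    overhead = time_to_start_hrs + num_failures * time_to_recover_hrs
    return 100.0 * (wall_time_hrs - overhead) / wall_time_hrs

# A 100-hour job with 1 hour of startup and 3 failures costing 0.5 hours each
# spends 97.5% of its wall time on actual training.
print(effective_training_time_pct(100.0, 1.0, 3, 0.5))  # 97.5
```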
Table of contents
- Motivation and Introduction
- Effective Training Time Definition
- The Journey to Improve ETT% in Meta
- Technique Deep-Dives
- In the End
- Acknowledgements
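For the checkpoint-management area, the post points to asynchronous checkpointing that stages state off the GPU before writing to storage. The sketch below uses PyTorch's Distributed Checkpoint (DCP) `async_save` API, available in recent PyTorch releases; the toy model, checkpoint path, and single-rank setup are placeholders, and this illustrates the general technique rather than Meta's internal implementation.

```python
# Minimal sketch of async checkpointing with torch.distributed.checkpoint.
# Assumptions: a recent PyTorch build where async_save is available, a toy
# model, and a single rank launched via torchrun; the path is a placeholder.
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp


def main() -> None:
    dist.init_process_group(backend="gloo")  # DCP needs a process group

    model = torch.nn.Linear(128, 16)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    state_dict = {"model": model.state_dict(), "optim": optimizer.state_dict()}

    # async_save first stages tensors to host memory, then persists them in
    # the background so the training loop is blocked only briefly.
    future = dcp.async_save(state_dict, checkpoint_id="/tmp/ckpt_step_100")

    # ... training continues while the checkpoint is written ...

    future.result()  # wait only when durability must be guaranteed
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Keeping the blocking portion to the staging step is what reduces GPU idle time; the actual write overlaps with subsequent training steps.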