Meta shares how they improved Effective Training Time (ETT%) — the percentage of wall time spent on actual training — from below 90% to over 90% for offline recommendation/ranking workloads. The post defines ETT% and its sub-metrics (Time to Start, Time to Recover, Number of Failures), then details 40+ optimizations across four areas: trainer initialization (communication and pipeline parallelism), PyTorch 2.0 compilation (dynamic shapes handling, MegaCache reducing compile time by ~40%, autotune pruning), checkpoint management (async checkpointing and PyTorch native staging to reduce GPU blocking), and shutdown time (decoupling model publishing from training, saving ~30 minutes per job). Several improvements have been open-sourced via TorchRec and PyTorch 2.
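The ETT% definition lends itself to a back-of-the-envelope calculation from the sub-metrics the post names. The sketch below is illustrative only: it assumes the sub-metrics compose additively (startup time plus per-failure recovery time subtracted from wall time), which is not spelled out here, and the function name and units are invented for the example.

```python
# Hedged sketch: ETT% as the share of wall-clock time spent actually training.
# The additive breakdown (startup + per-failure recovery) is an illustrative
# assumption, not Meta's exact accounting.
def effective_training_time_pct(
    wall_time_hrs: float,
    time_to_start_hrs: float,
    num_failures: int,
    time_to_recover_hrs: float,
) -> float:
    overhead = time_to_start_hrs + num_failures * time_to_recover_hrs
    return 100.0 * (wall_time_hrs - overhead) / wall_time_hrs

# A 100-hour job with 1 hour of startup and 3 failures costing 0.5 hours each
# spends 97.5% of its wall time on actual training.
print(effective_training_time_pct(100.0, 1.0, 3, 0.5))  # 97.5
```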
Table of contents
- Motivation and Introduction
- Effective Training Time Definition
- The Journey to Improve ETT% in Meta
- Technique Deep-Dives
- In the End
- Acknowledgements
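For the checkpoint-management area, the post points to asynchronous checkpointing that stages state off the GPU before writing to storage. The sketch below uses PyTorch's Distributed Checkpoint (DCP) `async_save` API, available in recent PyTorch releases; the toy model, checkpoint path, and single-rank setup are placeholders, and this illustrates the general technique rather than Meta's internal implementation.

```python
# Minimal sketch of async checkpointing with torch.distributed.checkpoint.
# Assumptions: a recent PyTorch build where async_save is available, a toy
# model, and a single rank launched via torchrun; the path is a placeholder.
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp


def main() -> None:
    dist.init_process_group(backend="gloo")  # DCP needs a process group

    model = torch.nn.Linear(128, 16)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    state_dict = {"model": model.state_dict(), "optim": optimizer.state_dict()}

    # async_save first stages tensors to host memory, then persists them in
    # the background so the training loop is blocked only briefly.
    future = dcp.async_save(state_dict, checkpoint_id="/tmp/ckpt_step_100")

    # ... training continues while the checkpoint is written ...

    future.result()  # wait only when durability must be guaranteed
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Keeping the blocking portion to the staging step is what reduces GPU idle time; the actual write overlaps with subsequent training steps.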