Pinterest engineers debugged a mysterious 50% throughput drop that appeared during a PyTorch upgrade and manifested as periodic training stalls. Through systematic investigation with profiling tools, they identified two root causes: a PyTorch dispatch mode interfering with torch.compile, and Ray's monitoring process using
13m read time • From medium.com
Table of contents
Introduction
Background
The Problem
Fix the GPU Roofline
Breakdown the Model
Debug torch.compile
Pinpoint the Issue
Fix the End-to-End Throughput
Observations
Diagnosis
Finding the Root Cause
The Culprit and Solution
Summary