NCCL watchdog timeouts are a common and notoriously hard-to-debug failure mode in distributed PyTorch training. The error is a catch-all: it fires whenever a GPU collective operation exceeds its timeout, but the root cause is almost always a collective desync (ranks issuing mismatched or missing collectives) rather than genuine slowness. Based on Meta's fleet experience, over 60% of timeouts

25 min read · From pytorch.org
Table of contents
- Intro: What are collectives in PyTorch?
- Problem statement: The NCCL watchdog timeout error
- Deep dive: What causes NCCL collectives to time out?
- PyTorch’s diagnostic solution: Flight Recorder
- Case studies based on Meta workloads
- Future work
- Acknowledgements
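The desync failure mode summarized above can be illustrated with a small, self-contained simulation. This is an illustrative sketch only, not the real NCCL watchdog: plain Python threads stand in for GPU ranks, and a `threading.Barrier` with a timeout stands in for a collective whose watchdog timer expires. Rank 1 never issues the collective, so rank 0 waits at the barrier until the timeout fires:

```python
import threading

COLLECTIVE_TIMEOUT_S = 0.5  # stand-in for the watchdog timeout

def run_collective(barrier, results, rank, participate):
    """Simulated rank: a 'collective' only completes when every rank arrives."""
    if not participate:
        return  # the desynced rank never enters the collective
    try:
        # The barrier wait models a blocking collective; the timeout models
        # the watchdog firing when peers never show up.
        barrier.wait(timeout=COLLECTIVE_TIMEOUT_S)
        results[rank] = "completed"
    except threading.BrokenBarrierError:
        results[rank] = "watchdog timeout"

barrier = threading.Barrier(2)  # "world size" of 2
results = {}
t0 = threading.Thread(target=run_collective, args=(barrier, results, 0, True))
t1 = threading.Thread(target=run_collective, args=(barrier, results, 1, False))
t0.start(); t1.start(); t0.join(); t1.join()
print(results)  # rank 0 reports a timeout because rank 1 never joined
```

Note that the rank that times out (rank 0) is the healthy one; the rank that caused the desync (rank 1) produces no error at all, which is exactly why these failures are hard to attribute from the timeout message alone.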
