NCCL watchdog timeouts are a common and notoriously hard-to-debug failure mode in distributed PyTorch training. The error is a catch-all triggered when a GPU collective operation exceeds a timeout, but the root cause is almost always collective desync rather than slowness. Based on Meta's fleet experience, over 60% of timeouts stem from CPU-side issues like execution divergence, PT2 compilation asymmetry, or improper error handling. Other causes include GPU compute kernel hangs, misconfigured collective arguments (especially all_to_all splits), and network/hardware failures. PyTorch's Flight Recorder (FR) addresses this by maintaining a per-rank ring buffer of collective metadata, call stacks, and state. On timeout, FR dumps records from all ranks via a TCP side-channel, enabling post-hoc cross-rank analysis. The fr_trace tool aligns and aggregates these records to identify mismatches, and Meta has built a visualization layer on top for rapid diagnosis. Two case studies illustrate how FR pinpointed CPU execution divergence in metric aggregation and misconfigured all_to_all splits in a RecSys workload.
Table of contents
Intro: What are collectives in PyTorch?Problem statement: The NCCL watchdog timeout errorDeep dive: What causes NCCL collectives to time out?PyTorch’s diagnostic solution: Flight RecorderCase studies based on Meta workloadsFuture workAcknowledgementsSort: