Flight Recorder: A New Lens for Understanding NCCL Watchdog Timeouts

NCCL watchdog timeouts are a common and notoriously hard-to-debug failure mode in distributed PyTorch training. The error is a catch-all triggered when a GPU collective operation exceeds a timeout, but the root cause is almost always collective desync rather than slowness. Based on Meta's fleet experience, over 60% of timeouts stem from CPU-side issues like execution divergence, PT2 compilation asymmetry, or improper error handling. Other causes include GPU compute kernel hangs, misconfigured collective arguments (especially all_to_all splits), and network/hardware failures. PyTorch's Flight Recorder (FR) addresses this by maintaining a per-rank ring buffer of collective metadata, call stacks, and state. On timeout, FR dumps records from all ranks via a TCP side-channel, enabling post-hoc cross-rank analysis. The fr_trace tool aligns and aggregates these records to identify mismatches, and Meta has built a visualization layer on top for rapid diagnosis. Two case studies illustrate how FR pinpointed CPU execution divergence in metric aggregation and misconfigured all_to_all splits in a RecSys workload.
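The per-rank ring buffer and the cross-rank comparison described above can be illustrated with a small toy model. This is a minimal sketch, not PyTorch's implementation or the fr_trace tool: `CollectiveRecord`, `RankRecorder`, and `first_mismatch` are hypothetical names invented for illustration, the record holds only an operation name and sizes (no call stacks or state), and the check simply requires every rank to issue the same operation with the same sizes at each sequence number.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectiveRecord:
    """Hypothetical, simplified record of one collective call on one rank."""
    seq_id: int   # position of the call in this rank's issue order
    op: str       # e.g. "all_reduce", "all_gather"
    sizes: tuple  # simplified stand-in for the collective's size metadata


class RankRecorder:
    """Toy per-rank ring buffer: keeps only the most recent `capacity` records."""

    def __init__(self, capacity):
        self._buffer = deque(maxlen=capacity)  # oldest entries are evicted
        self._next_seq = 0

    def record(self, op, sizes):
        self._buffer.append(CollectiveRecord(self._next_seq, op, tuple(sizes)))
        self._next_seq += 1

    def dump(self):
        """Return the buffered records, oldest first."""
        return list(self._buffer)


def first_mismatch(dumps):
    """Find the earliest sequence number at which ranks disagree.

    `dumps` maps rank -> list of CollectiveRecord. Returns
    (seq_id, {rank: (op, sizes) or None}) or None if all ranks agree.
    None in the per-rank view means that rank never issued that call.
    """
    by_seq = {rank: {r.seq_id: r for r in recs} for rank, recs in dumps.items()}
    non_empty = [m for m in by_seq.values() if m]
    if not non_empty:
        return None
    # Start where every non-empty buffer still has data, since older
    # entries may have been evicted from some ring buffers.
    start = max(min(m) for m in non_empty)
    end = max(max(m) for m in non_empty)
    for seq in range(start, end + 1):
        view = {}
        for rank, records in by_seq.items():
            rec = records.get(seq)
            view[rank] = (rec.op, rec.sizes) if rec else None
        if len(set(view.values())) > 1:
            return seq, view
    return None


if __name__ == "__main__":
    rank0, rank1 = RankRecorder(capacity=8), RankRecorder(capacity=8)
    rank0.record("all_reduce", (4,))
    rank0.record("all_gather", (4,))  # rank 1 skips this call (CPU divergence)
    rank0.record("all_reduce", (8,))
    rank1.record("all_reduce", (4,))
    rank1.record("all_reduce", (8,))
    print(first_mismatch({0: rank0.dump(), 1: rank1.dump()}))
```

In this toy run, rank 1 skips one collective, so the two ranks disagree from the second call onward. That is the kind of desync the summary attributes to CPU-side execution divergence. A real analysis needs operation-specific checks (for example, all_to_all splits are legitimately different per rank and must be matched pairwise), which this sketch does not attempt.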

#gpu

#pytorch

Apr 07•25m read time•From pytorch.org

Table of contents

Intro: What are collectives in PyTorch?
Problem statement: The NCCL watchdog timeout error
Deep dive: What causes NCCL collectives to time out?
PyTorch’s diagnostic solution: Flight Recorder
Case studies based on Meta workloads
Future work
Acknowledgements
