PyTorch introduces torchcomms, an experimental communication API designed for distributed training at massive scale (100K+ GPUs). The release includes NCCLX, a new backend used in production for Meta's LLMs like Llama. Torchcomms aims to enable faster prototyping of communication primitives, heterogeneous hardware support,
Table of contents
IntroductionQuickstartDeviceMeshInitial BackendsComposability: torchtitanNew APIsExtensibilityAcknowledgementsSort: