Fusing Communication and Compute with New Device API and Copy Engine Collectives in NVIDIA NCCL 2.28
NVIDIA NCCL 2.28 introduces major performance improvements for multi-GPU communication through GPU-initiated networking, device APIs for communication-compute fusion, and copy engine-based collectives. The release lets developers write custom device kernels that perform network operations directly, offload communication tasks from streaming multiprocessors (SMs) to dedicated copy engines, and achieve zero-SM operation for certain collectives. New features include the NCCL Inspector profiling plugin for always-on observability, native APIs for AllToAll, Gather, and Scatter operations, symmetric kernel group call support, and a flexible environment plugin system. The update also adds CMake build support and an enhanced plugin architecture with shared contexts to better support cross-datacenter training scenarios.
Table of contents
Release highlights
How the NCCL device API enables direct kernel communication
Accelerating NCCL performance with copy engine offload
Profiling and observability made easy with NCCL Inspector
Improved developer experience with NCCL 2.28
CMake-based build system
Get started with NCCL 2.28