Fusing Communication and Compute with New Device API and Copy Engine Collectives in NVIDIA NCCL 2.28

NVIDIA NCCL 2.28 targets multi-GPU communication performance with three additions: GPU-initiated networking, a device API for fusing communication with compute, and copy engine-based collectives. Developers can write custom device kernels that issue network operations directly, and can offload communication from streaming multiprocessors (SMs) to dedicated copy engines, so that certain collectives run without occupying any SMs. The release also adds the NCCL Inspector profiling plugin for always-on observability, native APIs for AllToAll, Gather, and Scatter, group call support for symmetric kernels, and a flexible environment plugin system. CMake build support and a plugin architecture with shared contexts, aimed at cross-datacenter training, round out the update.
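For readers less familiar with the AllToAll, Gather, and Scatter collectives named above, the data movement each one performs can be modeled in a few lines of host-side Python. This is only a sketch of the semantics under simple assumptions (equal-sized chunks, ranks numbered from 0); it is not NCCL code, and the function names `all_to_all`, `gather`, and `scatter` are hypothetical helpers invented for illustration, not the NCCL API.

```python
from typing import List, Sequence

Chunk = List[float]


def all_to_all(send: Sequence[Sequence[Chunk]]) -> List[List[Chunk]]:
    """Model AllToAll: send[r][p] is the chunk rank r sends to peer p.

    Returns recv, where recv[r][p] is the chunk rank r received from peer p.
    (Hypothetical helper, not an NCCL function.)
    """
    n = len(send)
    if any(len(row) != n for row in send):
        raise ValueError("every rank must provide exactly one chunk per peer")
    # Rank r's slot p is filled with what peer p addressed to rank r.
    return [[list(send[p][r]) for p in range(n)] for r in range(n)]


def gather(send: Sequence[Chunk], root: int) -> List[List[Chunk]]:
    """Model Gather: every rank contributes one chunk; only root receives them.

    Returns one receive buffer per rank; non-root buffers stay empty.
    """
    n = len(send)
    if not 0 <= root < n:
        raise ValueError("root must be a valid rank")
    return [[list(c) for c in send] if r == root else [] for r in range(n)]


def scatter(root_chunks: Sequence[Chunk]) -> List[Chunk]:
    """Model Scatter: root holds one chunk per rank; rank r receives root_chunks[r]."""
    return [list(c) for c in root_chunks]


# Three ranks; the chunk that rank r sends to peer p is tagged [10 * r + p].
send_buffers = [[[10 * r + p] for p in range(3)] for r in range(3)]
recv_buffers = all_to_all(send_buffers)
```

In this model AllToAll is a transpose of the rank-by-peer chunk grid, so applying it twice returns the original layout. Gather and Scatter are the one-to-many and many-to-one halves of that same exchange.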

#nvidia

#gpu

#distributed-systems

#cuda

Nov 11, 2025 • 9m read time • From developer.nvidia.com

Table of contents

Release highlights
How the NCCL device API enables direct kernel communication
Accelerating NCCL performance with copy engine offload
Profiling and observability made easy with NCCL Inspector
Improved developer experience with NCCL 2.28
CMake-based build system
Get started with NCCL 2.28
