The C++ template library CUB is a go-to for high-performance GPU primitive algorithms, but its traditional “two-phase” API, which separates memory estimation…

CUDA 13.1 introduces a new single-call API for CUB, the GPU primitives library that ships as part of the CUDA Core Compute Libraries (CCCL). It removes the need for the traditional two-phase pattern, in which a first call estimates the temporary storage size, the caller allocates that storage, and a second call runs the algorithm. The new API manages this allocation automatically with zero performance overhead, and it adds an extensible environment argument that lets developers configure execution options such as custom memory resources and streams. The change addresses the widespread practice of wrapping CUB calls in macros (as seen in PyTorch) and provides a cleaner, more maintainable interface without giving up the flexibility that advanced use cases need.
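To make the difference concrete, here is a minimal host-only sketch of the two calling conventions. It does not use CUB and does not reproduce the CUDA 13.1 signatures: `mock_two_phase_sum` and `mock_single_call_sum` are hypothetical names invented for this illustration, and the scratch-size rule is arbitrary. The only thing it borrows from CUB is the well-known convention that passing a null temporary-storage pointer turns the call into a size query.

```cpp
#include <cstddef>
#include <cstring>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a two-phase primitive (NOT a CUB function).
// Phase 1: temp_storage == nullptr -> only report the scratch size needed.
// Phase 2: temp_storage != nullptr -> run the algorithm using that scratch.
inline bool mock_two_phase_sum(void* temp_storage,
                               std::size_t& temp_storage_bytes,
                               const int* in, long long* out,
                               std::size_t num_items) {
  const std::size_t required = sizeof(long long);  // arbitrary: one accumulator
  if (temp_storage == nullptr) {
    temp_storage_bytes = required;  // size query, no work is done
    return true;
  }
  if (temp_storage_bytes < required) {
    return false;  // caller supplied too little scratch space
  }
  const long long sum = std::accumulate(in, in + num_items, 0LL);
  std::memcpy(temp_storage, &sum, sizeof sum);  // stage result in scratch
  std::memcpy(out, temp_storage, sizeof sum);   // then publish it
  return true;
}

// Hypothetical single-call wrapper: runs both phases and owns the scratch
// buffer, so the caller never sees temp_storage or its size.
inline bool mock_single_call_sum(const std::vector<int>& in, long long& out) {
  std::size_t bytes = 0;
  if (!mock_two_phase_sum(nullptr, bytes, in.data(), &out, in.size())) {
    return false;
  }
  std::vector<unsigned char> scratch(bytes);  // allocation hidden from caller
  return mock_two_phase_sum(scratch.data(), bytes, in.data(), &out, in.size());
}
```

A caller of the wrapper writes one line (`mock_single_call_sum(values, total)`) instead of a query, an allocation, and a second call. That is the boilerplate macro wrappers have typically been used to hide. In the real library the scratch buffer lives in device memory, and, per the summary above, its allocation is governed by the environment argument rather than by a host-side `std::vector`.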

Streamlining CUB with a Single-Call API