A computation is considered deterministic if multiple runs with the same input data produce the same bitwise result. While this may seem like a simple property…

NVIDIA Developer

NVIDIA CCCL 3.1 introduces a new single-phase API for CUB's reduction algorithms that lets developers explicitly control floating-point determinism. Three levels are available: not_guaranteed (the fastest; it uses atomics, so results may vary between runs), run_to_run (the default; it uses a hierarchical tree reduction, so results are consistent across runs on the same GPU), and gpu_to_gpu (the strictest; it uses a Reproducible Floating-point Accumulator with binned exponent grouping, so results are identical across different GPUs). GPU-to-GPU determinism costs 20–30% in performance for large inputs but provides tighter numerical error bounds than standard pairwise summation. The feature is currently limited to reductions, with plans to extend it to other parallel primitives.
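
The reason a reduction needs a determinism level at all is that floating-point addition is not associative, so the order in which partial sums are combined can change the result. The following is a minimal host-side C++ sketch of that effect, not the CUB API: the names sequential_sum, tree_sum, bitwise_equal, and run_demo are hypothetical helpers introduced only for illustration, and the sketch assumes IEEE 754 single precision without fast-math optimizations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Left-to-right accumulation: the rounding at each step depends on element order.
float sequential_sum(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) {
        acc += x;
    }
    return acc;
}

// Pairwise (tree) reduction whose shape depends only on n, so the same input
// is always combined in the same order.
float tree_sum(const float* data, std::size_t n) {
    if (n == 0) return 0.0f;
    if (n == 1) return data[0];
    const std::size_t half = n / 2;
    return tree_sum(data, half) + tree_sum(data + half, n - half);
}

// Compare raw bit patterns, which is what "bitwise identical" means.
bool bitwise_equal(float a, float b) {
    std::uint32_t ua = 0, ub = 0;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);
    return ua == ub;
}

// Hypothetical demo: the same three values summed in two different orders.
void run_demo() {
    const std::vector<float> a = {1.0e8f, 1.0f, -1.0e8f};
    const std::vector<float> b = {1.0e8f, -1.0e8f, 1.0f};
    // In `a`, the 1.0f is absorbed when added to 1.0e8f first, so the sums differ.
    assert(!bitwise_equal(sequential_sum(a), sequential_sum(b)));
    // A fixed reduction order reproduces its own result exactly.
    assert(bitwise_equal(tree_sum(a.data(), a.size()),
                         tree_sum(a.data(), a.size())));
}
```

An atomics-based reduction behaves like the reordered case, because the order in which threads commit their partial sums is not fixed. A fixed tree, as in tree_sum, removes that source of variation for a given reduction shape, which is the intuition behind run-to-run determinism.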

Controlling Floating-Point Determinism in NVIDIA CCCL