NVIDIA CCCL 3.1 introduces a new single-phase API for CUB's reduction algorithms that lets developers explicitly control floating-point determinism. Three levels are available: not_guaranteed (the fastest; it uses atomics, so results may vary between runs), run_to_run (the default; a hierarchical tree reduction that gives consistent results across runs on the same GPU), and gpu_to_gpu (the strictest; it uses a Reproducible Floating-point Accumulator with binned exponent grouping and gives identical results across different GPUs). GPU-to-GPU determinism costs 20–30% in performance for large inputs, but it provides tighter numerical error bounds than standard pairwise summation. The feature is currently limited to reductions, with plans to extend it to other parallel primitives.
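The reason summation order matters at all is that floating-point addition is not associative. The host-only C++ sketch below illustrates this, and shows why a reduction with a fixed tree shape returns the same value on every call. It is not CUB code and does not use the CCCL API: `sum_in_order` and `pairwise_sum` are hypothetical helper names chosen for this illustration, and the recursive tree is only a stand-in for the hierarchical reduction described above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: plain left-to-right accumulation.
// The result depends on the order in which the elements arrive.
float sum_in_order(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) {
        acc += x;
    }
    return acc;
}

// Hypothetical helper: pairwise reduction over the half-open range [lo, hi).
// The tree shape depends only on the range length, so the same input
// is always combined in the same order.
float pairwise_sum(const std::vector<float>& v, std::size_t lo, std::size_t hi) {
    if (hi == lo) {
        return 0.0f;
    }
    if (hi - lo == 1) {
        return v[lo];
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid, hi);
}

// Convenience overload that reduces the whole vector.
float pairwise_sum(const std::vector<float>& v) {
    return pairwise_sum(v, 0, v.size());
}
```

For example, `sum_in_order({1e20f, 1.0f, -1e20f})` yields `0.0f` because the `1.0f` is absorbed by the large value before it cancels, whereas `sum_in_order({1e20f, -1e20f, 1.0f})` yields `1.0f`. An atomics-based reduction can combine partial results in a different order on each run, which is how the same input can produce different sums. A fixed tree removes that source of variation, but only for a given tree shape.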
Table of contents
Determinism not guaranteed
Run-to-run determinism
GPU-to-GPU determinism
Determinism performance comparison
Conclusion