CUDA Toolkit 13.0 introduces shared memory register spilling, a new optimization that redirects register spills from slower local memory to faster on-chip shared memory. This reduces access latency and L2 cache pressure in register-heavy kernels, with reported gains of roughly 5-10% on real workloads. The optimization is opt-in through a PTX pragma, and it works best with kernels that define launch bounds and do not use dynamically allocated shared memory.
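
As a rough illustration of the opt-in described above, the sketch below places the pragma at the top of a kernel that also declares launch bounds. The pragma string `enable_smem_spilling` is written here as it is understood from NVIDIA's CUDA 13.0 material and should be checked against the toolkit documentation; the kernel `scale_kernel`, its parameters, and the block size of 256 are hypothetical and chosen only for illustration. This trivial kernel is unlikely to spill at all, so it shows placement rather than a measurable speedup.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to show where the opt-in goes.
// __launch_bounds__(256) states the maximum block size, which lets the
// compiler know how much shared memory per block it can set aside for spills.
__global__ void __launch_bounds__(256) scale_kernel(float *data, int n, float factor)
{
    // Assumed opt-in syntax (verify against the CUDA 13.0 documentation):
    // a PTX pragma emitted at the start of the kernel body.
    asm volatile(".pragma \"enable_smem_spilling\";");

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1024;
    float *data = nullptr;

    // Managed memory keeps the host-side setup short.
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) {
        std::printf("allocation failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        data[i] = 1.0f;
    }

    // Block size matches the declared launch bounds; no dynamic shared
    // memory is requested (third launch parameter left at its default).
    scale_kernel<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(data);
        return 1;
    }

    std::printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

Building with `nvcc -Xptxas -v` prints per-kernel register usage along with spill store and load byte counts, which is one way to see whether a given kernel spills in the first place before and after adding the pragma.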

9m read time. From developer.nvidia.com
Table of contents
How does shared memory register spilling optimize performance?
What is an additional solution for register spills introduced in CUDA 13.0?
How to opt in to shared memory register spilling
What are the limitations of shared memory register spilling?
What performance gains are possible for real workloads?
Get started with shared memory register spilling optimization
