CUDA 13.2 expands CUDA Tile support to Ampere and Ada GPU architectures (compute capability 8.x) in addition to Blackwell (10.x, 12.x), with a pip-installable cuTile Python DSL. Core runtime additions include new memcpy-with-attributes APIs, memory pool property queries, and a polymorphic cudaGraphNodeGetParams function. On Windows, GPUs now default to MCDM instead of TCC driver mode, bringing WSL2, container, and advanced memory management support. CCCL 3.2 ships modern idiomatic C++ runtime APIs (cuda::stream, cuda::buffer, cuda::launch) and new algorithms including Top-K selection (up to 5x faster than radix sort), fixed-size segmented reduction (up to 66x speedup), segmented scan, binary search, and FindIf. Python ecosystem improvements include CuPy support for CUDA 13.x, CUDA Stream Protocol interoperability with PyTorch/JAX, bfloat16 support, and cuda.core 0.6 with NVML bindings and stable CUDA Graphs API. Developer tooling gains NVIDIA Nsight Python for decorator-based kernel profiling, first-ever Numba-CUDA GPU debugging via CUDA-GDB, and Nsight Compute 2026.1 with report clustering and register dependency analysis. Embedded platforms gain MIG support on Jetson Thor for mixed-criticality workloads.
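To make two of the algorithm names above concrete, here is a minimal CPU-only sketch in plain Python of what Top-K selection and fixed-size segmented reduction compute. It is not the CCCL API and says nothing about GPU performance: the function names `top_k` and `fixed_size_segmented_sum` and the sample data are hypothetical, chosen only for illustration.

```python
import heapq


def top_k(values, k):
    """Return the k largest values in descending order.

    Illustrative only: selects k items without fully sorting the input,
    which is the idea behind Top-K selection (not the CCCL implementation).
    """
    return heapq.nlargest(k, values)


def fixed_size_segmented_sum(values, segment_size):
    """Sum each consecutive run of `segment_size` elements.

    Illustrative only: every segment has the same length, so segment
    boundaries are implied by the size rather than stored as offsets.
    """
    if segment_size <= 0 or len(values) % segment_size != 0:
        raise ValueError("length must be a positive multiple of segment_size")
    return [
        sum(values[i:i + segment_size])
        for i in range(0, len(values), segment_size)
    ]


data = [7, 2, 9, 4, 1, 8, 3, 6]          # hypothetical sample input
top3 = top_k(data, 3)                     # [9, 8, 7]
segment_sums = fixed_size_segmented_sum(data, 4)  # [22, 18]
```

Selecting only the top three values gives the same answer as sorting everything and slicing, which is why a dedicated selection routine can do less work than a full sort.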

15m read time · From developer.nvidia.com
Table of contents: cuTile Python, Core enhancements, Math libraries, Developer tools, CCCL, CUDA Python, Get started with CUDA 13.2
