CUDA 13.3 introduces C++ support for NVIDIA CUDA Tile, a tile-based GPU programming model that abstracts away low-level SIMT details such as thread indexing, memory movement, and hardware-specific features (tensor cores, shared memory, TMA). Developers write kernels using tile abstractions (tensor spans, partition views, and tile operations), while the compiler handles parallelism and optimization automatically. The post walks through a vector addition example comparing SIMT and tile-based C++ code, a complete runnable example with performance hints (__restrict__, alignment), and a matrix multiply kernel using ct::mma. CUDA Tile C++ requires CUDA Toolkit 13.3 and a GPU with compute capability 8.0 or newer.
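To make the contrast between per-element and tile-based thinking concrete, here is a minimal sketch in plain host C++. It is not CUDA Tile code and does not use the CUDA Tile API. The names add_per_element, add_tile, add_per_tile, and kTileSize are hypothetical, chosen only for illustration. The sketch assumes nothing beyond the standard library and shows the general idea: the per-element version does explicit index arithmetic for every value, while the tile version splits the data into fixed-size chunks and applies one whole-tile operation to each.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical tile width for this illustration only.
constexpr std::size_t kTileSize = 4;

// Per-element style: each index is handled individually,
// much like one SIMT thread per element.
void add_per_element(const std::vector<float>& a,
                     const std::vector<float>& b,
                     std::vector<float>& c) {
    for (std::size_t i = 0; i < c.size(); ++i) {
        c[i] = a[i] + b[i];
    }
}

// One whole-tile operation: add len contiguous elements.
void add_tile(const float* a, const float* b, float* c, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Tile style: partition the vectors into tiles and process each tile
// as a unit. The last tile may be shorter than kTileSize.
void add_per_tile(const std::vector<float>& a,
                  const std::vector<float>& b,
                  std::vector<float>& c) {
    const std::size_t n = c.size();
    for (std::size_t start = 0; start < n; start += kTileSize) {
        const std::size_t len = std::min(kTileSize, n - start);
        add_tile(a.data() + start, b.data() + start, c.data() + start, len);
    }
}
```

Both functions expect a, b, and c to have the same length and produce identical results. In this sketch the tile loop runs sequentially on the CPU; in a tile-based GPU model, mapping tiles onto hardware parallelism is left to the compiler.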

15m read time
From developer.nvidia.com
Table of contents
What is CUDA Tile C++?
CUDA Tile C++ vector add example
Complete vector add example
Developer tools
Matrix multiply
Get started today with CUDA Tile C++
