Developers can now use NVIDIA CUDA Tile programming within large existing C++ GPU codebases to develop highly optimized GPU kernels using tile-based…



CUDA 13.3 introduces C++ support for NVIDIA CUDA Tile, a tile-based GPU programming model that abstracts away low-level SIMT details such as thread indexing, memory movement, and hardware-specific features (tensor cores, shared memory, TMA). Developers write kernels using tile abstractions (tensor spans, partition views, and tile operations), while the compiler handles parallelism and optimization automatically. The post walks through a vector addition example that compares SIMT and tile-based C++ code, a complete runnable example with performance hints (__restrict__, alignment), and a matrix multiply kernel using ct::mma. It requires CUDA Toolkit 13.3 and a GPU with compute capability 8.0 or newer.
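To make the SIMT versus tile contrast concrete, here is a minimal host-only C++ sketch of vector addition written both ways. It is a CPU emulation for illustration and does not use the CUDA Tile API: kTile, TileView, partition, add_simt_style, and add_tile_style are hypothetical names invented here, and the real tensor span and partition view types may look quite different.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile width, chosen only for illustration.
constexpr std::size_t kTile = 4;

// SIMT-style view: one logical thread per element. Each "thread" derives its
// global index from a block index and a thread index, then bounds-checks it.
std::vector<float> add_simt_style(const std::vector<float>& a,
                                  const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> c(n);
    const std::size_t blocks = (n + kTile - 1) / kTile;  // ceil(n / kTile)
    for (std::size_t block = 0; block < blocks; ++block) {
        for (std::size_t thread = 0; thread < kTile; ++thread) {
            const std::size_t i = block * kTile + thread;  // manual indexing
            if (i < n) c[i] = a[i] + b[i];                 // manual tail guard
        }
    }
    return c;
}

// Tile-style view (emulated): a tile is described by where it starts and how
// many elements it holds. This struct is an assumption, not a CUDA Tile type.
struct TileView {
    std::size_t begin;
    std::size_t size;
};

// Split [0, n) into consecutive tiles; the last tile may be shorter.
std::vector<TileView> partition(std::size_t n) {
    std::vector<TileView> tiles;
    for (std::size_t begin = 0; begin < n; begin += kTile)
        tiles.push_back({begin, std::min(kTile, n - begin)});
    return tiles;
}

// The body is written as one whole-tile operation; no per-element index math
// or tail guard appears here because the partition already encodes both.
std::vector<float> add_tile_style(const std::vector<float>& a,
                                  const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (const TileView& t : partition(a.size())) {
        std::transform(a.begin() + t.begin, a.begin() + t.begin + t.size,
                       b.begin() + t.begin, c.begin() + t.begin,
                       [](float x, float y) { return x + y; });
    }
    return c;
}
```

Both functions return the same result for equal-length inputs. The difference is where the index arithmetic and the tail handling live: inside the kernel body in the SIMT-style version, and inside the partition step in the tile-style version, which is the part a tile compiler would be expected to take over.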

Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile