NVIDIA CUDA Tile is a GPU-based programming model that targets portability for NVIDIA Tensor Cores, unlocking peak GPU performance. One of the great things…


NVIDIA Developer

NVIDIA is integrating CUDA Tile as a backend for OpenAI Triton, so developers can compile Triton kernels to CUDA Tile IR instead of PTX. CUDA Tile, introduced in CUDA 13.1, shifts GPU programming from thread-level SIMT to tile-based abstractions, which reduces complexity and gives the compiler more room to optimize. The Triton-to-TileIR bridge preserves tile-level semantics and provides native Tensor Core support with architectural portability. The bridge is in active development as an incubator project; it requires CUDA 13.1 or later and a Blackwell GPU, and it is available only as a source build. Known limitations include operations that are not yet supported and suboptimal performance for tensor-of-pointer access patterns; the latter can be addressed by adopting the TMA load/store APIs. Users can switch backends through environment variables without rewriting their kernels.
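As a rough illustration of switching backends through the environment, the sketch below launches an unmodified Triton script in a child process with the backend chosen by a single variable. The variable name `ENABLE_TILE`, the value `"1"`, the fallback behavior when the variable is absent, and the helper names `backend_env` and `run_kernel_script` are all assumptions made for this example; check the Triton-to-TileIR project's README for the actual setting.

```python
import os
import subprocess
import sys

# Assumption: illustrative variable name, not confirmed by this summary.
# Confirm the real name and accepted values in the project's README.
TILE_BACKEND_VAR = "ENABLE_TILE"


def backend_env(use_tile_ir, base_env=None):
    """Return a copy of the environment that selects the requested backend."""
    env = dict(os.environ if base_env is None else base_env)
    if use_tile_ir:
        env[TILE_BACKEND_VAR] = "1"
    else:
        # Assumed behavior: without the variable, Triton uses its default PTX path.
        env.pop(TILE_BACKEND_VAR, None)
    return env


def run_kernel_script(script_path, use_tile_ir):
    """Run an unmodified Triton script in a child process with the chosen backend."""
    return subprocess.run(
        [sys.executable, script_path],
        env=backend_env(use_tile_ir),
        check=True,
    )
```

Running the script in a child process means the variable is set before Triton is imported, and the same kernel source can be run once per backend to compare results or timings.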

Advancing GPU Programming with the CUDA Tile IR Backend for OpenAI Triton