cuTile.jl v0.3 brings major improvements to GPU tile kernel programming in Julia. Highlights: first-class CUDA.jl integration via @cuda backend=cuTile; performance that now matches or exceeds cuTile Python on all benchmarks (with notable wins such as +63% on FMHA attention and +37% on Layer Norm forward); a much shorter time-to-first-launch, now comparable to standard CUDA.jl kernels; array slicing for TileArrays via view/@view; and a tile-vectorized Philox2x32-7 RNG with in-kernel and host-side APIs covering all major numeric types.
Table of contents
Performance: matching cuTile Python
CUDA.jl integration: @cuda backend=cuTile
Time-to-first-launch
Array slicing
Random number generation
What's next