cuTile.jl v0.2 is a major update to the Julia package for writing GPU kernels using NVIDIA's tile-based programming model. Key additions include native `for` loop support (replacing a `while`-loop workaround), a new `ct.@fpmode` macro for scoped floating-point mode control, keyword arguments for most operations, experimental host-level abstractions (`ct.Tiled`) for broadcasting and reductions without explicit kernel code, and in-kernel `print`/`println` debugging. New atomic operations (`atomic_max`, `atomic_min`, `atomic_or`, etc.) and Julia 1.13 support are also included. Under the hood, a new multi-pass optimization pipeline with algebraic simplification, comparison strength reduction, loop-invariant code motion (LICM), constant folding, and alias-aware token ordering dramatically improves generated Tile IR quality, reducing a layernorm kernel from 10,036 to 3,253 SASS instructions. All ported examples now perform within 10% of their cuTile Python counterparts.
Table of contents
- Breaking changes
- Native `for` loops
- Floating-point mode: `ct.@fpmode`
- Keyword arguments for operations
- Experimental host abstractions
- Debugging with `print` / `println`
- Minor changes
- Performance improvements
- Upcoming webinar