NVIDIA's cuTile Python has a growing library of optimized GPU kernels, and this post covers how to translate them to cuTile.jl for Julia. It walks through a complete matrix multiplication example side-by-side, highlighting non-trivial semantic differences: 0-based vs 1-based indexing, implicit vs explicit broadcasting, row-major vs column-major memory layout, and API naming differences. The key outcome is a reusable AI agent skill packaged in TileGym—a structured directory of translation rules, API mappings, 17 critical pitfalls, a static validator script, and worked examples. With this skill, an LLM agent (e.g., Claude Code) can translate a cuTile Python kernel to validated Julia in a single pass (~4 minutes, ~78K tokens for GEMM) without manual intervention. The broader lesson: for AI-assisted systems work, encoding domain-specific rules in version control is more reliable than relying on general model knowledge.
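To make the indexing and layout differences concrete, here is a minimal sketch, not the actual cuTile APIs, of the index-translation rule an agent must apply when porting a kernel: cuTile Python uses 0-based indices over row-major storage, while cuTile.jl uses 1-based indices over column-major storage. The helper names below are illustrative, not from either library.

```python
# Illustrative only: flat-offset computation for the same logical element
# under the two conventions the post contrasts.

def py_flat_index(row, col, ncols):
    # cuTile Python side: 0-based indices, row-major layout
    return row * ncols + col

def jl_flat_index(row, col, nrows):
    # cuTile.jl side: 1-based indices, column-major layout
    return (col - 1) * nrows + row

# The same logical element of a 4x5 matrix, addressed both ways:
r0, c0 = 2, 3            # 0-based (Python)
r1, c1 = r0 + 1, c0 + 1  # shift to 1-based (Julia)

print(py_flat_index(r0, c0, ncols=5))   # row-major flat offset
print(jl_flat_index(r1, c1, nrows=4))   # column-major flat position
```

An automated translator has to apply both shifts together; fixing only the 1-based offset while keeping row-major strides (or vice versa) silently addresses the wrong element, which is exactly the class of pitfall the skill's validator is meant to catch.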

9-minute read, from developer.nvidia.com
Table of contents

- Cross-DSL GPU kernel translation
- Translating cuTile Python to cuTile.jl
- Workflow generation with agent skills
- The AI agent skill in TileGym
- Results and lessons learned
- Get started using agent skills to translate Python kernels to Julia
