NVIDIA's cuTile Python comes with a growing library of optimized GPU kernels, and this post covers how to translate them to cuTile.jl for Julia. It walks through a complete matrix-multiplication example side by side, highlighting non-trivial semantic differences: 0-based vs. 1-based indexing, implicit vs. explicit broadcasting, row-major vs. column-major memory layout, and API naming differences. The key outcome is a reusable AI agent skill packaged in TileGym: a structured directory of translation rules, API mappings, 17 critical pitfalls, a static validator script, and worked examples. With this skill, an LLM agent (e.g., Claude Code) can translate a cuTile Python kernel into validated Julia in a single pass (~4 minutes and ~78K tokens for GEMM) without manual intervention. The broader lesson: for AI-assisted systems work, encoding domain-specific rules in version control is more reliable than relying on a model's general knowledge.
Table of contents
- Cross-DSL GPU kernel translation
- Translating cuTile Python to cuTile.jl
- Workflow generation with agent skills
- The AI agent skill in TileGym
- Results and lessons learned
- Get started using agent skills to translate Python kernels to Julia
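Before diving into the kernel walkthrough, here is a minimal plain-Python sketch of the first three semantic gaps mentioned in the summary. It is illustrative only and uses ordinary lists, not the cuTile API; all variable names are made up for this example.

```python
rows, cols = 2, 3
a = [[r * cols + c for c in range(cols)] for r in range(rows)]  # a = [[0,1,2],[3,4,5]]

# 1. Indexing: Python's a[0][0] corresponds to Julia's a[1, 1];
#    every literal index in a kernel shifts by one during translation.
first = a[0][0]

# 2. Memory layout: flattening a row-major array walks each row in turn,
#    whereas Julia's column-major layout walks each column in turn.
row_major = [a[r][c] for r in range(rows) for c in range(cols)]
col_major = [a[r][c] for c in range(cols) for r in range(rows)]

# 3. Broadcasting: Python-side code broadcasts implicitly (a + v); Julia
#    requires the explicit dotted form (a .+ v), so dots must be added.
v = [10, 20, 30]
b = [[a[r][c] + v[c] for c in range(cols)] for r in range(rows)]
```

Here `row_major` comes out as `[0, 1, 2, 3, 4, 5]` while `col_major` is `[0, 3, 1, 4, 2, 5]`: the same elements, visited in a different order, which is exactly why stride and layout assumptions must be translated rather than copied.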