NVIDIA's cuTile Python comes with a growing library of optimized GPU kernels, and this post covers how to translate them to cuTile.jl for Julia. It walks through a complete matrix-multiplication example side by side, highlighting non-trivial semantic differences: 0-based vs. 1-based indexing, implicit vs. explicit broadcasting, row-major vs. column-major memory layout, and API naming differences. The key outcome is a reusable AI agent skill packaged in TileGym: a structured directory of translation rules, API mappings, 17 critical pitfalls, a static validator script, and worked examples. With this skill, an LLM agent (e.g., Claude Code) can translate a cuTile Python kernel into validated Julia in a single pass (~4 minutes and ~78K tokens for GEMM) without manual intervention. The broader lesson: for AI-assisted systems work, encoding domain-specific rules in version control is more reliable than relying on a model's general knowledge.
Table of contents
- Cross-DSL GPU kernel translation
- Translating cuTile Python to cuTile.jl
- Workflow generation with agent skills
- The AI agent skill in TileGym
- Results and lessons learned
- Get started using agent skills to translate Python kernels to Julia
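Before diving into the kernel walkthrough, here is a minimal plain-Python sketch of the first three semantic gaps mentioned in the summary. It is illustrative only and uses ordinary lists, not the cuTile API; all variable names are made up for this example.

```python
rows, cols = 2, 3
a = [[r * cols + c for c in range(cols)] for r in range(rows)]  # a = [[0,1,2],[3,4,5]]

# 1. Indexing: Python's a[0][0] corresponds to Julia's a[1, 1];
#    every literal index in a kernel shifts by one during translation.
first = a[0][0]

# 2. Memory layout: flattening a row-major array walks each row in turn,
#    whereas Julia's column-major layout walks each column in turn.
row_major = [a[r][c] for r in range(rows) for c in range(cols)]
col_major = [a[r][c] for c in range(cols) for r in range(rows)]

# 3. Broadcasting: Python-side code broadcasts implicitly (a + v); Julia
#    requires the explicit dotted form (a .+ v), so dots must be added.
v = [10, 20, 30]
b = [[a[r][c] + v[c] for c in range(cols)] for r in range(rows)]
```

Here `row_major` comes out as `[0, 1, 2, 3, 4, 5]` while `col_major` is `[0, 3, 1, 4, 2, 5]`: the same elements, visited in a different order, which is exactly why stride and layout assumptions must be translated rather than copied.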