NVIDIA Developer Blog · May 1, 2026, 00:54 · approx. 14 min read

Automating GPU Kernel Conversion with AI Agents: From cuTile Python to cuTile.jl

#GPU Computing #cuTile #Julia #Code Generation #NVIDIA

TL;DR

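NVIDIA has presented a workflow that uses AI agents to automate the conversion of GPU kernels from cuTile Python to Julia, extending its multi-language ecosystem and improving development efficiency.

AI deep analysis · May 1, 2026, 01:03

Importance / 5-point scale

Depth: 40%

Key points

Automated conversion by AI agents

Code conversion from cuTile Python to cuTile.jl (Julia), which previously depended on manual porting, is now carried out autonomously by an AI agent.

A stronger multi-language ecosystem

The work lets developers in the Python and Julia communities share GPU development resources, which is strategically significant for broadening adoption of the cuTile model.

More efficient tile-based programming

It keeps cuTile's focus on tile-level operations such as loads, stores, and compute, while cutting the development cost of maintaining compatibility across languages.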


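Impact analysis

This announcement is an important step toward removing language barriers in GPU development, and it may accelerate CUDA adoption among the researchers and data scientists who prefer Julia. Building AI agents into the infrastructure lowers the cost of maintaining complex hardware-optimized code and lays the groundwork for a wider range of developers to take part in high-performance computing.

Editorial comment

Handing cross-language code conversion to an AI agent is a highly practical way to improve developer experience (DX). As Python and Julia are increasingly used side by side in high-performance computing, this automation is likely to be a key driver of ecosystem integration.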


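April 30, 2026

AI-Generated Summary

NVIDIA CUDA Tile (cuTile) enables tile-based GPU kernel programming, and cuTile.jl brings this abstraction to Julia, allowing custom GPU kernels without using CUDA C++, which is critical for Julia's scientific computing ecosystem.
Translating GPU kernels from cuTile Python to cuTile.jl involves handling key semantic differences such as 0-based vs. 1-based indexing, row-major vs. column-major memory layout, broadcasting syntax, and kernel API mappings. If mishandled, these cause silent errors.
The TileGym project developed an AI-driven, skill-based workflow that encodes 17 critical translation rules, static validation scripts, and example kernels (add, matmul, softmax), enabling automated, repeatable, and validated conversion of cuTile Python kernels to Julia with minimal manual effort.

AI-generated content may summarize information incompletely. Verify important information.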

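NVIDIA CUDA Tile (cuTile) is a tile-based programming model that lets developers write GPU kernels in terms of tile-level operations (loads, stores, and matrix multiply-accumulate) rather than manually coordinating threads, warps, and shared memory.

cuTile.jl brings the same tile-based approach to the dynamic programming language Julia. Users can write custom GPU kernels without dropping down to NVIDIA CUDA C++. Custom kernels are often essential in Julia's scientific computing ecosystem, which spans differential equations, probabilistic programming, and physics simulations.

cuTile Python has a growing library of optimized kernels for GPU acceleration. The ability to translate those kernels to cuTile.jl gives the Julia ecosystem immediate access to battle-tested implementations, instead of rewriting each one from scratch.

This post covers cross-DSL (domain-specific language) GPU kernel translation: porting cuTile Python kernels to cuTile.jl (Julia). It shows how to:

Translate GPU kernels between cuTile Python and cuTile.jl: Walk through a complete matrix multiplication example side-by-side.

Avoid semantic traps that break naive translations: Indexing, broadcasting, memory layout, and loop forms all diverge between the two DSLs, and silent mismatches produce wrong results, not compiler errors.

Build a repeatable, skill-driven AI workflow: The translation knowledge is packaged into an LLM skill in TileGym that produces validated Julia kernels in a single pass, systematizing a one-off porting effort.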

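Cross-DSL GPU kernel translation

Both cuTile Python and cuTile.jl frontends share the same tiled abstraction, making the translation largely algorithmic. However, the cumulative surface-level differences between the two languages are non-trivial, as shown in Table 1.

| Category | Python (cuTile) | Julia (cuTile.jl) |
|---|---|---|
| Indexing | 0-based (ct.bid(0)) | 1-based (ct.bid(1)) |
| Broadcasting | Implicit (a + b) | Explicit dot syntax (a .+ b) |
| Memory layout | Row-major | Column-major |
| Kernel definition | @ct.kernel decorator | Plain function ... end |
| Constants | param: ct.Constant[int] in the signature, ct.Constant(val) at launch | param::Int in the signature, ct.Constant(val) at launch |
| Type conversion | tile.astype(ct.float32) | convert(ct.Tile{Float32}, tile) |
| Matrix multiply | ct.mma(a, b, acc=acc) | muladd(a, b, acc) |

Table 1. High-level differences when writing tile code in Python and Julia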


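None of these translations are conceptually difficult, but miss one ct.bid(0) that should be ct.bid(1), and you get silent data corruption. Use * instead of .* for element-wise multiply, and Julia silently does a matrix multiply instead. These are the kinds of bugs that waste hours.

A shared abstraction with a finite set of recurring pitfalls is well-suited for an AI-assisted workflow, provided the model is taught what to watch out for.

Translating cuTile Python to cuTile.jl

The process is best understood through actual code. The following examples are from TileGym, where the team ported a set of cuTile Python kernels to cuTile.jl and packaged them as a self-contained Julia subproject.

Matrix multiplication example

The running example uses matmul, which is complex enough to show the key translation challenges. Beyond basic syntax differences, the translation must handle loop structure, TF32 tensor core conversion, and the shift from row-major to column-major layout.

cuTile Python: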

@ct.kernel
def matmul_kernel(A, B, C, tm: ct.Constant[int], tn: ct.Constant[int],
                  tk: ct.Constant[int]):
    bid_m = ct.bid(0)
    bid_n = ct.bid(1)
    num_k = ct.num_tiles(A, axis=1, shape=(tm, tk))
    acc = ct.full((tm, tn), 0, dtype=ct.float32)
    dtype = ct.tfloat32 if A.dtype == ct.float32 else A.dtype
    for k in range(num_k):
        a = ct.load(A, index=(bid_m, k), shape=(tm, tk),
                    padding_mode=ct.PaddingMode.ZERO)
        b = ct.load(B, index=(k, bid_n), shape=(tk, tn),
                    padding_mode=ct.PaddingMode.ZERO)
        a = a.astype(dtype)
        b = b.astype(dtype)
        acc = ct.mma(a, b, acc)
    acc = ct.astype(acc, C.dtype)
    ct.store(C, index=(bid_m, bid_n), tile=acc)

cuTile.jl (Julia):

function matmul_kernel(A::ct.TileArray{T,2}, B::ct.TileArray{T,2}, C::ct.TileArray{T,2},
                       tm::Int, tn::Int, tk::Int) where {T}
    bid_m = ct.bid(1)
    bid_n = ct.bid(2)
    num_k = ct.num_tiles(A, 2, (tm, tk))
    acc = zeros(Float32, tm, tn)
    U = T === Float32 ? ct.TFloat32 : T
    for k in Int32(1):num_k
        a = ct.load(A; index=(bid_m, k), shape=(tm, tk), padding_mode=ct.PaddingMode.Zero)
        b = ct.load(B; index=(k, bid_n), shape=(tk, tn), padding_mode=ct.PaddingMode.Zero)
        a = convert(ct.Tile{U}, a)
        b = convert(ct.Tile{U}, b)
        acc = muladd(a, b, acc)
    end
    acc = convert(ct.Tile{T}, acc)
    ct.store(C; index=(bid_m, bid_n), tile=acc)
    return
end

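Beyond the basic syntax changes, note the following:

The layout flips: The Python row-major A(M,K) becomes column-major A_jl(K,M) in Julia. The accumulator, load indices, and store indices all change accordingly. Get the accumulator shape wrong, say (TM, TN) instead of (TN, TM), and you get wrong results with no compiler warning.

ct.mma → muladd: cuTile.jl maps matrix multiply-accumulate to the Julia standard muladd, and ct.PaddingMode.ZERO becomes ct.PaddingMode.Zero (PascalCase).

Softmax example

Softmax pushes things further. Three strategies were implemented in Julia to handle different tensor sizes: tensor memory accelerator (TMA) single-tile, online, and chunked. On top of the matmul patterns, the softmax function brings in broadcast dot syntax (ct.exp(ct.sub(a, b)) → exp.(a .- b)), renamed reductions (ct.max → maximum, ct.sum → sum, axis +1), and element-wise ct.maximum(a, b) → max.(a, b).

But the real challenge is not syntax. It is maintaining correct running max/sum statistics through the translation.

Workflow generation with agent skills

The primary outcome of this project was not the translated kernels but the skill built to produce them.

Figure 1. The conversion skill packages translation rules, API mappings, examples, validation, and tests into a single reusable workflow

A skill, in this context, is a directory of structured knowledge that lives in the repository and is picked up by an LLM agent. The path to this particular skill is: .claude/skills/converting-cutile-to-julia/.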

.claude/skills/converting-cutile-to-julia/
├── SKILL.md                      # Entry point: workflow overview, top pitfalls
├── translations/
│   └── workflow.md               # Step-by-step conversion with checklists
├── references/
│   ├── api-mapping.md            # Bidirectional Python↔Julia API table
│   ├── critical-rules.md         # 17 rules (indexing, broadcasting, loops, ...)
│   ├── debugging.md              # Error diagnosis for MethodError, IRError, etc.
│   └── testing.md                # Test patterns, tolerances per dtype
├── scripts/
│   └── validate_cutile_jl.py     # Static checker for common anti-patterns
└── examples/
    ├── 01_add/                   # Python→Julia for vector addition
    ├── 02_matmul/                # Python→Julia for matrix multiply
    └── 03_softmax/               # Python→Julia for softmax (3 strategies)

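The critical-rules.md file alone captures 17 pitfalls the team encountered. Table 2 details the most common pitfalls and the associated fixes.

| # | Pitfall | Fix |
|---|---|---|
| 1 | max(a, b) on tiles → IRError | Use max.(a, b) (broadcast dot) |
| 2 | order in ct.load: index positions are wrong | order remaps both shape and index |

Table 2. Pitfalls and corresponding fixes for some of the more common issues encountered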

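There is also a static validator script that catches things like leftover ct.bid(0), for loops inside kernels, and Python-style type names before anything runs on the GPU. With all of this in place, the model does not have to rediscover the conversion rules each time. It reads the skill, follows the checklist, and applies the rules.

The AI agent skill in TileGym

The concrete deliverable is a Julia subproject under julia/ in TileGym, which is open source: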

julia/
├── Project.toml        # Dependencies: CUDA.jl, cuTile.jl, NNlib.jl, Test
├── kernels/
│   ├── add.jl          # 1D element-wise with alpha scaling
│   ├── matmul.jl       # 2D tiled MMA with column-major layout
│   └── softmax.jl      # 3 strategies: TMA, online, chunked
└── test/
    ├── runtests.jl     # Test runner
    ├── test_add.jl
    ├── test_matmul.jl
    └── test_softmax.jl

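These three kernels were deliberately selected. The add kernel is the simplest way to exercise the full translation surface. Matmul adds loop structure, tensor cores, and the layout flip. Softmax introduces multipass algorithms with invariants that have to survive translation. Each kernel has tests that compare against a CPU reference with per-dtype tolerances, including boundary cases where dimensions do not align to tile sizes.

Results and lessons learned

With the skill in place, the workflow for each kernel looked like this:

Pre-flight: Scan the source for patterns that require special handling (for loops, ct.mma, order=, and so on).
Convert: Apply the API mapping and critical rules.
Validate: Run the static checker.
Test: Run Julia tests against reference implementations.
Fix: If something fails, use the debugging guide, fix, and rerun.

For a representative general matrix multiply (GEMM) conversion, the process took about 4 minutes and roughly 78K tokens on a frontier LLM with no manual intervention. Subsequent kernels were faster because the examples and rules were already in the repo.

Table 3 lists the pitfalls that caused bugs during ports, all of which are now handled automatically in the skills.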

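| Pitfall | Symptom | Root cause |
|---|---|---|
| ct.bid(0) left unchanged | Wrong tile loaded, silent corruption | 0-based vs. 1-based indexing |
| a * b used for element-wise multiply | A matrix multiply is performed | Julia's * is matrix multiply; element-wise multiply needs .* |
| Accumulator shape (TM, TN) | Wrong matmul results | Column-major layout needs (TN, TM) |
| ct.PaddingMode.ZERO | UndefVarError | Julia uses PascalCase: .Zero |

Table 3. Common pitfalls, symptoms, and root causes of bugs when porting tile code from Python to Julia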

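The takeaway is not that AI wrote the code. It is the ability to capture what was learned into something the model can reuse next time. A prompt can say, "Be careful with indexing." A skill can say, "Here are the 17 specific things that go wrong, here's how to check for them, and here's a script that catches them automatically."

Now, future ports can start from a repo that already has working examples, a tested API mapping, a static validator, and a debugging guide. Each one takes less effort than the last.

A broader takeaway is that the challenge in using AI for systems work is not code generation. It is producing correct code in domains where the compiler will not catch semantic mistakes. Encoding domain rules in version control, alongside the code they describe, is one way to address this.

To get started using agent skills to translate Python kernels to Julia, try the following commands: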

cd TileGym

# Explore the Julia kernels
ls julia/kernels/ # add.jl, matmul.jl, softmax.jl

# Explore the conversion skill
ls .claude/skills/converting-cutile-to-julia/

# Install Julia dependencies (requires Julia 1.12+, CUDA 13.1+ driver)
julia --project=julia/ -e 'using Pkg; Pkg.instantiate()'

# Run the Julia kernel tests
julia --project=julia/ julia/test/runtests.jl

Requirements:

Julia 1.12+ and NVIDIA CUDA 13.1+ driver
NVIDIA Ampere, NVIDIA Ada, or NVIDIA Blackwell GPU (compute capability 8.x, 10.x, 11.x, 12.x)
An LLM agent with file system access (for example, Claude Code). To use the conversion skill for your own kernels, point your LLM agent at .claude/skills/converting-cutile-to-julia/SKILL.md, provide a cuTile Python kernel as input, and start translating Python kernels to Julia.


Original article (English)

Apr 30, 2026

AI-Generated Summary


NVIDIA CUDA Tile (cuTile) enables tile-based GPU kernel programming, and cuTile.jl brings this abstraction to Julia, allowing custom GPU kernels without using CUDA C++, critical for Julia's scientific computing ecosystem.
Translating GPU kernels from cuTile Python to cuTile.jl involves handling key semantic differences such as 0-based vs. 1-based indexing, row-major vs. column-major memory layout, broadcasting syntax, and kernel API mappings, which if mishandled, cause silent errors.
The TileGym project developed an AI-driven skill-based workflow that encodes 17 critical translation rules, static validation scripts, and example kernels (add, matmul, softmax), enabling automated, repeatable, and validated conversion of cuTile Python kernels to Julia with minimal manual effort.

AI-generated content may summarize information incompletely. Verify important information.

NVIDIA CUDA Tile (cuTile) is a tile-based programming model that enables developers to write GPU kernels in terms of tile-level operations—loads, stores, and matrix multiply-accumulate—rather than manually coordinating threads, warps, and shared memory.

cuTile.jl brings the same tile-based approach to the dynamic programming language Julia. Users can write custom GPU kernels without dropping down to NVIDIA CUDA C++. Custom kernels are often essential in Julia's scientific computing ecosystem, which spans differential equations, probabilistic programming, and physics simulations.

cuTile Python has a growing library of optimized kernels for GPU acceleration. The ability to translate those kernels to cuTile.jl provides the Julia ecosystem with immediate access to battle-tested implementations, instead of rewriting each one from scratch.

This post covers cross-DSL (domain-specific language) GPU kernel translation: porting cuTile Python kernels to cuTile.jl (Julia). It shows how to:

Translate GPU kernels between cuTile Python and cuTile.jl: Walk through a complete matrix multiplication example side-by-side.

Avoid semantic traps that break naive translations: Indexing, broadcasting, memory layout, and loop forms all diverge between the two DSLs—and silent mismatches produce wrong results, not compiler errors.

Build a repeatable, skill-driven AI workflow: The translation knowledge is packaged into an LLM skill in TileGym that produces validated Julia kernels in a single pass, systematizing a one-off porting effort.

Cross-DSL GPU kernel translation

Both cuTile Python and cuTile.jl frontends share the same tiled abstraction, making the translation largely algorithmic. However, the cumulative surface-level differences between the two languages are non-trivial, as shown in Table 1.

None of these translations are conceptually difficult, but miss one ct.bid(0) that should be ct.bid(1), and you get silent data corruption. Use * instead of .* for element-wise multiply, and Julia silently does a matrix multiply instead. These are the kinds of bugs that waste hours.
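The indexing pitfall can be modeled without a GPU. The following is a minimal sketch in plain Python, not cuTile code: `py_bid` and `jl_bid` are hypothetical stand-ins for the two `ct.bid` conventions, and it assumes (as the kernel pair in this post suggests) that cuTile.jl uses 1-based values for both the axis argument and the returned block ID.

```python
def py_bid(block, axis):
    """Model of the cuTile Python convention: 0-based axis, 0-based block ID."""
    return block[axis]


def jl_bid(block, axis):
    """Model of the assumed cuTile.jl convention: 1-based axis, 1-based block ID."""
    if axis < 1:
        raise ValueError("Julia-style axes start at 1")
    return block[axis - 1] + 1


def tile_start_zero_based(bid, tile):
    """0-based position of the first element covered by a 0-based tile ID."""
    return bid * tile


def tile_start_one_based(bid, tile):
    """1-based position of the first element covered by a 1-based tile ID."""
    return (bid - 1) * tile + 1


if __name__ == "__main__":
    block = (2, 5)  # physical block coordinates, counted from zero
    tile = 4

    # Correct translation: ct.bid(0) -> ct.bid(1) and ct.bid(1) -> ct.bid(2).
    py_n = py_bid(block, 1)
    jl_n = jl_bid(block, 2)
    # Both address the same element once the 1-based offset is removed.
    assert tile_start_zero_based(py_n, tile) == tile_start_one_based(jl_n, tile) - 1

    # Stale translation: ct.bid(1) left unchanged on the Julia side.
    stale_n = jl_bid(block, 1)
    # Still a valid call, so nothing fails; it just reads the other grid axis.
    assert stale_n != jl_n
    print("correct N tile:", jl_n, "stale N tile:", stale_n)
```

The stale call is a legal index in the Julia convention, which is why this class of mistake surfaces as wrong data rather than as an error.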

A shared abstraction with a finite set of recurring pitfalls is well-suited for an AI-assisted workflow—if the model is taught what to watch out for.

Translating cuTile Python to cuTile.jl

The process is best understood through actual code. The following examples are from TileGym, where the team ported a set of cuTile Python kernels to cuTile.jl and packaged them as a self-contained Julia subproject.

Matrix multiplication example

The running example uses matmul, which is complex enough to show key translation challenges. Beyond basic syntax differences, the translation must handle loop structure, TF32 tensor core conversion, and the shift from row-major to column-major layout.

cuTile Python:

@ct.kernel
def matmul_kernel(A, B, C, tm: ct.Constant[int], tn: ct.Constant[int],
                  tk: ct.Constant[int]):
    bid_m = ct.bid(0)
    bid_n = ct.bid(1)
    num_k = ct.num_tiles(A, axis=1, shape=(tm, tk))
    acc = ct.full((tm, tn), 0, dtype=ct.float32)
    dtype = ct.tfloat32 if A.dtype == ct.float32 else A.dtype
    for k in range(num_k):
        a = ct.load(A, index=(bid_m, k), shape=(tm, tk),
                    padding_mode=ct.PaddingMode.ZERO)
        b = ct.load(B, index=(k, bid_n), shape=(tk, tn),
                    padding_mode=ct.PaddingMode.ZERO)
        a = a.astype(dtype)
        b = b.astype(dtype)
        acc = ct.mma(a, b, acc)
    acc = ct.astype(acc, C.dtype)
    ct.store(C, index=(bid_m, bid_n), tile=acc)

cuTile.jl (Julia):

function matmul_kernel(A::ct.TileArray{T,2}, B::ct.TileArray{T,2}, C::ct.TileArray{T,2},
                       tm::Int, tn::Int, tk::Int) where {T}
    bid_m = ct.bid(1)
    bid_n = ct.bid(2)
    num_k = ct.num_tiles(A, 2, (tm, tk))
    acc = zeros(Float32, tm, tn)
    U = T === Float32 ? ct.TFloat32 : T
    for k in Int32(1):num_k
        a = ct.load(A; index=(bid_m, k), shape=(tm, tk), padding_mode=ct.PaddingMode.Zero)
        b = ct.load(B; index=(k, bid_n), shape=(tk, tn), padding_mode=ct.PaddingMode.Zero)
        a = convert(ct.Tile{U}, a)
        b = convert(ct.Tile{U}, b)
        acc = muladd(a, b, acc)
    end
    acc = convert(ct.Tile{T}, acc)
    ct.store(C; index=(bid_m, bid_n), tile=acc)
    return
end

Beyond the basic syntax changes, note the following:

The layout flips: The Python row-major A(M,K) becomes column-major A_jl(K,M) in Julia. The accumulator, load indices, and store indices all change accordingly. Get the accumulator shape wrong—say (TM, TN) instead of (TN, TM)—and you get wrong results with no compiler warning.

ct.mma → muladd: cuTile.jl maps matrix multiply-accumulate to the Julia standard muladd, and ct.PaddingMode.ZERO becomes ct.PaddingMode.Zero (PascalCase).
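The layout flip described above follows from how the two conventions linearize a 2D array. As a rough illustration (plain Python with hypothetical helper names, using 0-based indices throughout so that layout is isolated from the indexing issue), a row-major (M, K) buffer and a column-major (K, M) buffer place each element at the same flat offset:

```python
def row_major_offset(i, j, n_cols):
    """Flat offset of element (i, j) in a row-major array with n_cols columns."""
    return i * n_cols + j


def col_major_offset(i, j, n_rows):
    """Flat offset of element (i, j) in a column-major array with n_rows rows."""
    return i + j * n_rows


def flip_shape(shape):
    """Shape of the same buffer when viewed in the opposite layout."""
    return tuple(reversed(shape))


if __name__ == "__main__":
    M, K = 3, 4
    for m in range(M):
        for k in range(K):
            # A[m, k] in row-major (M, K) and A_jl[k, m] in column-major (K, M)
            # refer to the same memory location.
            assert row_major_offset(m, k, K) == col_major_offset(k, m, K)

    TM, TN = 64, 32  # illustrative tile sizes, not taken from the post
    assert flip_shape((TM, TN)) == (TN, TM)
    print("row-major (M, K) and column-major (K, M) address the same buffer")
```

This is why shapes and index tuples reverse together: viewing the same bytes in the other layout transposes every 2D quantity, including the accumulator tile.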

Softmax example

Softmax pushes things further. Three strategies were implemented in Julia—tensor memory accelerator (TMA) single-tile, online, and chunked—to handle different tensor sizes. On top of the matmul patterns, the softmax function brings in broadcast dot syntax (ct.exp(ct.sub(a, b)) → exp.(a .- b)), renamed reductions (ct.max → maximum, ct.sum → sum, axis +1), and element-wise ct.maximum(a, b) → max.(a, b).

But the real challenge isn’t syntax—it’s maintaining correct running max/sum statistics through the translation.
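To make the running max/sum invariant concrete, here is a small CPU sketch of an online softmax over chunks, checked against a two-pass reference. It is not the TileGym kernel; the function names and the chunk size are illustrative assumptions. The key step is rescaling the running sum whenever the running maximum changes.

```python
import math


def softmax_two_pass(xs):
    """Reference softmax: find the global max first, then normalize."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_online(xs, chunk):
    """Softmax that sees the input one chunk at a time.

    Invariant after each chunk: running_sum == sum(exp(x - running_max))
    over all elements seen so far.
    """
    running_max = -math.inf
    running_sum = 0.0
    for start in range(0, len(xs), chunk):
        block = xs[start:start + chunk]
        new_max = max(running_max, max(block))
        # Rescale the old sum to the new max before adding this chunk.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in block)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in xs]


if __name__ == "__main__":
    data = [1.0, 3.0, -2.0, 0.5, 7.0, 7.0, -1.0]
    ref = softmax_two_pass(data)
    out = softmax_online(data, chunk=3)
    assert all(math.isclose(a, b, rel_tol=1e-12) for a, b in zip(ref, out))
    print("online and two-pass softmax agree")
```

Dropping the rescaling line still produces a result that sums to a plausible value for some inputs, which is the kind of invariant a line-by-line syntactic translation can break unnoticed.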

Workflow generation with agent skills

The primary outcome of this project wasn’t the translated kernels—it was the skill built to produce them.

Figure 1. The conversion skill packages translation rules, API mappings, examples, validation, and tests into a single reusable workflow

A skill, in this context, is a directory of structured knowledge that lives in the repository and is picked up by an LLM agent. The path to this particular skill is: .claude/skills/converting-cutile-to-julia/.

.claude/skills/converting-cutile-to-julia/
├── SKILL.md                      # Entry point: workflow overview, top pitfalls
├── translations/
│   └── workflow.md               # Step-by-step conversion with checklists
├── references/
│   ├── api-mapping.md            # Bidirectional Python↔Julia API table
│   ├── critical-rules.md         # 17 rules (indexing, broadcasting, loops, ...)
│   ├── debugging.md              # Error diagnosis for MethodError, IRError, etc.
│   └── testing.md                # Test patterns, tolerances per dtype
├── scripts/
│   └── validate_cutile_jl.py     # Static checker for common anti-patterns
└── examples/
    ├── 01_add/                   # Python→Julia for vector addition
    ├── 02_matmul/                # Python→Julia for matrix multiply
    └── 03_softmax/               # Python→Julia for softmax (3 strategies)

The critical-rules.md alone captures 17 pitfalls the team encountered. Table 2 details the most common pitfalls and the associated fixes.

There’s also a static validator script that catches things like leftover ct.bid(0), for loops inside kernels, and Python-style type names—before running on the GPU. With all of this in place, the model doesn’t have to rediscover the conversion rules each time. It reads the skill, follows the checklist, and applies the rules.
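The post does not show the contents of validate_cutile_jl.py, so the following is only a guess at the general shape of such a checker: a handful of regular expressions run over the translated Julia source. The rule names, patterns, and the `check_julia_source` function are all hypothetical.

```python
import re

# Hypothetical rules: each flags a Python-ism that should not survive
# in translated Julia code. The real checker may use different rules.
RULES = [
    ("zero-based block ID", re.compile(r"ct\.bid\(\s*0\s*\)")),
    ("Python-style padding constant", re.compile(r"PaddingMode\.ZERO\b")),
    ("Python-style type conversion", re.compile(r"\.astype\(")),
    ("Python-style dtype name", re.compile(r"ct\.(float16|float32|float64|tfloat32)\b")),
]


def check_julia_source(source):
    """Return a list of (line_number, rule_name) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore Julia line comments
        for name, pattern in RULES:
            if pattern.search(code):
                findings.append((lineno, name))
    return findings


if __name__ == "__main__":
    stale = (
        "bid_m = ct.bid(0)\n"
        "a = ct.load(A; index=(bid_m, k), shape=(tm, tk), "
        "padding_mode=ct.PaddingMode.ZERO)\n"
    )
    for lineno, name in check_julia_source(stale):
        print(f"line {lineno}: {name}")
```

A text-level check like this cannot prove a kernel correct, but it is cheap enough to run on every conversion before any GPU time is spent.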

The AI agent skill in TileGym

The concrete deliverable is a Julia subproject under julia/ in TileGym, which is open source:

julia/
├── Project.toml        # Dependencies: CUDA.jl, cuTile.jl, NNlib.jl, Test
├── kernels/
│   ├── add.jl          # 1D element-wise with alpha scaling
│   ├── matmul.jl       # 2D tiled MMA with column-major layout
│   └── softmax.jl      # 3 strategies: TMA, online, chunked
└── test/
    ├── runtests.jl     # Test runner
    ├── test_add.jl
    ├── test_matmul.jl
    └── test_softmax.jl

These three kernels were deliberately selected. The add kernel is the simplest way to exercise the full translation surface. Matmul adds loop structure, tensor cores, and the layout flip. Softmax introduces multipass algorithms with invariants that have to survive translation. Each kernel has tests that compare against a CPU reference with per-dtype tolerances, including boundary cases where dimensions don’t align to tile sizes.
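The testing pattern described here (a CPU reference, per-dtype tolerances, and sizes that are not multiples of the tile) can be sketched in plain Python. This is an emulation for illustration, not the TileGym test suite: `tiled_matmul` mimics the tile loop with zero padding, and the tolerance values are placeholders rather than the project's actual numbers.

```python
import math

# Placeholder tolerances per dtype; the real values live in the skill's testing.md.
TOLERANCES = {"float32": 1e-5, "float16": 1e-2}


def load_tile(mat, row0, col0, rows, cols):
    """Read a rows x cols tile, substituting 0.0 outside the matrix (zero padding)."""
    n_rows, n_cols = len(mat), len(mat[0])
    return [[mat[r][c] if r < n_rows and c < n_cols else 0.0
             for c in range(col0, col0 + cols)]
            for r in range(row0, row0 + rows)]


def tiled_matmul(a, b, tm, tn, tk):
    """CPU emulation of the tile loop: accumulate over K tiles, store in-bounds only."""
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for bm in range(-(-m // tm)):          # ceiling division
        for bn in range(-(-n // tn)):
            acc = [[0.0] * tn for _ in range(tm)]
            for bk in range(-(-k // tk)):
                ta = load_tile(a, bm * tm, bk * tk, tm, tk)
                tb = load_tile(b, bk * tk, bn * tn, tk, tn)
                for i in range(tm):
                    for j in range(tn):
                        acc[i][j] += sum(ta[i][p] * tb[p][j] for p in range(tk))
            for i in range(tm):
                for j in range(tn):
                    r, col = bm * tm + i, bn * tn + j
                    if r < m and col < n:
                        c[r][col] = acc[i][j]
    return c


def reference_matmul(a, b):
    """Straightforward triple-loop reference."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def all_close(x, y, tol):
    return all(math.isclose(p, q, rel_tol=tol, abs_tol=tol)
               for row_x, row_y in zip(x, y) for p, q in zip(row_x, row_y))


if __name__ == "__main__":
    # 5x7 times 7x3 with 4x2x3 tiles: no dimension is a multiple of its tile size.
    a = [[float(i + j) for j in range(7)] for i in range(5)]
    b = [[float(i - j) for j in range(3)] for i in range(7)]
    out = tiled_matmul(a, b, tm=4, tn=2, tk=3)
    assert all_close(out, reference_matmul(a, b), TOLERANCES["float32"])
    print("tiled result matches reference on non-aligned shapes")
```

Non-aligned shapes matter because they are the only cases that exercise the padding path; a test suite with tile-aligned sizes alone would pass even if padding were mistranslated.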

Results and lessons learned

With the skill in place, the workflow for each kernel looked like this:

Pre-flight: Scan the source for patterns that require special handling (for loops, ct.mma, order=, and so on).

Convert: Apply the API mapping and critical rules.

Validate: Run the static checker.

Test: Run Julia tests against reference implementations.

Fix: If something fails, use the debugging guide, fix, and rerun.
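The pre-flight step in the list above amounts to a pattern scan over the Python source before any conversion happens. A minimal sketch, assuming a regex-based scan (the pattern set and the `preflight` function are hypothetical; the skill's actual implementation is not shown in the post):

```python
import re

# Patterns the post names as needing special handling during conversion.
PREFLIGHT_PATTERNS = {
    "for loop": re.compile(r"^\s*for\s+\w+\s+in\s+"),
    "ct.mma": re.compile(r"\bct\.mma\("),
    "order=": re.compile(r"\border\s*="),
}

SAMPLE_KERNEL = """\
num_k = ct.num_tiles(A, axis=1, shape=(tm, tk))
for k in range(num_k):
    a = ct.load(A, index=(bid_m, k), shape=(tm, tk))
    b = ct.load(B, index=(k, bid_n), shape=(tk, tn))
    acc = ct.mma(a, b, acc)
"""


def preflight(source):
    """Map each pattern name to the 1-based line numbers where it occurs."""
    report = {name: [] for name in PREFLIGHT_PATTERNS}
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in PREFLIGHT_PATTERNS.items():
            if pattern.search(line):
                report[name].append(lineno)
    return report


if __name__ == "__main__":
    for name, lines in preflight(SAMPLE_KERNEL).items():
        print(f"{name}: lines {lines}")
```

Knowing up front which constructs are present lets the agent pull in only the relevant rules, instead of applying all 17 to every kernel.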

For a representative general matrix multiply (GEMM) conversion, the process took about 4 minutes and ~78K tokens on a frontier LLM with no manual intervention. Subsequent kernels were faster because the examples and rules were already in the repo.

Table 3 lists the pitfalls that caused bugs during ports, all of which are now handled automatically in the skills.

The takeaway isn’t that AI wrote the code. It’s the ability to capture what was learned into something the model can reuse next time. A prompt can say, “Be careful with indexing.” A skill can say, “Here are the 17 specific things that go wrong, here’s how to check for them, and here’s a script that catches them automatically.”

Now, future ports can start from a repo that already has working examples, a tested API mapping, a static validator, and a debugging guide. Each one takes less effort than the last.

A broader takeaway is that the challenge in using AI for systems work isn’t code generation—it’s producing correct code in domains where the compiler won’t catch semantic mistakes. Encoding domain rules in version control, alongside the code they describe, is one way to address this.

Get started using agent skills to translate Python kernels to Julia

Use the following code to try the Julia subproject and the conversion skill:

cd TileGym

# Explore the Julia kernels
ls julia/kernels/ # add.jl, matmul.jl, softmax.jl

# Explore the conversion skill
ls .claude/skills/converting-cutile-to-julia/

# Install Julia dependencies (requires Julia 1.12+, CUDA 13.1+ driver)
julia --project=julia/ -e 'using Pkg; Pkg.instantiate()'

# Run the Julia kernel tests
julia --project=julia/ julia/test/runtests.jl

Requirements:

Julia 1.12+ and NVIDIA CUDA 13.1+ driver

NVIDIA Ampere, NVIDIA Ada, or NVIDIA Blackwell GPU (compute capability 8.x, 10.x, 11.x, 12.x)

An LLM agent with file system access (for example, Claude Code). To use the conversion skill for your own kernels, point your LLM agent at .claude/skills/converting-cutile-to-julia/SKILL.md, provide a cuTile Python kernel as input, and start translating Python kernels to Julia.
