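cuTile.jl Brings NVIDIA CUDA Tile-Based Programming to Julia
NVIDIA has announced cuTile.jl, which enables CUDA Tile-based programming in the Julia language and opens a new path to high-performance GPU programming, including automatic access to tensor cores.
Key points
CUDA Tile programming comes to Julia
cuTile.jl integrates tile-based programming, a significant extension of NVIDIA CUDA, into the Julia language and offers a new programming paradigm for GPUs.
Automatic access to tensor cores
With this technology, developers can automatically use the tensor cores and other specialized hardware of NVIDIA GPUs without complex configuration.
Democratizing high-performance computing
The Julia community gains direct access to NVIDIA’s latest GPU features, which is expected to speed up scientific computing and AI research.
Expanding NVIDIA’s ecosystem
By integrating CUDA Tile technology into Julia, NVIDIA strengthens its influence in a language community that is popular in scientific computing and data science.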
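Impact analysis
This announcement is a major step that integrates NVIDIA’s latest GPU technology directly into Julia, an important language for high-performance computing and AI research. It lets the scientific computing community use specialized hardware such as tensor cores more easily, and is expected to accelerate research and development as well as the creation of new algorithms.
Editorial comment
An important announcement that opens a new door to GPU acceleration for the Julia community. The convergence of scientific computing and AI may advance further.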
Original text
NVIDIA CUDA Tile is one of the most significant additions to NVIDIA CUDA programming and unlocks automatic access to tensor cores and other specialized hardware. Earlier this year, NVIDIA released cuTile for Python, giving Python developers a natural way to write high-performance GPU kernels.
Now, the same programming model is available in Julia through cuTile.jl. In this blog post, we’ll explore how cuTile.jl simplifies the development of high-performance CUDA kernels, demonstrate its idiomatic Julia syntax, and discuss its performance parity with the existing cuTile Python implementation.
What is tile-based GPU programming?
Traditional GPU programming with CUDA requires developers to think about threads, warps, and memory hierarchies. While powerful, this approach requires the programmer to map algorithms onto hardware efficiently. With CUDA Tile, developers describe operations on tiles of data, and the compiler handles the mapping to hardware.
Consider vector addition. In the traditional GPU programming model, using CUDA.jl, the programmer must manage individual threads explicitly:
    using CUDA

    function vadd(a, b, c, n)
        # Global 1-based index of this thread
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= n    # guard threads past the end of the arrays
            @inbounds c[i] = a[i] + b[i]
        end
        return
    end

    threads = 512
    blocks = cld(vector_size, threads)
    @cuda threads=threads blocks=blocks vadd(a, b, c, vector_size)
With CUDA Tile through cuTile.jl, the same operations are now expressed at the tile level, hiding details like index calculations or out-of-bounds checks:
    import cuTile as ct

    function vadd(a, b, c, tile_size)
        pid = ct.bid(1)
        tile_a = ct.load(a, pid, (tile_size,))
        tile_b = ct.load(b, pid, (tile_size,))
        ct.store(c, pid, tile_a + tile_b)
        return
    end

    tile_size = 1024
    grid = cld(vector_size, tile_size)
    ct.launch(vadd, grid, a, b, c, ct.Constant(tile_size))
Compare this with the Python equivalent:
    @ct.kernel
    def vadd(a, b, c, tile_size: ct.Constant[int]):
        pid = ct.bid(0)
        tile_a = ct.load(a, index=(pid,), shape=(tile_size,))
        tile_b = ct.load(b, index=(pid,), shape=(tile_size,))
        ct.store(c, index=(pid,), tile=tile_a + tile_b)

    tile_size = 1024
    grid = ceil(vector_size / tile_size)
    ct.launch(stream, grid, vadd, (a, b, c, tile_size))
The two are strikingly similar, and this is deliberate. cuTile.jl keeps the abstraction level of kernels identical to those written in cuTile Python, making it easy to port code over or learn from the cuTile Python documentation. At the same time, it uses Julia idioms wherever possible to make the package intuitive for Julia programmers, including 1-based indexing and broadcast expressions for element-wise operations.
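To see what the tile-level view hides, the decomposition can be emulated on the CPU in plain Julia. This is only a sketch of the indexing idea, not of how cuTile.jl executes kernels: `tile_range` and `vadd_tiled_cpu!` are hypothetical helpers, and the loop over `pid` stands in for blocks that would run in parallel on the GPU.

```julia
# CPU-only sketch of the tile decomposition used for vector addition.
# `tile_range` and `vadd_tiled_cpu!` are hypothetical names, not cuTile.jl API.

# Index range covered by the 1-based tile `pid`, clipped at the end of the array.
tile_range(pid, tile_size, n) = ((pid - 1) * tile_size + 1):min(pid * tile_size, n)

function vadd_tiled_cpu!(c, a, b, tile_size)
    n = length(a)
    grid = cld(n, tile_size)      # number of tiles, rounded up
    for pid in 1:grid             # stands in for one GPU block per tile
        r = tile_range(pid, tile_size, n)
        c[r] .= a[r] .+ b[r]      # whole-tile operation via broadcasting
    end
    return c
end

a = Float32.(1:10)
b = fill(1.0f0, 10)
c = similar(a)
vadd_tiled_cpu!(c, a, b, 4)       # 3 tiles: 1:4, 5:8, 9:10
```

The last tile is shorter than `tile_size`; the clipping in `tile_range` is the kind of out-of-bounds bookkeeping the tile abstraction takes off the programmer's hands.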
Idiomatic Julia kernels
Where this really shines is in kernels that go beyond simple loads and stores. The following is a row-normalization kernel—the core of layer normalization, without the weights and bias:
    function normalize_rows(X, Y, tile_n)
        bid = ct.bid(1)
        tile = ct.load(X, (bid, 1), (1, tile_n))
        mean = sum(tile; dims=2) / size(X, 2)
        centered = tile .- mean
        var = sum(centered .^ 2.0f0; dims=2) / size(X, 2)
        ct.store(Y, (bid, 1), centered ./ sqrt.(var .+ 1f-5))
        return
    end
In this example, sum, size, and sqrt are standard Julia functions augmented to work on tiles. The dots (.^, .-, ./) are standard Julia broadcasting syntax, showing the operation is applied element-wise. The kernel reads like regular Julia array code. The closer cuTile.jl kernels are to ordinary Julia, the easier it is to share and reuse code between the CPU and GPU.
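As a rough illustration of that overlap, the same arithmetic runs unchanged on an ordinary CPU matrix. The function below is a hypothetical reference version (the name `normalize_rows_cpu` is ours, and it makes no cuTile.jl calls); it processes all rows at once instead of one row per block.

```julia
# Plain-Julia CPU version of the same arithmetic; `normalize_rows_cpu` is a
# hypothetical name, and no cuTile.jl calls are involved.
function normalize_rows_cpu(X::AbstractMatrix)
    n = size(X, 2)
    mean = sum(X; dims=2) / n                 # per-row mean, shape (rows, 1)
    centered = X .- mean                      # broadcast across columns
    var = sum(centered .^ 2.0f0; dims=2) / n  # per-row (biased) variance
    return centered ./ sqrt.(var .+ 1f-5)     # same epsilon as the kernel above
end

X = Float32[1 2 3 4; 10 20 30 40]
Y = normalize_rows_cpu(X)   # each row now has mean ≈ 0 and variance ≈ 1
```

A reference like this could serve as a correctness check for the GPU kernel, by comparing its output with the array the kernel writes to `Y`.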
Performance of cuTile.jl
cuTile.jl targets the same NVIDIA Tile IR backend as cuTile Python, so both packages produce the same kind of GPU machine code. On an NVIDIA GeForce RTX 5080 (compute capability 12.0, NVIDIA Blackwell architecture), compute-intensive kernels achieve performance parity with the Python implementation:
Kernel                   cuTile.jl      cuTile Python   cuTile.jl compared to cuTile Python
Vector addition          838 GB/s       843 GB/s        99%
Matrix transpose         797 GB/s       812 GB/s        98%
Matrix multiplication    50.9 TFLOPS    50.5 TFLOPS     100%
Batch matrix multiply    43.0 TFLOPS    47.5 TFLOPS     91%

Table 1. Performance comparison of common GPU kernels when using Julia or Python as the front-end
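For readers unfamiliar with the units, figures like these are conventionally derived from the problem size and the measured run time. The post does not say how its numbers were measured, so the helpers below are assumptions based on the usual conventions, and the timings are made up purely to keep the arithmetic easy to follow.

```julia
# Hypothetical helpers; the source does not state its measurement method.

# Vector addition reads two arrays and writes one: 3 * n elements moved.
vadd_bandwidth_gbs(n, T, seconds) = 3 * n * sizeof(T) / seconds / 1e9

# A dense (m x k) * (k x n) product performs about 2 * m * n * k
# floating-point operations (one multiply and one add per term).
matmul_tflops(m, n, k, seconds) = 2 * m * n * k / seconds / 1e12

# Made-up timings, not measurements:
bw = vadd_bandwidth_gbs(1_000_000, Float32, 1e-3)  # about 12 GB/s
tf = matmul_tflops(1000, 1000, 1000, 1e-3)         # about 2 TFLOPS
```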
Some kernels with more complex control flow, such as layer normalization or FFT, don’t reach full performance parity, as the cuTile.jl compiler is still maturing. These are tracked as known issues and are actively being worked on.
How cuTile.jl works
cuTile.jl uses a custom Julia compiler that intercepts standard library calls such as +, sum, and reshape, and routes them to Tile IR operations. The resulting IR is then lowered to Tile IR bytecode, the same binary format that cuTile Python produces. From there, the NVIDIA tileiras compiler handles the final compilation to GPU machine code.
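To make the idea of rerouting standard calls more concrete, here is a toy sketch in plain Julia. It is not how cuTile.jl is implemented (the post only says that a custom compiler performs the interception). The sketch relies on ordinary multiple dispatch over a made-up `ToyTile` type, and the operation names `add` and `reduce_sum` are invented mnemonics, not real Tile IR.

```julia
# Toy stand-in for a tile value: it carries only a symbolic name.
struct ToyTile
    name::String
end

# Log of recorded pseudo-operations (invented mnemonics, not real Tile IR).
const TRACE = String[]

# Append one pseudo-instruction and return a fresh symbolic result.
function record!(op::String, args::ToyTile...)
    result = ToyTile("%" * string(length(TRACE)))
    push!(TRACE, result.name * " = " * op * " " * join((a.name for a in args), ", "))
    return result
end

# Standard functions gain ToyTile methods that record instead of compute.
Base.:+(a::ToyTile, b::ToyTile) = record!("add", a, b)
Base.sum(a::ToyTile) = record!("reduce_sum", a)

a = ToyTile("%a")
b = ToyTile("%b")
total = sum(a + b)        # ordinary-looking Julia code
foreach(println, TRACE)
# prints:
# %0 = add %a, %b
# %1 = reduce_sum %0
```

The point of the sketch is only that user code keeps calling `+` and `sum`, while what gets produced is a description of operations rather than a computed value.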
The generated Tile IR can be inspected for any kernel:
    julia> ct.@device_code_tiled ct.launch(vadd, grid, a, b, c, ct.Constant(16))
    cuda_tile.module @kernels {
      entry @vadd(%arg0: tile<ptr<f32>>, %arg1: tile<i32>, ...) {
        ...
        return
      }
    }
This transparency is valuable for debugging and for understanding how high-level Julia code maps to tile operations.
Current status of cuTile.jl
cuTile.jl is an experimental, open-source package under active development at JuliaGPU/cuTile.jl. It supports a broad set of tile operations such as memory access, arithmetic, reductions, scans, matrix multiply, shape manipulation, and atomics. It also includes working examples for vector addition, matrix multiplication, transpose, batch matrix multiply, layer normalization, and FFT.
That said, this is early-stage software, and:
Not all cuTile features are implemented.
Some Julia language features (notably iterator-based for loops) either aren’t supported in kernels or generate inefficient code.
The integration with CUDA.jl needs to improve to facilitate coexistence with SIMT kernels.
APIs may change without notice.
The project builds on Julia’s existing GPU ecosystem, integrating with CUDA.jl for array management and kernel launching. Users who are already writing GPU code in Julia with CUDA.jl will find the transition to tile-based programming straightforward.
Getting started
Just like cuTile Python, cuTile.jl requires an NVIDIA Ada, NVIDIA Ampere or NVIDIA Blackwell GPU and an NVIDIA driver for CUDA 13.1 or higher. The package also requires Julia 1.11 or higher.
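The two version requirements can be written down as a small check. This is a sketch only: `meets_requirements` is a hypothetical helper, the thresholds are copied from the paragraph above, and querying the CUDA version your driver supports is left to the caller.

```julia
# Thresholds taken from the requirements stated above.
const MIN_JULIA = v"1.11"
const MIN_CUDA_DRIVER = v"13.1"

# Hypothetical helper: true when both version requirements are met.
meets_requirements(julia, cuda_driver) =
    julia >= MIN_JULIA && cuda_driver >= MIN_CUDA_DRIVER

# VERSION is the running Julia version; the second argument is a placeholder
# for the CUDA version supported by the installed driver.
ok = meets_requirements(VERSION, v"13.1")
```

The check says nothing about the GPU architecture requirement, which still has to be verified separately.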
Launch Julia, press ] at the REPL to enter the integrated package manager, and install cuTile.jl:
    pkg> add cuTile
    pkg> # if you want, run the test suite
    pkg> test cuTile
The GitHub repository contains a full list of supported operations and detailed documentation on how cuTile.jl differs from both cuTile Python and standard Julia.