MarkTechPost·2026年6月22日 16:13·約11分で読める

MoonMath AI、AMD MI300X向けHIPアテンションカーネルをオープンソース化し、AIter v3を上回る性能を発揮

#LLM インフラ #GPU 最適化 #AMD MI300X #HIP #オープンソース

TL;DR

MoonMath AI が AMD MI300X GPU 向けに HIP ベースのオープンソースアテンションカーネルを公開し、AMD 公式の AITER v3 を全形状で上回る性能と数値精度を実現した。

AI深層分析2026年6月22日 17:02

重要/ 5段階

深度40%

キーポイント

AMD 公式カーネルを上回る性能達成

MoonMath AI の新カーネルは AMD MI300X 上で、同社が提供する最適化済みカーネル「AITER v3」をあらゆる形状と丸めモードで上回り、幾何平均で最大 1.26 倍の高速化を実現した。

HIP とアセンブリのハイブリッド手法

完全な手書きアセンブリではなく HIP で記述しつつ、ワンインストラクションのアセンブリラッパーを使用することで、コンパイラによるレジスタ割り当てを維持しながら命令制御を細かく行う独自の技術を採用した。

厳密な数値精度と互換性

丸めモードや NaN/Inf 処理において AITER と完全に一致し、結果はビットレベルで同一かつ決定論的であり、モデルの品質低下なしに高速化が可能である。

実環境での検証済み

SGLang の PR に採用され、Wan2.1 ビデオ拡散モデルの実行速度を 1.23 倍向上させるなど、実際のワークロードで効果を実証している。

8 ウェーブによるパイプライン最適化

CDNA3 アーキテクチャ上で 2 グループのウェーブを位相オフセットさせて実行し、行列計算とソフトマックス/メモリアクセスを重畳させることで、コアのアイドル時間を排除しています。

16×16×16 MFMA と L1 キャッシュ活用

同じスループットを持つ 32×32×8 よりも VGPR 圧力が低い 16×16×16 タイルを採用し、V 行列を L1 に常駐させることで LDS の空き領域を確保し、データ再利用率を高めています。

フラクションラウンドの解決とベンチマーク結果

Flash-Decoding 風の KV スプリットで残りの計算を処理し、RTZ などの厳密な丸めモードでも AITER v3 を上回る性能（最大 1.59 倍）を実現しました。

影響分析・編集コメントを表示

影響分析

この成果は、AMD のデータセンター GPU における AI 推論・学習のボトルネックを解消する重要な進歩であり、ベンダー非依存かつオープンソースな高性能カーネルが公式ライブラリを上回る可能性を示した。特に HIP ライティングによる柔軟性と、数値精度を保証した上で大幅な速度向上を実現した点は、大規模モデルの実装やクラウドプロバイダにおけるコスト削減に直結する実用的な価値を持つ。

編集コメント

AMD の公式最適化カーネルを、オープンソースコミュニティが凌駕した事例は極めて稀であり、HIP を活用した巧妙な実装手法が今後の GPU パフォーマンス競争の新たな基準となり得る。

MoonMath AI チームは、AMD の MI300X GPU 向けの bf16 フォワードアテンションカーネルをリリースしました。これは手書きのアセンブリではなく、HIP で記述されています。コードは MIT ライセンスの下でオープンソース化されています。MoonMath.ai チームによると、このカーネルは AMD 独自の最適化カーネルである AITER v3 を、テストされたすべての形状において上回っています。ベアメタルへのアクセス権は、AMD のクラウドプロバイダーである HotAisle から得られました。

アテンションとは、すべてのトランスフォーマー内部で実行される fused softmax(QKᵀ/√d)·V 演算のことです。MI300X は AMD の CDNA3 データセンター向け GPU で、ISA ターゲットは gfx942 です。このカーネルはこのハードウェア上でのみ動作します。

TL;DR

MoonMath.ai は、AMD MI300X 向けの bf16 フォワードアテンションカーネルをオープンソース化しました。これはアセンブリではなく HIP で記述されており（MIT ライセンス）。

すべての形状と丸めモードにおいて AMD の AITER v3 を上回り、幾何平均でそれぞれ 1.18 倍/1.15 倍/1.08 倍、最大で 1.26 倍の性能向上を達成しました。

核心的な工夫は、1 つの命令のアセンブリラッパーを使用することで、オペコードを選択しつつコンパイラーがレジスタ割り当てを行う点にあります。

速度向上の大部分はメモリの配置によるものです。K を LDS に、V を L1 にホットに配置し、Q とアキュムレータをレジスターに保持します。

実際の SGLang の PR ではこれを用いて Wan2.1 動画拡散モデルの処理速度を 1.23 倍向上させましたが、品質の低下はありませんでした。

カーネルの理解

カーネルとは、GPU の多数のコア上で直接実行され、特定の計算（ここではアテンション演算）をハードウェアが許す限り最速で行う小さなプログラムのことです。このカーネルは MI300X 上でのみ bf16 でフォワードアテンションを計算します。入力は BSHD または BHSD レイアウトのいずれかで受け付けられ、トランスポーズは行われません。ヘッド次元は固定で 128 です。すべてのシーケンス長に対応し、クロスアテンションもサポートしています。

実際には明確な限界があります。因果マスクも、GQA（グループ化クエリアテンション）も、可変長バッチ処理もサポートされていません。出力はbf16形式であり、gfx942ハードウェアでのみ動作します。

数値精度は厳密に制御されています。3つの丸めモードすべてがAITERの各モードごとの丸めルールと一致しています。すべての有限な出力値は、AITERに対して1 bf16 ULP（単位最終桁）以内の範囲内に収まります。NaNおよびInfの処理もビットレベルで完全に同一であり、結果は決定論的です。

核心となる技術：1命令アセンブリラッパー

この核心的な手法は、よくあるジレンマを回避します。コンパイラーのインストリンシック関数はコードを整理しますが、オペランドの順序入れ替えや名前変更を許容してしまいます。一方、生のインラインアセンブリは制御権を与えますが、手動でのレジスタおよびアドレス管理を強制することになります。

MoonMathは、__device__ __forceinline__ 関数内で正確に1命令をラップします。拡張されたアセンブリ制約式でオペランドを記述し、研究チームが Opcode（命令コード）を選択します。それでもコンパイラーはレジスタの割り当てとデータフローの追跡を担当します。

Copy CodeCopiedUse a different Browser

// in/out が同じ VGPR に紐付く → アキュムレータの名前変更なし、v_mov コピーなし。

__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) {

asm volatile("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"

: "+v"(c) : "v"(a), "v"(b));

}

"+v"(c) 制約により、アキュムレータの入力と出力が同じ VGPR（ベクトル汎用レジスタ）に紐付けられます。コピー命令は生成されません。これによりカーネルは通常の HIP に近い状態を維持しつつ、依然としてマシンに対して1命令ずつ制御を及ぼします。

アーキテクチャ：8つのウェーブ、2つのグループ、2つのバリア

CDNA3 の計算ユニットには 4 つの SIMD ユニットがあります。教科書的なブロック構成では 4 つのウェーブレット（wave）を使用しますが、MoonMath では各ブロックあたり 8 つのウェーブレットを、2 グループ（各 4 つ）で実行します。

この 2 つのグループは同じ Q*K、softmax、O += P*V のシーケンスを実行しますが、位相がずれています。一方のグループが行列コアを飽和させている間、もう一方は softmax を実行し、ロードを発行します。その後、役割を交代させるため、行列コアがアイドル状態になることはありません。

各イテレーションには 2 つの s_barrier（同期バリア）があります。1 つは位相の手渡し地点に配置され、もう 1 つはイテレーション境界に配置されます。残りの同期処理は、カウンターごとの待機によって処理されます。

これは FlashAttention-3 の行列乗算と softmax の交互実行を踏襲したものです。ただし、FA3 のプロデューサーとコンシューマーのウェープレット分割のコピーは行いません。CDNA3 ではすべてのメモリアクセスが非同期であるため、専用のプロデューサー用ウェーブレットは不要です。

データはどこにあり、なぜ 16×16×16 か

速度向上の大部分はメモリの配置によるものです。K は HBM から LDS（Local Data Store）へストリーミングされ、ダブルバッファ化されて全 8 つのウェーブレットで共有されます。V は L1 キャッシュにホット状態を維持し、すべての PV 行列乗算時に読み出されます。Q とアキュムレータはレジスタ上に配置されます。

研究チームは、32×32×8 の形状ではなく、16×16×16 の MFMA（Multi-Fused Matrix Multiply Accumulate）を選択しました。両方の形状はスループットが同等ですが、より小さなタイルでは各レーンあたり 4 つの fp32 要素にアキュムレートされ、16 個に対して圧力が低くなります。アキュムレータへの負荷が軽減されることで、より深いプリフェッチと第 3 の Q タイルを収容する余地が生まれます。

DecisionChoiceReason（決定・選択・理由）

Waves per block: 8 (2 グループ×4) — パイプラインを直接計画し、K のコピーを 1 つ共有

MFMA shape: 16×16×16 bf16 — 同じスループット、VGPR（Vector General Purpose Register）への負荷が低く、電力効率が優れる

K の配置は LDS で、ダブルバッファ化され、8 つのウェーブ全体で共有され、イテレーションごとにスワップされます。

V の配置は L1 にあり、レジデントかつプリフェッチされており、PV Across で再読まれ、意図的にホット状態が維持されます。

Q とアキュムレータは VGPR に配置され、毎回のイテレーションで読み込まれ、再ロードされることはありません。

2 つの後の勝利が差を縮めます。3 つ目の Q タイル（3Q）により、読み込まれた K および V タイルあたりのデータ再利用性が向上します。Flash-Decoding スタイルの尾部 KV スプリットは、MI300X の 304 CUs に残された分数ラウンドを救済します。これらの勝利が連鎖します。V を L1 へ移動させることで LDS が解放され、その LDS を 3 つ目の Q タイルが埋めることになります。

ベンチマーク

テストは MI300X 上で bf16（半精度浮動小数点）、ヘッド次元 128 で実行されました。各形状は 3 つの丸めモードで測定されています。RTNE は最も近い偶数へ丸めます。RTNA は最も近い値へ丸め、同点の場合はゼロから遠ざけます。RTZ はゼロ方向に切り捨てます。

(2, 24, 8192, 128) | RTNE | 3.083 | 3.792 | 1.23× | 1.37×

(2, 24, 16384, 128) | RTNE | 11.670 | 14.691 | 1.26× | 1.54×

(4, 16, 16384, 128) | RTZ | 15.055 | 16.183 | 1.07× | 1.47×

(2, 24, 32768, 128) | RTNA | 44.440 | 52.363 | 1.18× | 1.57×

(1, 16, 131072, 128) | RTNE | 232.517 | 269.278 | 1.16× | 1.46×

スウィープ全体での幾何平均は MoonMath を支持しています。AITER と比較すると、本手法は 1.18 倍（RTNE）、1.15 倍（RTNA）、1.08 倍（RTZ）のスコアを記録します。Modular MAX と比較すると、幾何平均は 1.44 倍から 1.49 倍となり、形状ごとの高速化率は最大 1.59 倍に達します。

RTZ は AITER の最も高速なモードであり、最も熾烈な競争です。(4, 16, 16384) の RTZ 形状は 0.95 倍から 1.07 倍へと改善されました。この最終的な差を埋めたのは尾部 KV スプリットです。

インタラクティブ解説

(function(){

window.addEventListener("message",function(e){

if(e&&e.data&&e.data.type==="mm-cdna3-height"){

var f=document.getElementById("mm-cdna3-frame");

if(f&&e.data.height){f.style.height=e.data.height+"px";}

}

});

})();

Use Cases

このカーネルは pip でインストールされ、小さな API を公開します。呼び出し元のストリーム上で起動するため、より大きなパイプライン内でオーバーラップさせることが可能です。

Copy CodeCopiedUse a different Browser

import torch

import moonmath_attention as ma

PyTorch の ROCm ビルドでは AMD GPU に対して「cuda」デバイス文字列を使用します

q = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")

k = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")

v = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")

out = ma.forward(q, k, v, layout="bshd")

out_rtz = ma.forward(q, k, v, layout="bshd", round_mode="rtz")

具体的なユースケースの一つに動画拡散モデルがあります。チームは LiteAttention のサポートを追加し、SGLang diffusion に対して PR を送信しました。Wan2.1-T2V-1.3B-Diffusers において、アテンションを AITER から liteattention_rocm に切り替えました。MI300X 上でのエンドツーエンドの生成性能は 1.23 倍向上し、視覚的な品質の低下は見られませんでした。

BSHD レイアウトは拡散テンソルに直接適合します。クロスアテンションは任意の KV 長で動作し、パディングも不要です。

Key Takeaways

このカーネルは MI300X 向けの bf16 フォワードアテンションであり、HIP で記述され MIT ライセンスの下で公開されています。

あらゆる形状と丸めモードにおいて AITER v3 を上回り、幾何平均ではそれぞれ 1.18 倍/1.15 倍/1.08 倍の性能を示します。

1 つの命令からなるアセンブリラッパーは、オペコード制御を提供しつつ、コンパイラがレジスタを割り当てます。

メモリの配置が性能向上の大部分を担いました：K は LDS に、V は L1 でホット状態に、Q はレジスタ内に配置されています。

SGLang の公式 PR により、Wan2.1 の動画拡散モデルの処理速度が 1.23 倍に向上し、品質の低下もありませんでした。

技術詳細をご覧ください。また、Twitter でフォローしていただくことも歓迎です。さらに、150,000 人以上の ML サブレッドに参加したり、ニュースレターを購読することも忘れずにお願いします。待ってください！Telegram をご利用ですか？今なら Telegram でも私たちに参加できます。

GitHub リポジトリや Hugging Face ページ、製品リリース、ウェビナーなどのプロモーションでパートナーシップをご検討の場合は、ぜひご連絡ください。

本記事「MoonMath AI が HIP アテンションカーネルをオープンソース化し、あらゆる形状と丸めモードにおいて AITER v3 を上回る性能を発揮（AMD MI300X 向け）」は、MarkTechPost で最初に公開されました。

原文を表示

MoonMath AI team has released a bf16 forward attention kernel for AMD’s MI300X GPU. It is written in HIP, not hand-written assembly. The code is open-source under the MIT license. The MoonMath.ai team reports it beats AITER v3, AMD’s own optimized kernel, on every tested shape. Bare-metal access came from HotAisle, an AMD cloud provider.

Attention is the fused softmax(QKᵀ/√d)·V operation inside every transformer. The MI300X is AMD’s CDNA3 data-center GPU, with the ISA target (gfx942). This kernel runs on that hardware only.

TL;DR

MoonMath.ai open-sources a bf16 forward attention kernel for AMD MI300X, written in HIP, not assembly (MIT).

It beats AMD’s AITER v3 on every shape and rounding mode — geomean 1.18×/1.15×/1.08×, up to 1.26×.

The core trick: one-instruction asm wrappers let you pick the opcode while the compiler allocates registers.

Most of the speedup is memory placement — K in LDS, V hot in L1, Q and accumulators in registers.

A real SGLang PR used it to speed up Wan2.1 video diffusion by 1.23×, with no quality regression.

Understanding Kernel

A kernel is a small program that runs directly on the GPU’s many cores to perform one specific computation—here, the attention math—as fast as the hardware allows. The kernel computes forward attention in bf16 on MI300X only. It takes inputs in either BSHD or BHSD layout, with no transpose. Head dimension is fixed at 128. It supports any sequence length, including cross-attention.

There are real limits. There is no causal mask, no GQA, and no varlen batching. Outputs are bf16, and it runs on gfx942 hardware exclusively.

Numerics are tightly controlled. All three rounding modes match AITER’s per-mode rounding rule. Every finite output sits within 1 bf16 ULP of AITER. NaN and Inf handling is bit-identical, and results are deterministic.

The Core Trick: One-Instruction asm Wrappers

The core technique avoids a familiar dilemma. Compiler intrinsics keep code tidy but let the compiler reorder or rename operands. Raw inline assembly gives control but forces manual register and address management.

MoonMath wraps exactly one instruction in a __device__ __forceinline__ function. Extended asm constraints describe the operands. The research team picks the opcode. The compiler still allocates registers and tracks data flow.

Copy CodeCopiedUse a different Browser

// in/out tied to the SAME VGPR → no accumulator rename, no v_mov copy.

__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) {

asm volatile("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"

: "+v"(c) : "v"(a), "v"(b));

}

The "+v"(c) constraint ties the accumulator input and output to the same VGPR. No copy instruction is emitted. This keeps the kernel close to ordinary HIP. It still steers the machine one instruction at a time.

The Architecture: Eight Waves, Two Groups, Two Barriers

A CDNA3 compute unit has four SIMD units. The textbook block is four waves. MoonMath instead runs eight waves per block, in two groups of four.

The two groups run the same Q*K, softmax, O += P*V sequence. They are offset by a phase. While one group saturates the matrix core, the other runs softmax and issues loads. Then they swap, so the matrix core never idles.

There are two s_barriers per iteration. One sits at the phase handoff. One sits at the iteration boundary. Per-counter waits handle the rest of the synchronization.

This echoes FlashAttention-3’s matmul and softmax alternation. It does not copy FA3’s producer and consumer warp split. On CDNA3, every memory move is already asynchronous, so a dedicated producer wave is unnecessary.

Where Data Lives, and Why 16×16×16

Most of the speedup comes from memory placement. K streams from HBM into LDS, double-buffered, shared by all eight waves. V stays hot in L1, read on every PV matmul. Q and accumulators live in registers.

The research team picked the 16×16×16 MFMA over 32×32×8. Both shapes have identical throughput. The smaller tile accumulates into 4 fp32 elements per lane, against 16. Lower accumulator pressure leaves room for deeper prefetch and a third Q tile.

DecisionChoiceReason

Waves per block8 (two groups of 4)Plan the pipeline directly; share one K copy

MFMA shape16×16×16 bf16Same throughput, lower VGPR pressure, better power efficiency

K placementLDS, double-buffered, 32 KiBShared by all 8 waves, swapped per iteration

V placementL1, resident, prefetchedReread across PV, kept hot deliberately

Q + accumulatorsVGPRsRead every iteration, never reloaded

Two later wins close the gap. A third Q tile (3Q) raises data reuse per loaded K and V tile. A Flash-Decoding-style tail KV split rescues the stranded fractional round across MI300X’s 304 CUs. These wins cascade. Moving V to L1 freed the LDS that the third Q tile then fills.

Benchmark

Tests ran on MI300X in bf16, head dimension 128. Each shape was measured at three rounding modes. RTNE rounds to nearest even. RTNA rounds to nearest, ties away from zero. RTZ truncates toward zero.

Shape (B, H, S, D)RoundOurs (ms)AITER v3 (ms)vs AITERvs MAX

(2, 24, 8192, 128)RTNE3.0833.7921.23×1.37×

(2, 24, 16384, 128)RTNE11.67014.6911.26×1.54×

(4, 16, 16384, 128)RTZ15.05516.1831.07×1.47×

(2, 24, 32768, 128)RTNA44.44052.3631.18×1.57×

(1, 16, 131072, 128)RTNE232.517269.2781.16×1.46×

Geomeans across the sweep favor MoonMath. Versus AITER, it scores 1.18× (RTNE), 1.15× (RTNA), and 1.08× (RTZ). Versus Modular MAX, geomeans run 1.44× to 1.49×, and per-shape speedups reach 1.59×.

RTZ is AITER’s own fastest mode and the tightest race. The (4, 16, 16384) RTZ shape moved from 0.95× to 1.07×. The tail KV split is what closed that final gap.

Interactive Explainer

(function(){

window.addEventListener("message",function(e){

if(e&&e.data&&e.data.type==="mm-cdna3-height"){

var f=document.getElementById("mm-cdna3-frame");

if(f&&e.data.height){f.style.height=e.data.height+"px";}

}

});

})();

Use Cases

The kernel installs with pip and exposes a small API. It launches on the caller’s stream, so it overlaps inside larger pipelines.

Copy CodeCopiedUse a different Browser

import torch

import moonmath_attention as ma

PyTorch's ROCm build uses the "cuda" device string on AMD GPUs

q = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")

k = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")

v = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")

out = ma.forward(q, k, v, layout="bshd")

out_rtz = ma.forward(q, k, v, layout="bshd", round_mode="rtz")

One concrete use case is video diffusion. The team added LiteAttention support and sent a PR to SGLang diffusion. On Wan2.1-T2V-1.3B-Diffusers, they switched attention from AITER to liteattention_rocm. End-to-end generation improved by 1.23× on MI300X, with no visible quality regression.

The BSHD layout suits diffusion tensors directly. Cross-attention works with any KV length and no padding.

Key Takeaways

The kernel is bf16 forward attention for MI300X, written in HIP under MIT.

It beats AITER v3 on every shape and rounding mode, geomean 1.18×/1.15×/1.08×.

One-instruction asm wrappers give opcode control while the compiler allocates registers.

Memory placement drove most of the gain: K in LDS, V hot in L1, Q in registers.

A real SGLang PR sped up Wan2.1 video diffusion by 1.23× with no quality regression.

Check out the Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

The post MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode appeared first on MarkTechPost.

この記事をシェア

MarkTechPost★42026年6月15日 18:16

Flash-KMeans の紹介：GPU で FAISS よりも 200 倍以上高速に動作する IO 対応型 K-Means アルゴリズム

カリフォルニア大学バークレー校とオースティン大学の研究チームは、現代の AI パイプライン向けに設計されたオープンソースライブラリ「Flash-KMeans」を発表した。この新技術は計算式を変更せず IO 効率を最適化することで、GPU 上で FAISS よりも 200 倍以上高速な K-Means 処理を実現する。

Smol AI News★42026年4月28日 14:44

本日は特に目立った出来事なし

AIニュース配信「AINews」は、2026年4月27日から28日にかけて12のサブレッドと544件のツイートを調査したが、特筆すべき技術進展や業界動向は見られず、静かな一日であったと報告している。

MarkTechPost★32026年6月24日 18:36

Graphify と NetworkX を用いた Python コードベース構造の可視化：ゴッドノード、コミュニティ、アーキテクチャ図の作成

MarkTechPost は、Graphify ツールと NetworkX ライブラリを使用して、Python アプリケーションを知識グラフに変換し、オフラインでコード構造を可視化するチュートリアルを紹介している。

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

MarkTechPost·2026年6月22日 16:13·約11分で読める

MoonMath AI、AMD MI300X向けHIPアテンションカーネルをオープンソース化し、AIter v3を上回る性能を発揮

#LLM インフラ #GPU 最適化 #AMD MI300X #HIP #オープンソース

TL;DR

AI深層分析2026年6月22日 17:02

重要/ 5段階

深度40%

キーポイント

AMD 公式カーネルを上回る性能達成

HIP とアセンブリのハイブリッド手法

厳密な数値精度と互換性

実環境での検証済み

SGLang の PR に採用され、Wan2.1 ビデオ拡散モデルの実行速度を 1.23 倍向上させるなど、実際のワークロードで効果を実証している。

8 ウェーブによるパイプライン最適化

16×16×16 MFMA と L1 キャッシュ活用

フラクションラウンドの解決とベンチマーク結果

Flash-Decoding 風の KV スプリットで残りの計算を処理し、RTZ などの厳密な丸めモードでも AITER v3 を上回る性能（最大 1.59 倍）を実現しました。

影響分析・編集コメントを表示

影響分析

編集コメント

TL;DR

すべての形状と丸めモードにおいて AMD の AITER v3 を上回り、幾何平均でそれぞれ 1.18 倍/1.15 倍/1.08 倍、最大で 1.26 倍の性能向上を達成しました。

速度向上の大部分はメモリの配置によるものです。K を LDS に、V を L1 にホットに配置し、Q とアキュムレータをレジスターに保持します。

実際の SGLang の PR ではこれを用いて Wan2.1 動画拡散モデルの処理速度を 1.23 倍向上させましたが、品質の低下はありませんでした。

カーネルの理解

核心となる技術：1命令アセンブリラッパー

Copy CodeCopiedUse a different Browser

// in/out が同じ VGPR に紐付く → アキュムレータの名前変更なし、v_mov コピーなし。

__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) {

asm volatile("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"

: "+v"(c) : "v"(a), "v"(b));

}

アーキテクチャ：8つのウェーブ、2つのグループ、2つのバリア

データはどこにあり、なぜ 16×16×16 か

DecisionChoiceReason（決定・選択・理由）

Waves per block: 8 (2 グループ×4) — パイプラインを直接計画し、K のコピーを 1 つ共有

MFMA shape: 16×16×16 bf16 — 同じスループット、VGPR（Vector General Purpose Register）への負荷が低く、電力効率が優れる

K の配置は LDS で、ダブルバッファ化され、8 つのウェーブ全体で共有され、イテレーションごとにスワップされます。

V の配置は L1 にあり、レジデントかつプリフェッチされており、PV Across で再読まれ、意図的にホット状態が維持されます。

Q とアキュムレータは VGPR に配置され、毎回のイテレーションで読み込まれ、再ロードされることはありません。

ベンチマーク

(2, 24, 8192, 128) | RTNE | 3.083 | 3.792 | 1.23× | 1.37×

(2, 24, 16384, 128) | RTNE | 11.670 | 14.691 | 1.26× | 1.54×

(4, 16, 16384, 128) | RTZ | 15.055 | 16.183 | 1.07× | 1.47×

(2, 24, 32768, 128) | RTNA | 44.440 | 52.363 | 1.18× | 1.57×

(1, 16, 131072, 128) | RTNE | 232.517 | 269.278 | 1.16× | 1.46×

インタラクティブ解説

(function(){

window.addEventListener("message",function(e){

if(e&&e.data&&e.data.type==="mm-cdna3-height"){

var f=document.getElementById("mm-cdna3-frame");

if(f&&e.data.height){f.style.height=e.data.height+"px";}

}

});

})();

Use Cases

Copy CodeCopiedUse a different Browser

import torch

import moonmath_attention as ma

PyTorch の ROCm ビルドでは AMD GPU に対して「cuda」デバイス文字列を使用します

q = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")

k = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")

v = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")

out = ma.forward(q, k, v, layout="bshd")

out_rtz = ma.forward(q, k, v, layout="bshd", round_mode="rtz")

BSHD レイアウトは拡散テンソルに直接適合します。クロスアテンションは任意の KV 長で動作し、パディングも不要です。

Key Takeaways

このカーネルは MI300X 向けの bf16 フォワードアテンションであり、HIP で記述され MIT ライセンスの下で公開されています。

あらゆる形状と丸めモードにおいて AITER v3 を上回り、幾何平均ではそれぞれ 1.18 倍/1.15 倍/1.08 倍の性能を示します。

1 つの命令からなるアセンブリラッパーは、オペコード制御を提供しつつ、コンパイラがレジスタを割り当てます。

メモリの配置が性能向上の大部分を担いました：K は LDS に、V は L1 でホット状態に、Q はレジスタ内に配置されています。

SGLang の公式 PR により、Wan2.1 の動画拡散モデルの処理速度が 1.23 倍に向上し、品質の低下もありませんでした。

原文を表示

Attention is the fused softmax(QKᵀ/√d)·V operation inside every transformer. The MI300X is AMD’s CDNA3 data-center GPU, with the ISA target (gfx942). This kernel runs on that hardware only.

TL;DR

MoonMath.ai open-sources a bf16 forward attention kernel for AMD MI300X, written in HIP, not assembly (MIT).

It beats AMD’s AITER v3 on every shape and rounding mode — geomean 1.18×/1.15×/1.08×, up to 1.26×.

The core trick: one-instruction asm wrappers let you pick the opcode while the compiler allocates registers.

Most of the speedup is memory placement — K in LDS, V hot in L1, Q and accumulators in registers.

A real SGLang PR used it to speed up Wan2.1 video diffusion by 1.23×, with no quality regression.

Understanding Kernel

There are real limits. There is no causal mask, no GQA, and no varlen batching. Outputs are bf16, and it runs on gfx942 hardware exclusively.

The Core Trick: One-Instruction asm Wrappers

Copy CodeCopiedUse a different Browser

// in/out tied to the SAME VGPR → no accumulator rename, no v_mov copy.

__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) {

asm volatile("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"

: "+v"(c) : "v"(a), "v"(b));

}

The Architecture: Eight Waves, Two Groups, Two Barriers

A CDNA3 compute unit has four SIMD units. The textbook block is four waves. MoonMath instead runs eight waves per block, in two groups of four.

There are two s_barriers per iteration. One sits at the phase handoff. One sits at the iteration boundary. Per-counter waits handle the rest of the synchronization.

Where Data Lives, and Why 16×16×16

DecisionChoiceReason

Waves per block8 (two groups of 4)Plan the pipeline directly; share one K copy

MFMA shape16×16×16 bf16Same throughput, lower VGPR pressure, better power efficiency

K placementLDS, double-buffered, 32 KiBShared by all 8 waves, swapped per iteration

V placementL1, resident, prefetchedReread across PV, kept hot deliberately

Q + accumulatorsVGPRsRead every iteration, never reloaded

Benchmark

Tests ran on MI300X in bf16, head dimension 128. Each shape was measured at three rounding modes. RTNE rounds to nearest even. RTNA rounds to nearest, ties away from zero. RTZ truncates toward zero.

Shape (B, H, S, D)RoundOurs (ms)AITER v3 (ms)vs AITERvs MAX

(2, 24, 8192, 128)RTNE3.0833.7921.23×1.37×

(2, 24, 16384, 128)RTNE11.67014.6911.26×1.54×

(4, 16, 16384, 128)RTZ15.05516.1831.07×1.47×

(2, 24, 32768, 128)RTNA44.44052.3631.18×1.57×

(1, 16, 131072, 128)RTNE232.517269.2781.16×1.46×

RTZ is AITER’s own fastest mode and the tightest race. The (4, 16, 16384) RTZ shape moved from 0.95× to 1.07×. The tail KV split is what closed that final gap.

Interactive Explainer

(function(){

window.addEventListener("message",function(e){

if(e&&e.data&&e.data.type==="mm-cdna3-height"){

var f=document.getElementById("mm-cdna3-frame");

if(f&&e.data.height){f.style.height=e.data.height+"px";}

}

});

})();

Use Cases

The kernel installs with pip and exposes a small API. It launches on the caller’s stream, so it overlaps inside larger pipelines.

Copy CodeCopiedUse a different Browser

import torch

import moonmath_attention as ma

PyTorch's ROCm build uses the "cuda" device string on AMD GPUs

q = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")

k = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")

v = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")

out = ma.forward(q, k, v, layout="bshd")

out_rtz = ma.forward(q, k, v, layout="bshd", round_mode="rtz")

The BSHD layout suits diffusion tensors directly. Cross-attention works with any KV length and no padding.

Key Takeaways

The kernel is bf16 forward attention for MI300X, written in HIP under MIT.

It beats AITER v3 on every shape and rounding mode, geomean 1.18×/1.15×/1.08×.

One-instruction asm wrappers give opcode control while the compiler allocates registers.

Memory placement drove most of the gain: K in LDS, V hot in L1, Q in registers.

A real SGLang PR sped up Wan2.1 video diffusion by 1.23× with no quality regression.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

The post MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode appeared first on MarkTechPost.

この記事をシェア

MarkTechPost★42026年6月15日 18:16

Flash-KMeans の紹介：GPU で FAISS よりも 200 倍以上高速に動作する IO 対応型 K-Means アルゴリズム

Smol AI News★42026年4月28日 14:44

本日は特に目立った出来事なし

MarkTechPost★32026年6月24日 18:36

Graphify と NetworkX を用いた Python コードベース構造の可視化：ゴッドノード、コミュニティ、アーキテクチャ図の作成

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む