NVIDIA Developer Blog · June 16, 2026 01:45 · about 12 min read

Boosting MoE Training Throughput with Advanced Fused Kernels

#LLM #MoE #GPU #NVIDIA #Deep Learning Optimization

TL;DR

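NVIDIA has announced a technique that substantially improves training throughput for mixture-of-experts (MoE) models by introducing advanced fused kernels.

AI deep analysis · June 16, 2026 02:01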

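Key points

Optimization through fused kernels

NVIDIA redesigned the conventional processing flow, merging multiple operations into single fused kernels to reduce overhead.

Higher MoE training throughput

The fused kernels improve compute efficiency and cut memory traffic in MoE training, which makes it possible to shorten training time for large models.

Making full use of the hardware

The technique is designed around the characteristics of NVIDIA's latest GPU architecture and aims to bring achieved performance closer to the theoretical peak.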

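Impact analysis

For next-generation large language models built on the MoE architecture, this announcement eases compute constraints and should speed up the development of more complex, higher-performing models. Shorter training times in particular speed up research and development cycles and are likely to raise the pace of innovation across the AI industry.

Editorial comment

Efficient MoE training is one of the biggest challenges in AI development today, and the concrete optimization techniques NVIDIA provides are immediately useful to practitioners.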

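Mixture-of-experts (MoE) models have quickly become a foundational component of modern, large-scale AI systems. They are widely adopted because they allow substantially larger model capacity while activating only a subset of parameters for each token, which makes them an effective way to scale performance within a practical compute budget. As models continue to grow, optimizing these blocks becomes critical for maximizing training throughput.

To push these boundaries, we are introducing advanced fused MLP kernels for dense and MoE models, custom-built with the CuTe DSL (domain-specific language). By tackling inherent memory and synchronization bottlenecks, these new kernels deliver a 1.3x to 2x kernel-level speedup over unfused paths while enabling sync-free MoE execution for full-iteration CUDA Graphs.

In NVIDIA's full-stack DeepSeek-V3 pre-training setup, this optimization contributes an 8% end-to-end performance improvement. Similarly, in the GPT-OSS pre-training setup, it contributes a 93% end-to-end performance improvement. Whether you want to cut training time or improve hardware utilization, these kernels are available today in cuDNN Frontend and can be accessed through Transformer Engine and Megatron-Core.

To understand how, we walk through the three biggest bottlenecks in modern MoE blocks and show how we re-engineered the stack through hardware-aware software codesign to keep Tensor Cores continuously fed.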

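Overcoming training bottlenecks in the MoE block

To maximize the throughput of MoE models, we first had to map out exactly where compute cycles are spent. When we profiled the execution timeline of a standard training iteration within the MoE block, three system-level bottlenecks stood out:

Activation bottlenecks: Activation functions typically result in memory-bound kernels and large tensor read/write operations, leaving Tensor Cores underutilized during these intervals.

CPU boundedness/overhead: With routed experts, the number of tokens per expert is determined at run time and is typically computed on the CPU. If the CPU cannot keep up with the GPU, the CPU operations become exposed. This calls for kernels that need no CPU synchronization or intervention.

Quantization cost: Just like activation functions, quantizing tensors from high precision to lower precision results in memory-bound kernels that leave the Tensor Cores idle.

We address these challenges by redesigning the MoE block around custom kernels written in CuTe DSL, and introduce a family of three kernels written for sync-free MoE: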

GroupGemm + Quantize

GroupGemm + Activation + Quantize/Transpose

GroupGemm + dActivation + Quantize/Transpose

The supported activation functions are SwiGLU, GeGLU, and sReLU, with optional clamping and scaling.

Figure 1. Fusing operations into a single custom kernel in the forward and backward pass with CuTe DSL

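Optimizing GLU activation functions via fused GEMM epilogues

Gated linear functions have become very popular recently, and most modern models use some variant of the Gated Linear Unit (GLU) activation function, such as SwiGLU or GeGLU. These activation functions split the output of the FC1 layer into chunks and combine them to produce the final GLU output. We implement a fused kernel that merges the GEMM with the corresponding GLU operation in both the forward and backward pass.

GLU activation functions aren't trivial to fuse within the GEMM epilogue, because the GLU needs access to two different chunks of the tensor: input and gate. Typically, these two chunks are computed by different thread blocks, so to combine the two outputs the kernel must first write both to global memory. To make the fusion possible, we repack the weights into columns of input and gate. This ensures that the same thread block has access to both a half tile-width of the input tensor and a half tile-width of the gate tensor, so the input and gate can be combined in the epilogue without a round trip through global memory. The repack can happen before training starts, during checkpoint loading.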
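The repacking idea can be illustrated with a small pure-Python sketch. It is not the CuTe DSL kernel: the helper names (`repack`, `fused_swiglu`), the tile width, and the convention that SiLU is applied to the gate half are all assumptions made for illustration. Each loop iteration stands in for one thread block that owns one output tile and applies the activation locally.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matmul(a, b):
    """Plain-Python product of a (M x K) and b (K x N), as nested lists."""
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for a_row in a]


def repack(w_in, w_gate, half_tile):
    """Interleave half_tile-wide column groups: [in | gate | in | gate | ...]."""
    hidden = len(w_in[0])
    assert hidden % half_tile == 0
    packed = []
    for row_in, row_gate in zip(w_in, w_gate):
        row = []
        for start in range(0, hidden, half_tile):
            row += row_in[start:start + half_tile]
            row += row_gate[start:start + half_tile]
        packed.append(row)
    return packed


def fused_swiglu(x, packed, half_tile):
    """Compute one output tile at a time and apply SwiGLU in the 'epilogue'."""
    tile = 2 * half_tile
    out = [[] for _ in x]
    for start in range(0, len(packed[0]), tile):
        w_tile = [row[start:start + tile] for row in packed]
        acc = matmul(x, w_tile)  # stands in for this tile's accumulators
        for i, acc_row in enumerate(acc):
            # Both halves live in the same tile, so no global-memory round trip.
            inp, gate = acc_row[:half_tile], acc_row[half_tile:]
            out[i] += [v * silu(g) for v, g in zip(inp, gate)]
    return out


def reference_swiglu(x, w_in, w_gate):
    """Unfused reference: two full GEMMs, then an elementwise combine."""
    a, g = matmul(x, w_in), matmul(x, w_gate)
    return [[v * silu(s) for v, s in zip(ra, rg)] for ra, rg in zip(a, g)]


# Hypothetical toy sizes: 2 tokens, 3 input features, 4 hidden columns.
x = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
w_in = [[0.1 * (r + c) for c in range(4)] for r in range(3)]
w_gate = [[0.2 * (r - c) for c in range(4)] for r in range(3)]
fused = fused_swiglu(x, repack(w_in, w_gate, half_tile=2), half_tile=2)
```

In this sketch the fused result matches `reference_swiglu(x, w_in, w_gate)`, and the repack itself is a one-time rearrangement of the weight columns.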

Similarly, in the backward pass, the epilogue reads the GEMM output, computes dSwiGLU, quantizes the result, and writes it back to global memory.
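For reference, the elementwise math behind a dSwiGLU step can be sketched as follows. It assumes the common definition SwiGLU(a, g) = a * g * sigmoid(g), with `a` the input half and `g` the gate half. The function names are illustrative, and quantization is left out. A finite-difference helper is included as a sanity check on the analytic gradients.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swiglu(a, g):
    """Assumed forward definition: input half times SiLU of the gate half."""
    return a * g * sigmoid(g)


def dswiglu(a, g, dy):
    """Gradients of swiglu w.r.t. a and g, given the upstream gradient dy."""
    s = sigmoid(g)
    d_a = dy * g * s                          # d/da [a * silu(g)] = silu(g)
    d_g = dy * a * s * (1.0 + g * (1.0 - s))  # a * d/dg [g * sigmoid(g)]
    return d_a, d_g


def finite_difference(a, g, dy, eps=1e-6):
    """Central-difference estimate of the same gradients."""
    d_a = dy * (swiglu(a + eps, g) - swiglu(a - eps, g)) / (2 * eps)
    d_g = dy * (swiglu(a, g + eps) - swiglu(a, g - eps)) / (2 * eps)
    return d_a, d_g


analytic = dswiglu(0.7, -1.3, 2.0)
numeric = finite_difference(0.7, -1.3, 2.0)
```

Because the gradient needs both `a` and `g` from the forward pass, a fused backward epilogue benefits from the same input/gate co-location as the forward one.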

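Figure 2. The input and gate weights are packed so that the thread block has access to both and can compute the SwiGLU output within the CUDA Core

Notably, these fusion patterns don't just eliminate the reads and writes of intermediate tensors; they also maximize utilization by overlapping any remaining memory operations directly with the GEMM itself.

Beyond core activation functions like SwiGLU, GeGLU, and sReLU, these kernels natively handle fused epilogue operations including feature scaling, tensor clamping, and bias vector addition.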
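A back-of-the-envelope count shows why removing intermediate tensors matters. The sketch below assumes an unfused path made of three separate kernels (GEMM, activation, quantize), BF16 intermediates, an FP8 output plus a transposed FP8 copy, and no scale-factor traffic. The tensor sizes are hypothetical, and the ratio is a traffic estimate, not a predicted speedup.

```python
def activation_path_bytes(tokens, hidden, fused):
    """Global-memory bytes moved between the FC1 GEMM and the quantized output."""
    bf16, fp8 = 2, 1
    fc1_out = tokens * 2 * hidden  # FC1 output elements (input half + gate half)
    act_out = tokens * hidden      # GLU output elements
    quantized = 2 * act_out * fp8  # FP8 output plus its transposed copy
    if fused:
        return quantized           # only the final quantized tensors are written
    return (fc1_out * bf16         # GEMM writes the FC1 output
            + fc1_out * bf16       # activation kernel reads it back
            + act_out * bf16       # activation kernel writes its output
            + act_out * bf16       # quantize kernel reads it
            + quantized)           # quantize kernel writes FP8 + transpose


unfused_bytes = activation_path_bytes(tokens=4096, hidden=2048, fused=False)
fused_bytes = activation_path_bytes(tokens=4096, hidden=2048, fused=True)
print(unfused_bytes, fused_bytes, unfused_bytes / fused_bytes)
# prints: 117440512 16777216 7.0
```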

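Eliminating host-device synchronization and CPU launch overhead

Traditionally, the amount of work a kernel performs is defined by the block count at launch time, which requires shape information to be available on the host. For example, multi-stream grouped GEMM launches N different GEMMs on separate streams, where N is the number of groups. Because the number of tokens per group is determined at run time, the CPU must launch these dynamically sized GEMMs on separate streams to maximize resource utilization.

This leads to two primary issues. First, the number of kernels to launch scales with the number of local experts. Second, a synchronization point is mandatory to retrieve shape information on the host before kernel launch. To address these challenges, the CuTe DSL GroupGEMM kernels track tokens per group in GPU memory itself. This eliminates the CPU dependency during the iteration and enables CUDA Graphs across the entire iteration, effectively removing the CPU bottleneck.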
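The following pure-Python sketch illustrates the data-flow idea only. A single call receives the per-expert token counts as data, derives row offsets with a prefix sum, and lets each row look up its own expert, so the caller never needs the counts to decide how many GEMMs to launch. The fixed-size padded buffer and the zero-filled padding rows are assumptions of this sketch, not a description of the actual kernels.

```python
from bisect import bisect_right
from itertools import accumulate


def grouped_gemm(x, weights, tokens_per_expert):
    """Multiply each token row by the weight matrix of the expert that owns it.

    x holds the tokens sorted by expert, possibly followed by padding rows.
    """
    ends = list(accumulate(tokens_per_expert))  # exclusive end row of each group
    n_out = len(weights[0][0])
    out = []
    for row_idx, row in enumerate(x):
        expert = bisect_right(ends, row_idx)    # first group whose end > row_idx
        if expert == len(weights):              # padding row past the last group
            out.append([0.0] * n_out)
            continue
        w = weights[expert]
        out.append([sum(v * w[k][j] for k, v in enumerate(row))
                    for j in range(n_out)])
    return out


# Three hypothetical experts whose weights are 1x, 2x and 3x the identity.
weights = [[[float(e + 1), 0.0], [0.0, float(e + 1)]] for e in range(3)]
x = [[1.0, 1.0]] * 6           # 5 routed tokens + 1 padding row
tokens_per_expert = [2, 0, 3]  # expert 1 received no tokens this step
result = grouped_gemm(x, weights, tokens_per_expert)
```

Because the buffer shape stays fixed while the counts vary, the same call can be replayed every step, which is the property a captured graph needs.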

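Fusing MXFP8 and NVFP4 quantization to reduce exposed memory overhead

Lower-precision recipes such as MXFP8 and NVFP4 are becoming increasingly popular for pre-training, because they provide a significant speedup with minimal impact on accuracy. In these recipes, the activation function is followed by quantization and a transpose to feed the narrow-precision GEMM.

For MXFP8, the quantization kernel reads the output of the activation function (BF16) and writes the MXFP8 output along with a transposed version of that output for the backward pass. Our newly designed kernels fuse this quantization step into the GEMM kernel itself, eliminating the additional read and write of the BF16 tensor. Similarly, for NVFP4, the kernel produces the BF16 output and the per-tensor amax (maximum absolute value) for the forward pass, and for the backward pass it calculates the amax of the transposed Hadamard rotation of the output. This eliminates the extra memory pass otherwise needed for the per-tensor amax calculation.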
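To make the quantization step concrete, here is a simplified sketch of MX-style block scaling. It relies on two properties of the MXFP8 format: blocks of 32 elements share one power-of-two scale, and E4M3 elements have a maximum magnitude of 448. The rounding rule for the scale exponent is an assumption chosen so that no element overflows. Rounding of individual elements to the E4M3 grid and clamping of the exponent to the E8M0 range are omitted.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3
BLOCK = 32        # elements sharing one scale in MX formats


def block_scale_exponent(block):
    """Smallest power-of-two exponent that brings the block amax within E4M3 range."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0
    return math.ceil(math.log2(amax / E4M3_MAX))


def quantize_row(row):
    """Return (per-block exponents, scaled values) for one row."""
    assert len(row) % BLOCK == 0
    exponents, scaled = [], []
    for start in range(0, len(row), BLOCK):
        block = row[start:start + BLOCK]
        exp = block_scale_exponent(block)
        exponents.append(exp)
        scaled += [v / 2.0 ** exp for v in block]  # would then be cast to E4M3
    return exponents, scaled


# Two hypothetical blocks with very different magnitudes.
row = [float(i) for i in range(32)] + [1000.0 * (-1) ** i for i in range(32)]
exponents, scaled = quantize_row(row)
print(exponents)  # prints: [-3, 2]
```

In an unfused pipeline this logic runs as its own pass over the BF16 tensor. Fusing it into the GEMM epilogue applies it while the values are still on chip.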

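From kernel-level gains to pre-training speedups

Across unit-level microbenchmarks, these fused kernels deliver a substantial speedup, accelerating the forward pass by up to 1.3x and the backward pass by up to 2.1x compared to traditional unfused execution paths.

To translate these speedups into higher end-to-end training throughput, the kernels also support the following features:

Dynamic scheduling: supports efficient overlap with other kernels, such as communication from expert parallelism and data parallelism.

Configurable cluster margin: lets users reserve a configurable share of SM (streaming multiprocessor) resources by limiting the kernel to fewer SMs, which leaves headroom for other kernels to launch and execute concurrently on the GPU.

In addition to the per-kernel speedups, these sync-free kernels allow end-to-end CUDA Graphs and efficient overlap with communication kernels, so the speedup at the full application level is much larger. In internal testing, we see up to 8% end-to-end speedup on DeepSeek-V3 and up to 93% end-to-end speedup on GPT-OSS pre-training runs from these optimizations.

We are continually adding new kernels and new features to these kernels.

Figure 3. Speedup for different activation function patterns on GB200. The baseline uses the optimized kernels from Transformer Engine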

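How to use CuTe DSL fused kernels to your advantage

These kernels are available at different levels of abstraction.

cuDNN Frontend (v1.23.0+): The kernels are housed in the cuDNN Frontend library. Users can install the library in their software stack and invoke the kernels directly from there. cuDNN Frontend also provides a wrapper for these kernels, which compiles the kernel on the first invocation and reuses the cached object on subsequent calls. Users can either invoke the kernel directly or access it through the wrapper API. We are also actively working on ahead-of-time (AOT) compilation support for these kernels, so that they can be compiled into cubins and cached on disk.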

Transformer Engine (v2.15+): Users can also use these kernels through Transformer Engine, which exposes these operations through the transformer_engine.pytorch.ops construct. The operations can be combined using the transformer_engine.pytorch.ops.Sequential block, which internally pattern-matches the ops and invokes the fused kernel from the cuDNN Frontend library.

Megatron Core (26.04-alpha.rc2+): Users can also use these kernels through Megatron Core, where the features can be enabled simply by setting the right configuration knobs.
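The compile-on-first-call behavior described for the cuDNN Frontend wrapper follows a common caching pattern, sketched generically below. `KernelCache` and `fake_compile` are hypothetical names invented for this illustration. They are not part of the cuDNN Frontend API, and the "compiled kernel" is only a placeholder function.

```python
class KernelCache:
    """Generic compile-on-first-use cache (hypothetical; not the cuDNN Frontend API)."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._compiled = {}
        self.compile_calls = 0

    def get(self, *config):
        """Return the kernel for this configuration, compiling it only once."""
        if config not in self._compiled:
            self.compile_calls += 1
            self._compiled[config] = self._compile_fn(*config)
        return self._compiled[config]


def fake_compile(activation, dtype):
    """Placeholder for an expensive JIT compile; returns a callable 'kernel'."""
    return lambda x: f"{activation}/{dtype}({x})"


cache = KernelCache(fake_compile)
first = cache.get("swiglu", "mxfp8")   # compiles
second = cache.get("swiglu", "mxfp8")  # reuses the cached object
other = cache.get("geglu", "mxfp8")    # different configuration, compiles again
```

The trade-off of this pattern is a one-time compile cost on the first call for each distinct configuration.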

Figure 4. Users can integrate these fused kernels from any of the abstraction layers in the CUDA stack: cuDNN Frontend, Transformer Engine, or Megatron Core

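What's next?

We are actively working on multiple new features, such as support for more fusion patterns and for more frameworks such as JAX.

Several kernel optimizations are also underway, including activation recompute, heuristics to pick the best kernels to compile, ahead-of-time (AOT) compilation to reduce compile cost, and lower CPU overhead.

If there is an activation function you would like supported, we encourage you to adapt the cuDNN kernels and contribute through a PR. Alternatively, please open an issue so we can track the feature in cuDNN Frontend.

Community feedback is very welcome!

Getting started

Follow the steps on GitHub to see how to run these kernels.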
