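Hugging Face Blog · February 26, 2026, 09:00 · about 11 min read

Mixture of Experts (MoEs) in Transformers

#MoE #Transformer #LLMOptimization #Hugging Face #SparseModels #ComputeEfficiency

TL;DR

Introduces the MoE technique, which combines multiple expert networks within a Transformer model. The approach delivers efficient computation and higher performance, and contributes to the development of large-scale AI models.

AI in-depth analysis · February 26, 2026, 22:41

Importance / 5-point scale

Key points

MoE replaces some Transformer layers with a collection of experts and activates only a few experts per token, improving inference speed while preserving model capacity.

Scaling conventional dense models has limits; MoE improves compute efficiency and reduces training cost.

MoE models are now implemented and usable in Hugging Face's Transformers library, and practical applications are progressing.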

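Impact analysis

MoE technology could address compute cost and inference latency, the main bottlenecks to deploying large language models in practice, and could promote the democratization of larger, higher-performing models. Hugging Face's public implementation plays an important role in accelerating adoption across the research community and industry.

Editorial comment

A clear explanation of implementation progress in MoE technology, a promising approach to the cost challenges of deploying large language models in practice.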

Mixture of Experts (MoEs) in Transformers

Over the past few years, scaling dense language models has driven most progress in LLMs. From early models like the original ULMFiT (~30M parameters) and GPT-2 (1.5B parameters, considered "too dangerous to release" at the time 🧌) to today's hundred-billion-parameter systems, the recipe was simple:

More data + more parameters gives better performance.

Scaling laws reinforced this trend, but dense scaling has practical limits:

Training becomes increasingly expensive.

Inference latency grows.

Deployment requires significant memory and hardware.

This is where Mixture of Experts (MoEs) enter the picture.

If you're already familiar with MoEs and want to jump straight into the engineering work done in transformers, you can head directly to Transformers and MoEs.

From Dense to Sparse: What Are MoEs?

A Mixture of Experts model keeps the Transformer backbone, but replaces certain dense feed-forward layers with a set of experts. An “expert” is not a topic-specialized module (e.g., "math expert", "code expert"). It is simply a learnable sub-network. For each token, a router selects a small subset of experts to process it.

Figure 1: Expert 1 among 4 experts is activated (Source: Maarten Grootendorst)

Different tokens activate different experts, based on their hidden representations.
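The routing step described above can be illustrated with a minimal sketch of generic top-k routing. This is an assumption-laden illustration and not the routing code of any particular model: the function name `route_token`, the softmax-then-top-k order, and the renormalization of the kept weights are all choices made here for clarity, and real models differ in these details.

```python
import math

def route_token(router_logits, top_k):
    """Return [(expert_index, weight), ...] for the top_k highest-scoring experts."""
    # Softmax over all experts (shifted by the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top_k experts, then renormalize their weights to sum to 1.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]

# One token's router scores over 4 experts (made-up numbers).
logits = [0.1, 2.0, -1.0, 1.5]
selection = route_token(logits, top_k=2)
print(selection)  # experts 1 and 3 are selected; the other two are never run
```

Only the selected experts run for this token, which is why the compute per token tracks active rather than total parameters.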

Model capacity depends on total parameters, but inference speed depends on active parameters.

This is the key idea.

For example, take gpt-oss-20b. The expression below is a rough estimate of its generation speed:

800 / (3.6 * 2)

This very high speed confirms that the model runs approximately like a 3.6B-parameter model, while having the capacity (or quality) of a 21B-parameter one.

(Note: speed would be even faster if we used kernels for the native mxfp4 quantization the model uses).
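The expression 800 / (3.6 * 2) can be read as a bandwidth-bound speed estimate. The sketch below works through it under assumptions that the text does not state: that 800 is memory bandwidth in GB/s, that 2 is bytes per parameter (a 16-bit dtype), and that the result is in tokens per second. The function name is illustrative.

```python
# Back-of-envelope decode-speed estimate for a memory-bandwidth-bound model.
# ASSUMPTIONS (not stated in the text): 800 is memory bandwidth in GB/s and
# 2 is bytes per parameter (a 16-bit dtype).

def estimated_tokens_per_second(bandwidth_gb_s, active_params_billions, bytes_per_param):
    """Each generated token must read every active parameter once."""
    gigabytes_read_per_token = active_params_billions * bytes_per_param
    return bandwidth_gb_s / gigabytes_read_per_token

moe_speed = estimated_tokens_per_second(800, 3.6, 2)     # active parameters only
dense_speed = estimated_tokens_per_second(800, 21.0, 2)  # if all 21B were read

print(f"MoE estimate:   {moe_speed:.1f} tokens/s")
print(f"Dense estimate: {dense_speed:.1f} tokens/s")
```

Under these assumptions the estimate is roughly 111 tokens per second for 3.6B active parameters, against roughly 19 if all 21B parameters had to be read for every token.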

MoEs are attractive for these reasons:

Better Compute Efficiency

Given a fixed training FLOP budget, MoEs often outperform dense counterparts.

Figure 2: Dense vs. MoE training curves (Source: OLMoE: Open Mixture-of-Experts Language Models)

This means faster iteration and better scaling efficiency.

A Natural Parallelization Axis

Experts provide a structural boundary in the computation graph. Since different tokens engage different experts, we can parallelize across experts (we discuss this later in Expert Parallelism).

Industry Adoption

Major open-model MoE releases from the past few weeks include Qwen 3.5, MiniMax M2, GLM-5, and Kimi K2.5.

The trend accelerated after the success of DeepSeek R1 in January 2025, building on earlier systems like DeepSeek V2. Another early MoE was Mixtral-8x7B, released in December 2023.

Figure 3: Two-year timeline of MoE model additions to transformers

Closed labs use MoEs too. ChatGPT has long been rumored to use a sparse architecture, and the open gpt-oss models certainly do.

If you want to learn more about MoEs in general, we strongly suggest reading this blog and watching our recent YouTube video on routing.

Transformers and MoEs

Most tooling in the ecosystem, including model loading, device placement, quantization, and backend execution, was originally designed for dense models. MoEs challenge these assumptions.

Making MoEs first-class citizens in transformers

Weight Loading Refactor

Expert Parallelism

Training MoEs with transformers

Weight Loading Refactor

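For a dense model, loading is conceptually simple: a single call copies checkpoint tensors into the model largely key by key.

AutoModelForCausalLM.from_pretrained("model_id")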

For MoEs, it’s more complicated. In most MoE checkpoints, each expert is serialized independently. If you peek inside the DeepSeek-V3 checkpoint index, you’ll see keys like:

model.layers.3.mlp.experts.0.gate_proj.weight ... model.layers.3.mlp.experts.255.gate_proj.weight

Each expert has its own set of weight matrices: in DeepSeek-V3, for example, that is 256 small feed-forward networks (indices 0 to 255) saved side by side. At runtime, however, GPUs execute optimized kernels. Modern MoE kernels such as grouped GEMMs and fused MoE implementations are designed to process all experts in a single operation, not by looping over them one at a time.

To do that efficiently, they require expert weights to be packed into a single contiguous tensor.

So we have a mismatch:

Checkpoint: 256 separate tensors

Runtime: 1 packed tensor

Bridging this gap systematically is what the weight loading refactor enables.
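To make the mismatch concrete, here is a small sketch of the first half of the job: finding the per-expert keys that belong together and putting them in expert order, so they could then be packed into one tensor. The regular expression and the helper `group_expert_keys` are hypothetical and only match keys shaped like the DeepSeek-V3 example; this is not the loader's actual code.

```python
import re
from collections import defaultdict

# Hypothetical pattern for keys like "<prefix>.experts.<index>.<proj>.weight".
EXPERT_KEY = re.compile(
    r"^(?P<prefix>.+\.experts)\.(?P<expert>\d+)\.(?P<proj>\w+)\.weight$"
)

def group_expert_keys(checkpoint_keys):
    """Map (prefix, projection) -> per-expert keys ordered by expert index."""
    groups = defaultdict(dict)
    for key in checkpoint_keys:
        match = EXPERT_KEY.match(key)
        if match:
            groups[(match["prefix"], match["proj"])][int(match["expert"])] = key
    return {
        target: [by_index[i] for i in sorted(by_index)]
        for target, by_index in groups.items()
    }

# Keys may appear in any order in a checkpoint index.
keys = [f"model.layers.3.mlp.experts.{i}.gate_proj.weight" for i in (2, 0, 1)]
keys.append("model.layers.3.self_attn.q_proj.weight")  # non-expert key, ignored
packed_plan = group_expert_keys(keys)
print(packed_plan)
```

Each entry of `packed_plan` lists, in expert order, the separate checkpoint tensors that would end up as slices of a single packed runtime tensor.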

With the introduction of a generic WeightConverter, the mental model shifted from:

A checkpoint already matches my runtime layout; loading is mostly a key-by-key copy.

to:

A checkpoint is just a serialized source of tensors. Loading is a conversion pipeline that transforms them into the runtime layout we want.

Dynamic Weight Loading with WeightConverter

The central abstraction introduced by this refactor is dynamic weight loading via a WeightConverter. A WeightConverter maps:

source key patterns → target key(s) + operations

Primitive operations (chunk, concatenate, etc.) are composable. Two that are particularly useful for MoEs:

MergeModulelist

WeightConverter(
    ["block_sparse_moe.experts.*.w1.weight", "block_sparse_moe.experts.*.w3.weight"],
    "mlp.experts.gate_up_proj",
    operations=[
        MergeModulelist(dim=0),
        Concatenate(dim=1),
    ],
)

SplitModulelist

WeightConverter(
    "mlp.experts.down_proj",
    "block_sparse_moe.experts.*.w2.weight",
    operations=[SplitModulelist(dim=0)],
)
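The effect of chaining a merge with a concatenation is easiest to see on tensor shapes. The sketch below assumes that MergeModulelist(dim=0) stacks the per-expert tensors along a new leading axis and that Concatenate(dim=1) joins the stacked w1 and w3 tensors along an existing axis; that is a reading of the operation names, not something verified against the library. The helpers `stack_shape` and `concat_shape` are hypothetical and work on shape tuples only.

```python
def stack_shape(per_expert_shapes, dim=0):
    """Shape after stacking same-shaped expert tensors along a new axis `dim`."""
    first = per_expert_shapes[0]
    assert all(s == first for s in per_expert_shapes), "experts must share a shape"
    return first[:dim] + (len(per_expert_shapes),) + first[dim:]

def concat_shape(shapes, dim):
    """Shape after concatenating tensors along the existing axis `dim`."""
    first = shapes[0]
    for s in shapes:
        assert s[:dim] == first[:dim] and s[dim + 1:] == first[dim + 1:]
    return first[:dim] + (sum(s[dim] for s in shapes),) + first[dim + 1:]

num_experts, intermediate, hidden = 8, 6, 4               # toy sizes
w1 = stack_shape([(intermediate, hidden)] * num_experts)  # gate projections
w3 = stack_shape([(intermediate, hidden)] * num_experts)  # up projections
gate_up = concat_shape([w1, w3], dim=1)
print(w1, w3, gate_up)
```

With these toy sizes, eight separate (6, 4) matrices per projection become one (8, 12, 4) tensor, which is the kind of packed layout a grouped kernel can consume in a single call. The split direction is the inverse: one packed tensor is cut back into per-expert tensors.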

Lazy Materialization of Tensors

The refactor improves not just what conversions exist, but how they’re scheduled.

The loader scans checkpoint keys once, matches them against converter patterns, and groups tensors per converter. Once a key is identified as needed, it's registered as a future and materialized via a thread pool. Conversion operations, such as MergeModulelist, run only once their dependencies are ready.

This avoids repeated scans and reduces memory peaks.
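The scheduling idea can be sketched with the standard library's thread pool. This is an illustration of the pattern only, under stated assumptions: `load_tensor` is a hypothetical stand-in for reading a tensor from disk, the "merge" is just collecting results into a list, and the real loader's bookkeeping is more involved.

```python
from concurrent.futures import ThreadPoolExecutor

def load_tensor(key):
    """Stand-in for reading one tensor from disk (hypothetical)."""
    return [float(len(key))]  # fake payload derived from the key

def load_with_conversion(groups, max_workers=4):
    """groups: target key -> ordered list of source keys to merge."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Register every needed source key as a future exactly once.
        pending = {
            target: [pool.submit(load_tensor, key) for key in sources]
            for target, sources in groups.items()
        }
        # A merge runs only after all of its source futures have resolved.
        return {
            target: [future.result() for future in futures]
            for target, futures in pending.items()
        }

groups = {
    "mlp.experts.gate_proj": [
        "experts.0.gate_proj.weight",
        "experts.1.gate_proj.weight",
    ],
}
merged = load_with_conversion(groups)
print(merged)
```

Because the keys are grouped before any tensor is read, nothing is loaded twice and each merged result can be produced as soon as its own inputs are available.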

Benchmark: Weight-Loading Pipeline Improvements

To evaluate the improvements introduced by the new weight-loading pipeline, we benchmarked transformers v4 against v5 using the following branches:

v4 branch: https://github.com/ariG23498/transformers/tree/bench-v4

v5 branch: https://github.com/ariG23498/transformers/tree/bench-v5

from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen1.5-110B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_id)

Two relevant environment variables:

HF_ENABLE_PARALLEL_LOADING

HF_DEACTIVATE_ASYNC_LOAD

Benchmark settings: model Qwen/Qwen1.5-110B-Chat; device_map="auto"; Async (default).

Figure 4: Loading benchmarks (v4 vs v5)

The speedup is not just “more threads.”

It's the combination of single-pass routing, async materialization, and conversion-aware scheduling, which together avoid unnecessary materialization and memory peaks while enabling expert packing and projection fusion at load time.

Where Quantization Fits In

With this refactor, we can create the runtime module structure first and then convert the weights into that structure. Quantization can optionally be attached within the conversion pipeline, making it part of weight loading itself. This is crucial because quantizing "per expert" only makes sense once experts exist in a predictable packed layout.

This end-to-end pipeline was not possible before; it is now exposed to users as an API.
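As a rough illustration of why a packed layout helps, the sketch below quantizes a packed list of experts one expert at a time. The scheme used here, symmetric absmax quantization to 8-bit integers, is a stand-in chosen for simplicity and is an assumption; it is not a description of the quantization methods the library applies, and `quantize_per_expert` is a hypothetical helper.

```python
def quantize_per_expert(packed, num_bits=8):
    """Quantize each expert's weights with its own scale (symmetric absmax)."""
    qmax = 2 ** (num_bits - 1) - 1
    quantized, scales = [], []
    for expert_weights in packed:
        # One scale per expert, derived from that expert's largest magnitude.
        absmax = max(abs(w) for w in expert_weights) or 1.0
        scale = absmax / qmax
        scales.append(scale)
        quantized.append([round(w / scale) for w in expert_weights])
    return quantized, scales

# Packed layout: axis 0 indexes experts (flattened weights, toy values).
packed = [[0.5, -1.0, 0.25], [10.0, -5.0, 0.0]]
quantized, scales = quantize_per_expert(packed)
dequantized = [[q * s for q in row] for row, s in zip(quantized, scales)]
print(quantized, scales)
```

With experts stacked along one axis, "one scale per expert" becomes a loop (or a reduction) over that axis, rather than a search through hundreds of separately named tensors.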

Once experts are packed into a single runtime tensor, another question arises:

How do you actually route through them efficiently?

In a Mixture of Experts model, each token is routed to different experts. This means the runtime must dispatch tokens to their selected expert weights, execute the projections efficiently, apply the routing weights and then collect and reorder the results.
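The dispatch, compute, weight, and collect steps can be written out in a few lines of plain Python. This is a deliberately simple sketch under assumptions: hidden states are single numbers, experts are ordinary functions, and everything runs in a loop, which is exactly what optimized backends avoid. The name `moe_forward` is illustrative.

```python
from collections import defaultdict

def moe_forward(tokens, routes, experts):
    """
    tokens:  list of scalar "hidden states" (one per token, for simplicity)
    routes:  per token, a list of (expert_index, routing_weight)
    experts: list of callables, one per expert
    """
    # 1. Dispatch: collect which tokens (and weights) go to each expert.
    assignments = defaultdict(list)
    for token_idx, token_routes in enumerate(routes):
        for expert_idx, weight in token_routes:
            assignments[expert_idx].append((token_idx, weight))

    # 2-4. Run each expert on its tokens, scale by the routing weight, and
    # scatter the results back into the original token order.
    outputs = [0.0] * len(tokens)
    for expert_idx, batch in assignments.items():
        for token_idx, weight in batch:
            outputs[token_idx] += weight * experts[expert_idx](tokens[token_idx])
    return outputs

experts = [lambda x: x + 1, lambda x: x * 2, lambda x: -x]
tokens = [1.0, 2.0, 3.0]
routes = [[(0, 0.5), (1, 0.5)], [(2, 1.0)], [(1, 0.25), (0, 0.75)]]
result = moe_forward(tokens, routes, experts)
print(result)  # [2.0, -2.0, 4.5]
```

The output keeps the original token order even though the work was grouped by expert, which is the "collect and reorder" step.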

This is what the Experts Backend system (introduced in PR #42697) addresses. The Experts Backend introduces a pluggable execution architecture that decouples expert computation from the model implementation. Instead of hardcoding one dispatch strategy inside each MoE model, the system allows expert layers to dynamically select a backend at runtime.

This is implemented via a decorator pattern:

@use_experts_implementation

The decorator wraps expert classes and dispatches computation to the selected backend automatically.
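A decorator-based backend switch can look roughly like the sketch below. Every name in it (`BACKENDS`, `register_backend`, `pluggable_experts`, `ToyExperts`, and the two backend names) is hypothetical and chosen for illustration; this shows the general pattern only and is not the implementation behind `@use_experts_implementation`.

```python
# Hypothetical registry of dispatch strategies, keyed by name.
BACKENDS = {}

def register_backend(name):
    """Add a dispatch function to the registry under `name`."""
    def decorator(fn):
        BACKENDS[name] = fn
        return fn
    return decorator

@register_backend("loop")
def loop_backend(layer, x, expert_idx):
    # Simplest strategy: call the selected expert directly.
    return layer.experts[expert_idx](x)

@register_backend("table")
def table_backend(layer, x, expert_idx):
    # Alternative strategy with the same result: evaluate all, then pick.
    return [expert(x) for expert in layer.experts][expert_idx]

def pluggable_experts(cls):
    """Class decorator: route forward() through the backend named on the instance."""
    def forward(self, x, expert_idx):
        return BACKENDS[self.backend](self, x, expert_idx)
    cls.forward = forward
    return cls

@pluggable_experts
class ToyExperts:
    def __init__(self, backend="loop"):
        self.backend = backend
        self.experts = [lambda x: x + 1, lambda x: x * 10]

print(ToyExperts("loop").forward(3, 1), ToyExperts("table").forward(3, 1))
```

The model class never mentions a specific strategy; switching backends is a matter of changing one attribute, which is the decoupling the text describes.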

Three backends are currently provided:

torch._grouped_mm

Figure: Expert backend illustration

Expert Parallelism

Mixture of Experts (MoE) models can have hundreds of billions of parameters, far more than fits on a single GPU. Expert parallelism (EP) addresses this by distributing experts across multiple devices. Each device loads only its assigned subset of experts, computes for those experts, and then participates in result aggregation. Because each token activates only a few experts, this approach scales models to far larger parameter counts without increasing per-token computation cost.

Expert parallelism is enabled via the enable_expert_parallel flag of DistributedConfig:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed.configuration_utils import DistributedConfig

distributed_config = DistributedConfig(enable_expert_parallel=True)

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",
    dtype="auto",
    distributed_config=distributed_config,
)

torchrun --nproc-per-node N script.py

When enable_expert_parallel=True

Core components of EP lie in:

GroupedGemmParallel

num_experts / num_devices
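The num_experts / num_devices split can be sketched as follows. This assumes a contiguous, evenly divisible assignment of experts to devices, which is one simple choice and may not match how the library actually partitions them; both helper names are hypothetical.

```python
def local_expert_range(num_experts, num_devices, rank):
    """Contiguous block of expert indices owned by `rank` (even split assumed)."""
    if num_experts % num_devices != 0:
        raise ValueError("this sketch assumes num_experts divides evenly")
    per_device = num_experts // num_devices
    start = rank * per_device
    return range(start, start + per_device)

def owner_of(expert_idx, num_experts, num_devices):
    """Rank that holds a given expert under the same contiguous split."""
    return expert_idx // (num_experts // num_devices)

# Example: 256 experts over 8 devices gives 32 experts per device.
rank3 = local_expert_range(256, 8, rank=3)
print(rank3, owner_of(100, 256, 8))
```

Each rank then only needs to load the weights in its own range, and a token routed to expert 100 has to be sent to the rank returned by `owner_of`.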

Training MoEs with Transformers

MoEs are excellent for scaling inference, but training them is significantly more complex.

~12× faster MoE training

35% VRAM reduction

~6× longer context

12–30× overall speedup compared to v4

We leverage the Expert Backend abstraction, standardize around PyTorch’s torch._grouped_mm

For full details, we recommend reading: Unsloth’s official guide

As sparse architectures continue to evolve, we want the transformers library to evolve with them. If you're building with MoEs or experimenting with new sparse ideas, we'd love to hear from you. Let us know what abstractions, kernels, or workflows you'd like to see next in transformers.
