Allen AI (AI2) · May 8, 2026, 17:00 · about 12 min read

EMO: A Pretrained Mixture-of-Experts Model in Which Modular Experts Emerge Naturally from Data

#Mixture of Experts #Model Efficiency #Emergent Behavior #Allen AI #MoE Architecture

TL;DR

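Allen AI has released EMO, a new mixture-of-experts model in which modular groups of experts emerge from the data on their own. A task-specific subset of experts can be selected while performance stays close to that of the full model.

AI Deep Analysis · May 9, 2026, 01:04

Importance: / 5

Depth: 40%

Key points

Modularity that emerges through self-organization

Rather than being designed by hand, the groups of experts emerge naturally from the data during training.

Task-specific subset selection

Users can select and run only the small subset of experts that a given task needs, which reduces the compute resources required.

Near full-model performance

Even with selective inference that uses only some of the experts, performance stays close to that of the full model.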


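Impact analysis

This technique is an important step toward resolving the trade-off between efficiency and generality in large language models and multimodal models. In particular, it points to a new paradigm for high-performance inference in resource-constrained environments, and it could strongly influence the development of lighter yet more capable AI systems.

Editorial comment

The shift from manual design to emergence from data is a very important turning point in the evolution of AI models. It is a groundbreaking approach that fits the current push for resource efficiency.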

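Today we're releasing EMO, a new mixture-of-experts (MoE) model pretrained end-to-end so that modular structure emerges directly from the data without relying on human-defined priors. EMO lets you use a small subset of its experts (just 12.5% of the total) for a given task while keeping near full-model performance, and it still works as a strong general-purpose model when all experts are used together.

Large language models are typically trained and deployed as monolithic systems: a single model is initialized, pretrained, fine-tuned, and served as one unified entity. But applications often need only a subset of capabilities, such as code generation, mathematical reasoning, or domain-specific knowledge. As frontier language models routinely reach trillions of parameters, using and adapting the full model becomes impractical for most users and incurs unnecessary compute and memory costs to host parameters that may never be needed.

Mixture-of-experts (MoE) models seem like a natural way to relax this constraint. Instead of using one large feedforward network at each layer, an MoE contains many smaller ones, called experts, and activates only a small subset of them for each input token. In principle, a task that needs only one capability could load only the relevant experts.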
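To make per-token expert activation concrete, here is a minimal pure-Python sketch of top-k routing for a single token. It is an illustration, not EMO's code: the function names, the example logits, and the choice to renormalize the selected experts' weights are all assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized to sum to 1 (one common convention, assumed here).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Hypothetical router scores for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
routes = top_k_route(logits, k=2)
print(routes)  # experts 1 and 4 are selected; the other 6 stay inactive
```

Only the selected experts' feedforward networks run for this token, which is why the active parameter count of an MoE is much smaller than its total parameter count.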

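In practice, however, existing MoEs still need the full model to work well. Even within a single input, different tokens often activate different experts, so a task can end up using all the experts over the course of generation. As we show in our paper, this happens partly because experts in standard MoEs often specialize in low-level lexical patterns such as prepositions or punctuation rather than in higher-level domains or capabilities. As a result, small subsets of experts are not reliably usable on their own.

We instead want MoE models whose experts organize into coherent groups that can be selectively used and composed.

One way to encourage this during pretraining is to route tokens to experts based on predefined semantic domains, such as math, biology, or code. Prior work such as BTX and our FlexOlmo project has taken this approach. However, predefined domains come with important limitations. They require domain labels across the pretraining corpus, which can be ambiguous and expensive to obtain, and they may inject too much human bias into how the model organizes itself. More importantly, fixing the domains upfront also fixes the model's modular structure: if a new domain or capability appears at inference time, it is not obvious which experts should be used.

That's where EMO comes in.

EMO is an MoE with 1B active and 14B total parameters (8 active experts out of 128 total), trained on 1 trillion tokens, that supports selective expert use: for a given task or domain, we can use only a small subset of experts (just 12.5% of the total) while retaining near full-model performance. At the same time, when all experts are used together, EMO remains a strong general-purpose model. In contrast, a standard MoE with the same architecture trained on the same data degrades severely when only a subset of its experts is used.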

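How do we get modularity to emerge?

In an MoE, a small network called the router decides which experts each token activates. We want the router to learn that tokens from similar domains should activate similar subsets of experts. Our key observation is that tokens from the same document usually come from the same domain. We therefore use document boundaries as a weak supervisory signal: during training, all tokens in a document are restricted to choosing their active experts from a shared expert pool.

For example, in an MoE with 10 total experts and 2 active experts per token, all tokens in a document are restricted to routing within the same pool of 4 experts, as shown in the figure above. This pool is chosen by the router itself: we average the router's expert preferences across all tokens in the document, then select the most-used experts as the document's shared pool. Different documents can use different pools, which allows recurring expert groups to emerge directly from the training data.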
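The 10-expert example can be sketched in a few lines of pure Python. This is a rough illustration under stated assumptions, not the released training code: "router preferences" are taken to be softmax probabilities averaged over the document's tokens, and the function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_document_pool(token_logits, pool_size):
    """Average router probabilities over all tokens in the document and
    keep the pool_size experts with the highest mean probability."""
    num_experts = len(token_logits[0])
    probs = [softmax(row) for row in token_logits]
    mean = [sum(p[e] for p in probs) / len(probs) for e in range(num_experts)]
    ranked = sorted(range(num_experts), key=lambda e: mean[e], reverse=True)
    return sorted(ranked[:pool_size])

def route_within_pool(token_logits, pool, k):
    """For each token, pick its top-k experts among the pool members only
    (equivalent to masking out every expert outside the pool)."""
    routes = []
    for row in token_logits:
        best = sorted(pool, key=lambda e: row[e], reverse=True)[:k]
        routes.append(sorted(best))
    return routes

# Toy document: 3 tokens, 10 experts, 2 active experts per token, pool of 4.
doc_logits = [
    [4, 3, 0, 0, 0, 0, 2, 0, 0, 1],
    [3, 4, 0, 0, 0, 0, 2, 0, 0, 0],
    [4, 0, 0, 0, 0, 0, 3, 0, 0, 2],
]
pool = select_document_pool(doc_logits, pool_size=4)
routes = route_within_pool(doc_logits, pool, k=2)
print(pool)    # [0, 1, 6, 9]
print(routes)  # [[0, 1], [0, 1], [0, 6]]
```

Each token still makes its own top-2 choice, but every choice falls inside the 4-expert pool that the document as a whole prefers.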

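There are a few considerations when implementing the system:

Load balancing. One technical challenge is load balancing. In standard MoE training, a load-balancing objective is used to prevent the model from collapsing onto a small number of experts. At first glance, this seems to conflict with EMO's training objective, since we explicitly restrict each document to a subset of experts.

The conflict comes from the scale at which load balancing is usually applied. In many MoE implementations, load balancing is computed locally, often within a micro-batch containing only a small number of documents. This local objective can push tokens within the same document to spread across many experts, directly opposing EMO's objective of keeping expert usage consistent within a document.

To resolve this, we apply load balancing globally across many documents. At this larger scale, the two objectives become complementary: EMO encourages tokens within the same document to use a coherent expert pool, while global load balancing encourages different documents to collectively cover all experts. In practice, we found that global load balancing is important for stable training.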
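A small numeric sketch shows why the scale matters. It assumes a Switch-Transformer-style auxiliary loss, N * sum_i(f_i * P_i), where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i. The post does not say which loss EMO uses, so this form is an assumption, and top-1 routing with one-hot probabilities is used only to keep the arithmetic easy to follow.

```python
def load_balance_loss(assignments, probs, num_experts):
    """Auxiliary loss N * sum_i f_i * P_i (assumed form, see lead-in).

    assignments: chosen expert index per token (top-1 routing for simplicity)
    probs: router probability vector per token
    """
    n_tokens = len(assignments)
    f = [0.0] * num_experts
    for a in assignments:
        f[a] += 1.0 / n_tokens
    p = [sum(row[e] for row in probs) / n_tokens for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

def one_hot(i, n):
    """Probability vector that puts all mass on expert i."""
    return [1.0 if j == i else 0.0 for j in range(n)]

N = 4
doc_a = [0, 1, 0, 1]  # document A stays inside the pool {0, 1}
doc_b = [2, 3, 2, 3]  # document B stays inside the pool {2, 3}
probs_a = [one_hot(e, N) for e in doc_a]
probs_b = [one_hot(e, N) for e in doc_b]

# Local: the loss sees one document at a time and penalizes its narrow pool.
local_a = load_balance_loss(doc_a, probs_a, N)
local_b = load_balance_loss(doc_b, probs_b, N)

# Global: the loss sees both documents, whose pools together cover all experts.
global_loss = load_balance_loss(doc_a + doc_b, probs_a + probs_b, N)

print(local_a, local_b, global_loss)  # 2.0 2.0 1.0
```

In this toy setting the global loss is already at its minimum of 1.0 (perfectly uniform usage), even though each document uses only half of the experts, while the per-document losses are twice as large.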

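Document pool size. The document pool size controls how restrictive the modularity constraint is. A smaller pool forces tokens in the same document to share a tighter set of experts, encouraging stronger modularity; a larger pool gives the model more flexibility but weakens the constraint.

Rather than fixing one pool size, we randomly sample it during training. This prevents EMO from overfitting to a single subset size and lets it support different expert subset sizes at inference time.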
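As a sketch of what sampling the pool size could look like, the snippet below draws one pool size per document. The sampling distribution is not stated in the post, so the uniform draw between the number of active experts and the total number of experts is purely an assumption, as is the function name.

```python
import random

def sample_pool_size(num_experts, top_k, rng):
    """Draw a document pool size uniformly from [top_k, num_experts].

    The pool can never be smaller than top_k, because every token still
    needs top_k experts to route to. Uniform sampling is an assumption.
    """
    return rng.randint(top_k, num_experts)

rng = random.Random(0)  # seeded so the sketch is reproducible
# EMO's reported configuration: 128 experts in total, 8 active per token.
sizes = [sample_pool_size(128, 8, rng) for _ in range(5)]
print(sizes)
```

Because each document sees a different pool size, the model is trained on both tight and loose constraints rather than on one fixed subset size.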

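Benchmark results

On general-purpose benchmarks, EMO matches the performance of a standard MoE model, showing that the modularity objective does not come at the cost of full-model performance. The more important question, however, is whether the model still works when we keep only a subset of experts. In this setting, we construct task-specific expert subsets by ranking experts according to their routing usage on a small amount of task validation data, keeping the most-used experts and discarding the rest.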
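The ranking step described above can be sketched as follows. This is a simplified illustration for a single MoE layer; the function name, the tie-breaking rule, and the toy routing trace are assumptions, and a real model would presumably repeat the selection for each layer.

```python
from collections import Counter

def select_task_experts(routed_experts, num_experts, keep_fraction):
    """Rank experts by how often tokens were routed to them and keep the
    most-used keep_fraction of them (ties broken by lower expert index)."""
    counts = Counter(e for token in routed_experts for e in token)
    keep = max(1, int(num_experts * keep_fraction))
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:keep])

# Hypothetical routing trace on task validation data:
# 4 tokens, each routed to 2 of 8 experts.
trace = [[0, 3], [3, 5], [3, 0], [0, 1]]
kept = select_task_experts(trace, num_experts=8, keep_fraction=0.25)
print(kept)  # [0, 3]

# With EMO's 128 experts, the two reported settings keep 32 and 16 experts.
print(int(128 * 0.25), int(128 * 0.125))  # 32 16
```

Experts outside the returned list would be dropped, and the router would then choose only among the kept experts at inference time.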

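The figure below shows that EMO remains robust under selective expert use. When we keep only 25% of the experts (a 32-expert subset), EMO loses only about 1% absolute performance across all benchmarks; even when we keep only 12.5% of the experts (a 16-expert subset), the overall drop is only about 3%. This holds both before and after fine-tuning. In contrast, the matching standard MoE degrades sharply as the expert subset gets smaller, often falling close to or below random performance in the smallest expert subset settings.

Furthermore, we show that selecting the right experts for a task is surprisingly cheap: a single example with few-shot demonstrations is enough to identify a module that performs on par with one selected using a full validation set. EMO also isn't tied to any particular selection method. It works well with existing expert-pruning approaches such as Easy-EP, and the two complement each other.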

What do expert subsets specialize in?

To see what EMO actually learned after training, we clustered the router activations of the first 100 tokens across 12K pretraining documents. The difference from a standard MoE is stark.

EMO's token clusters correspond to things like *Health, Medical & Wellness*, *News Reporting*, US *Politics & Elections*, and *Film & Music*. A standard MoE produces clusters like *Prepositions*, *Proper Names*, *Copula Verbs*, or *Definite Articles*. In EMO, tokens from a given document mostly land in the same cluster; in a standard MoE, they end up scattered across many.

The contrast is easiest to see on a single example. Take a health article: in EMO, almost every token routes into the *Health, Medical & Wellness* cluster. In a standard MoE, the top cluster is *Possessives & Definite Articles*; the model groups the article with every other text that happens to use the word *the* or *your*, regardless of what that text is about.

Because EMO forms modules that map to semantic domains rather than surface features, you can pick a small expert subset and still have a functioning model. The group corresponds to a real capability.

You can explore the clustering results yourself in our interactive visualization.

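What we're releasing

We're releasing the full EMO-trained model, a matched standard-MoE baseline trained on the same data, and the training code. We hope these artifacts are useful for other groups studying emergent modularity in MoEs.

There's more work to do. EMO is an early step toward making large sparse models more modular, but many questions remain: how to better select and compose expert subsets, how to update modules without disrupting the full model, and how to use modular structure for better interpretability and control. Releasing these models should help the community study these questions and build toward modular language models that are easier to deploy, adapt, inspect, and compose.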


