Simon Willison Blog · June 3, 2026, 07:21 · about 4 min read

Microsoft announces new MAI models

#LLM #MoE #Reasoning #Code Generation #Data Licensing #Microsoft

TL;DR

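Microsoft announced two new models, the reasoning-focused MAI-Thinking-1 and the code-focused MAI-Code-1-Flash. The author's initial optimism about their data licensing was later corrected: the technical paper shows that training relied on web crawl data.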

AI in-depth analysis · June 3, 2026, 12:47

Depth: 40%

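Key points

New models and their specifications

Microsoft announced the reasoning-focused MAI-Thinking-1 (1T total parameters, 35B active) and the code-focused MAI-Code-1-Flash (137B total parameters, 5B active). The latter is rolling out to GitHub Copilot individual users in Visual Studio Code.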

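Correction on data licensing

The author initially hoped these might be the first generally useful code-specialist models that were not trained on an unlicensed dump of the web. The MAI-Thinking-1 technical paper shows otherwise: most of its web HTML corpus comes from a proprietary crawl of approximately 1.2 trillion pages, reduced to 794 billion pages by filtering, and Common Crawl is processed through the same pipeline.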

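Performance claims and comparison

Microsoft claims that MAI-Thinking-1 "is preferred to Sonnet 4.6 in our blind human side-by-side evaluations". If the claim holds, it suggests that an MoE (Mixture of Experts) design with a small active parameter count (35B active out of 1T total) can be competitive with other large models.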

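Correction of technical details

The author's initial post misread the parameter counts, taking the MoE active parameter count for the total. The figures were corrected after consulting the MAI-Code-1-Flash model card and the MAI-Thinking-1 technical paper.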

Key quotes

"We trained [MAI-Thinking-1] from the ground up on enterprise grade, clean and commercially licensed data, without distillation from third-party models."

"The majority of our web HTML corpus comes from a proprietary crawl... approximately 1.2 trillion pages are crawled and parsed."

"is preferred to Sonnet 4.6 in our blind human side-by-side evaluations"

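Impact analysis

This news highlights a structural problem across the industry: large tech companies emphasize "licensed data" to promote the quality and ethics of their models, while in practice they depend on large-scale web crawling. In addition, sustaining the performance of a very large model through an MoE architecture may point to a significant technical shift for cost efficiency and for edge computing applications, and it gives developers and companies reason to revisit their deployment strategies.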

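Editorial comment

The gap between vendor statements that emphasize "licensed data" and the large-scale web crawling disclosed in the technical paper is emblematic of the AI industry's data transparency problem. The author's own correction of the parameter counts also suggests that the sizes of MoE models remain easy to misread.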

Original text

Microsoft announced two new text LLMs this morning - MAI-Thinking-1 (reasoning, 1T parameters, 35B active, available to "select early partners") and MAI-Code-1-Flash (137B parameters, 5B active, "purpose-built for GitHub Copilot and VS Code to deliver high performance and lower cost [...] rolling out to GitHub Copilot individual users in Visual Studio Code"). I've not been able to try either of them just yet.

It's very interesting to see Microsoft releasing models with such low parameter counts, especially given how expensive larger models are to access right now. They claim MAI-Thinking-1 "is preferred to Sonnet 4.6 in our blind human side-by-side evaluations", which is impressive for a 35B model seeing as I frequently run models larger than that on my own laptop. (UPDATE: I got this entirely wrong, see note below.)

Also of note:

We trained [MAI-Thinking-1] from the ground up on enterprise grade, clean and commercially licensed data, without distillation from third-party models.

And for MAI-Code-1-Flash as well:

It is built end-to-end by Microsoft using clean and appropriately licensed data.

I would *very much* like to learn more about this "appropriately licensed" data! Could these be the first generally useful code-specialist models that didn't train on an unlicensed dump of the web? (Update: the answer is no, see note below.)

Update: My initial published notes got the size of the models wrong. I misread Microsoft's announcements and interpreted the MoE active parameter count as the total parameter count, but the model card for MAI-Code-1-Flash lists it as 137B with 5B active and the MAI-Thinking-1 technical paper reveals it to be a 1T model with 35B active.
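
To make the total-versus-active distinction concrete, here is a minimal sketch using the parameter counts quoted above. The `MoEModel` class, its method names, and the 4-bit quantization figure are illustrative assumptions, not anything taken from Microsoft's model card or paper. The memory estimate counts weights only and ignores the KV cache and activations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    """Hypothetical container for the parameter counts quoted in the post."""
    name: str
    total_params: float   # every parameter that has to be stored
    active_params: float  # parameters actually used for each token

    def active_fraction(self) -> float:
        """Share of the stored parameters that is active per token."""
        return self.active_params / self.total_params

    def weight_gb(self, bits_per_weight: int) -> float:
        """Rough size of the weights alone, in decimal gigabytes."""
        return self.total_params * bits_per_weight / 8 / 1e9


MODELS = [
    MoEModel("MAI-Thinking-1", total_params=1e12, active_params=35e9),
    MoEModel("MAI-Code-1-Flash", total_params=137e9, active_params=5e9),
]

for model in MODELS:
    # 4 bits per weight is an assumed quantization level, chosen only
    # because it is a common choice for running models locally.
    print(
        f"{model.name}: {model.active_fraction():.1%} active, "
        f"about {model.weight_gb(4):.1f} GB of weights at 4 bits"
    )
```

Under these assumptions the weights of a 1T-parameter model take roughly 500 GB even at 4 bits, so the total count, not the active count, decides whether a model fits on a given machine. The active count mainly affects how much computation each token needs.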

I deeply regret this error.

Update 2: That technical paper describes the training data in some detail from page 80 onwards. It has the same licensing problems as all of the other major LLMs: it's trained on a crawl of the public web:

The majority of our web HTML corpus comes from a proprietary crawl. After initial page discovery and selection, approximately 1.2 trillion pages are crawled and parsed. [...] In addition to Microsoft standard policy Sec. 2.4, we apply UT1 block list (Prigent, 2026) to remove adult content and piracy-related domains. In all, this filtering reduces the corpus from 1.2 trillion pages to 794 billion pages. Given the prevalence of AI-generated content on the web, we also score pages with a proprietary AI-content detection model and use manual inspection to identify domains with extensive AI-generated content; those domains are filtered out of the training corpus.
[...]
We process Common Crawl with the same pipeline. [...] After filtering, deduplication, merging with the proprietary web corpus, and a final round of exact-URL and content-level fuzzy deduplication, the Common Crawl portion contains 24.2 billion pages.
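
The excerpt names exact-URL and content-level fuzzy deduplication but does not say how either step is implemented. As a rough illustration only, the sketch below uses a set of normalized URLs for the exact pass and Jaccard similarity over word shingles for the fuzzy pass. The function names, the normalization rule, the shingle size, and the 0.8 threshold are all assumptions. The pairwise comparison here would not scale to billions of pages, where an approximate scheme such as MinHash with locality-sensitive hashing is the usual choice.

```python
from typing import Iterable


def normalize_url(url: str) -> str:
    """Assumed normalization: drop the fragment and any trailing slash."""
    return url.split("#", 1)[0].rstrip("/")


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of k-word shingles; very short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) <= k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(
    pages: Iterable[tuple[str, str]], threshold: float = 0.8
) -> list[tuple[str, str]]:
    """Keep the first page per URL, then drop near-duplicate content."""
    seen_urls: set[str] = set()
    kept: list[tuple[str, str]] = []
    kept_shingles: list[set] = []
    for url, text in pages:
        key = normalize_url(url)
        if key in seen_urls:
            continue  # exact-URL duplicate
        seen_urls.add(key)
        current = shingles(text)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue  # fuzzy content duplicate
        kept.append((url, text))
        kept_shingles.append(current)
    return kept


# Made-up pages for illustration; none of this is real crawl data.
body = "the quick brown fox jumps over the lazy dog near the river bank today"
pages = [
    ("https://example.com/a", body),
    ("https://example.com/a/", "completely different text"),  # same URL
    ("https://example.com/b", body + "!"),                    # near-identical
    ("https://example.com/c", "an unrelated page about training data"),
]
unique = dedupe(pages)
print([url for url, _ in unique])
```

In this toy input the second page is dropped by the URL pass and the third by the content pass, leaving the first and last pages.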

I did not cover this one at all well, which is somewhat ironic since I was at the Microsoft Build conference when I wrote this up! I'm sorry for not digging deeper before publishing my initial notes.

Tags: llm-release, generative-ai, ai, microsoft, llms, training-data
