
Interconnects · February 25, 2026, 01:06 · about 15 min read

Does distillation really matter for Chinese LLMs?

#LLM #Knowledge Distillation #Synthetic Data #Anthropic #API Security

TL;DR

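Anthropic accused three labs, DeepSeek, Moonshot, and MiniMax, of running large-scale distillation campaigns through illicit use of the Claude API, bringing the political and technical realities of technology leakage in the US-China AI race into focus.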


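Key points

Anthropic’s accusation of distillation campaigns

Anthropic publicly accused DeepSeek, Moonshot, and MiniMax of illicitly using the Claude API to extract its models’ capabilities through more than 16 million exchanges.

Distillation and synthetic data: definitions and practice

Today “distillation” usually means not the technical matching of probability distributions but the use of “synthetic data”, where a stronger model’s outputs serve as training data. It has become one of the most useful improvement methods in modern AI research.

US-China AI competition and fears of technology leakage

Concerns that Chinese labs are taking reasoning traces and similar outputs from American API models are long-standing, and the same fear likely lay behind Gemini’s decision to stop exposing its reasoning traces.

Technical limits and political context

Official APIs do not expose the information needed for full knowledge distillation, so extraction has relied on illicit methods such as jailbreaking. This makes it a national security concern that goes beyond a purely technical problem.

The two faces of distillation

Distillation is a legitimate method for building smaller models, but competitors can also misuse it to acquire an advanced model’s capabilities quickly and cheaply.

DeepSeek’s API usage had limited impact

DeepSeek’s 150,000-plus exchanges are tiny relative to the scale of model training, and their long-term effect on the V4 model is negligible.

Broad usage by Moonshot AI and MiniMax

Moonshot AI used more than 3.4 million exchanges to pursue a broad range of capabilities, including agentic reasoning and coding, a much larger scale than DeepSeek’s. MiniMax’s more than 13 million exchanges were larger still.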

Key quotes

We have identified industrial-scale campaigns by three AI laboratories—DeepSeek, Moonshot, and MiniMax—to illicitly extract Claude’s capabilities to improve their own models.

Synthetic data is arguably the single most useful method that an AI researcher today uses to improve the models on a day to day basis.

Distillation is a widely used and legitimate training method... But distillation can also be used for illicit purposes: competitors can use it to acquire powerful capabilities from other labs in a fraction of the time, and at a fraction of the cost

In the scale of training a language model, 150K samples is only scratching the surface as a substantive experiment.

"The research community has seen many cases where taking outputs from a certain teacher model unexpectedly makes the student worse — subtle interactions between the data make it variable and tricky to do this type of distillation. It’s fundamentally a research problem."

"It’s way easier to figure out getting access to 'banned' API models than it is to smuggle tens of thousands of physical GPUs and get them set up."


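Impact analysis

This news suggests that the US-China contest for AI technological supremacy is shifting from a pure race on model performance toward legal and political conflict over API security and intellectual property protection. Anthropic’s public accusation serves as a warning to other companies and also makes “illicit technology leakage” visible to regulators and investors. It may push API access controls to become stricter in future AI development.

Editorial comment

Anthropic’s public accusation marks an important turning point that redraws the ethical and legal boundaries of “synthetic data” in AI development. Auditing of API access and distillation-detection techniques are likely to become keys to competition.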


Distillation has been one of the most frequent topics of discussion in the broader US-China and technological diffusion story for AI. Distillation is a term with many definitions — the colloquial one today is using a stronger AI model’s outputs to teach a weaker model. The word itself is derived from a more technical and specific definition of knowledge distillation (Hinton, Vinyals, & Dean 2015), which involves a specific way of learning to match the probability distribution of a teacher model.

The distillation of today is better described generally as synthetic data. You take outputs from a stronger model, usually via an API, and you train your model to predict those. The technical form of knowledge distillation is not actually possible from API models because they don’t expose the right information to the user.
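To make the distinction concrete, here is a minimal sketch contrasting the two training signals on a single token position with a toy four-token vocabulary. The function names, the temperature value, and the logits are illustrative assumptions, not anything from the post or from a specific library. The first loss follows a common formulation of knowledge distillation (KL divergence between temperature-softened teacher and student distributions, scaled by T²) and needs the teacher's full logit vector. The second is ordinary cross-entropy on the one token the teacher actually emitted, which is all that text returned from an API can support.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss: T^2 * KL(teacher || student) on softened distributions.

    Requires the teacher's logits for every vocabulary entry.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

def sft_loss(student_logits, sampled_token_id):
    """Hard-label loss: cross-entropy on the single token the teacher emitted."""
    q = softmax(student_logits)
    return -math.log(q[sampled_token_id])

# Hypothetical values for one position in a sequence.
teacher_logits = [2.0, 1.0, 0.1, -1.0]
student_logits = [0.5, 0.5, 0.5, 0.5]

soft = kd_loss(student_logits, teacher_logits)        # uses the whole distribution
hard = sft_loss(student_logits, sampled_token_id=0)   # uses only the sampled token
```

The soft-target loss carries information about how the teacher ranks every alternative token, while the hard-label loss only sees which token was sampled. That gap is why training on API outputs is closer to synthetic-data fine-tuning than to knowledge distillation in the original sense.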

Synthetic data is arguably the single most useful method that an AI researcher today uses to improve the models on a day to day basis. Yes, architecture is crucial, some data still needs exclusively human inputs, and new ideas like reinforcement learning with verifiable rewards at scale can transform the industry, but so much of the day to day life in improving models today is figuring out how to properly capture and scale up synthetic data.

To flesh out the point from the start of this piece, the argument has repeatedly been that the leading Chinese labs are using distillation for their models to steal capabilities from the best American API-based counterparts. The most prominent case to date surrounded the release of DeepSeek R1, where OpenAI accused DeepSeek of stealing their reasoning traces by jailbreaking the API. (They’re not exposed by default. For context, a reasoning trace is a colloquial term of art for the internal reasoning process, such as what open-weight reasoning models expose to the user.) Fear of distillation is also likely why Gemini quickly flipped from exposing the reasoning traces to users to hiding them. There was even very prominent, early reasoning research that built on Gemini!

This all leads us to today’s news, where Anthropic named and directly accused a series of Chinese labs of running elaborate distillation campaigns on their Claude models. This is a complex issue. In this post we unpack a series of questions, beginning with the impact and ending with politics. The core question is: how much of a performance benefit do Chinese labs get from distilling from American models?

Interconnects AI is a reader-supported publication. Consider becoming a subscriber.

To start, let’s review what Anthropic shared. From the blog post, emphasis mine:

We have identified industrial-scale campaigns by three AI laboratories—DeepSeek, Moonshot, and MiniMax—to illicitly extract Claude’s capabilities to improve their own models. These labs generated over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts, in violation of our terms of service and regional access restrictions.

These labs used a technique called “distillation,” which involves training a less capable model on the outputs of a stronger one. Distillation is a widely used and legitimate training method. For example, frontier AI labs routinely distill their own models to create smaller, cheaper versions for their customers. But distillation can also be used for illicit purposes: competitors can use it to acquire powerful capabilities from other labs in a fraction of the time, and at a fraction of the cost, that it would take to develop them independently.

Much like the models themselves, the benefits of distillation are very jagged. For some capabilities, particularly if you don’t have a full training pipeline set up for them, quickly distilling some data from the leading frontier model in that area can yield massive performance boosts. This can definitely help the lab distilling from the API catch up much more quickly than it otherwise would. Most distillation is rather benign: it uses many tokens of an LLM to help process and refine existing data, putting a lot of compute into getting a few high-quality training tokens out. This sort of raw data processing work can be done on many different APIs, but one tends to be best.

When we go into what Anthropic says the three Chinese LLM builders actually used the Claude API for — as an aside, Anthropic didn’t confirm that the attack was done through the API, the chat app, or Claude Code — the actual impact of the operations is very mixed. It’s hard to know how much untracked usage these labs deployed for other projects (or other American models).

To start, Anthropic puts DeepSeek first in their blog post because they’re the household name in the US for Chinese AI. The extent of their use is actually quite small, showing how this post is more about the big picture than the details:

DeepSeek

Scale: Over 150,000 exchanges

The operation targeted:

Reasoning capabilities across diverse tasks

Rubric-based grading tasks that made Claude function as a reward model for reinforcement learning

Creating censorship-safe alternatives to policy sensitive queries

In the scale of training a language model, 150K samples is only scratching the surface as a substantive experiment. It looks like they were experimenting with some rubrics, which could’ve been for an online RL run, but that’s extremely unlikely with how distributed the access was, and then some minor stuff on completions for sensitive queries. This usage of Anthropic’s API will have a negligible impact on DeepSeek’s long-rumored V4 model (or whichever model the data here contributed to). This was also very likely a small team at DeepSeek and unknown to much of the broader training organization.
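For readers unfamiliar with the term, "rubric-based grading as a reward model" can be pictured roughly as follows. This is a generic illustration, not a reconstruction of what any lab ran. The `Criterion` class, the weights, and the keyword checks are all hypothetical, and in practice each check would be a judgment from a grader model rather than a string test.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    name: str
    weight: float
    check: Callable[[str], float]  # hypothetical grader, returns a score in [0, 1]

def rubric_reward(response: str, rubric: List[Criterion]) -> float:
    """Collapse per-criterion scores into one scalar reward in [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    # Clamp each score so a misbehaving grader cannot push the reward out of range.
    weighted = sum(c.weight * min(1.0, max(0.0, c.check(response))) for c in rubric)
    return weighted / total_weight

# Toy rubric with keyword checks standing in for a grader model.
toy_rubric = [
    Criterion("states a final answer", 2.0,
              lambda r: 1.0 if "answer:" in r.lower() else 0.0),
    Criterion("shows at least one step", 1.0,
              lambda r: 1.0 if "step" in r.lower() else 0.0),
    Criterion("stays under 50 words", 1.0,
              lambda r: 1.0 if len(r.split()) <= 50 else 0.0),
]

good = rubric_reward("Step 1: add the numbers. Answer: 4", toy_rubric)  # 1.0
weak = rubric_reward("no idea", toy_rubric)                             # 0.25
```

An online RL run would need a score like this for every sampled response as training proceeds, which is part of why tightly coupled use looks implausible when access is spread thinly across many accounts.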

The other two labs, Moonshot AI (makers of the Kimi models) and MiniMax, reflected much broader usage.

Moonshot AI

Scale: Over 3.4 million exchanges

The operation targeted:

Agentic reasoning and tool use

Coding and data analysis

Computer-use agent development

Computer vision

MiniMax

Scale: Over 13 million exchanges

The operation targeted:

Agentic coding

Tool use and orchestration

The role of distillation is constantly changing. Distilling from Claude today for its agentic behavior is much more valuable than versions of Claude have been as a teacher in the past. Claude Opus 4.6 has a well-rounded agentic navigation that none of the other models quite match. Why not try training on some of the model outputs to see if your model absorbs it? Over the next few months, that’ll be less differentiated. It’s sort of like how all the models are way better at math today than most people need — there are plenty of places to distill from.

Estimates will vary, but if each exchange had 10-25K tokens, the total across these two labs, mostly from MiniMax, would be 150-400 billion tokens. This is a substantial amount, which could meaningfully improve a model’s post-training. For example, in Olmo 3 we had an SFT dataset of 20 billion tokens that could be built like this, and increasing it by 10X would be very reasonable.
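The back-of-the-envelope estimate can be reproduced in a few lines. The exchange counts are the "over N" figures quoted from Anthropic's post, so they are lower bounds, and the 10-25K tokens per exchange is the assumption stated in the text, not a measured value.

```python
# Lower-bound exchange counts quoted from Anthropic's post.
EXCHANGES = {"Moonshot AI": 3_400_000, "MiniMax": 13_000_000}

# Assumed tokens per exchange (the 10-25K range used in the text).
TOKENS_PER_EXCHANGE = (10_000, 25_000)

# Size of the Olmo 3 SFT dataset mentioned in the text.
OLMO3_SFT_TOKENS = 20_000_000_000

def token_range(exchanges, per_exchange=TOKENS_PER_EXCHANGE):
    """Return (low, high) total tokens for a given number of exchanges."""
    low, high = per_exchange
    return exchanges * low, exchanges * high

total_exchanges = sum(EXCHANGES.values())        # 16,400,000
low, high = token_range(total_exchanges)         # 164 billion to 410 billion
minimax_share = EXCHANGES["MiniMax"] / total_exchanges   # about 0.79

# How the estimate compares with a 20B-token SFT dataset.
multiple_low = low / OLMO3_SFT_TOKENS            # 8.2x
multiple_high = high / OLMO3_SFT_TOKENS          # 20.5x
```

Under these assumptions the range comes out at 164-410 billion tokens, which the text rounds to 150-400 billion, with MiniMax accounting for roughly four fifths of it.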

These numbers are just scratching the surface of total synthetic data generation across APIs hosted by US companies. At the same time, quantity is a pretty crude way to measure impact. Just taking the outputs from Claude and figuring out how to add them to your model pipeline isn’t easy. The research community has seen many cases where taking outputs from a certain teacher model unexpectedly makes the student worse — subtle interactions between the data make it variable and tricky to do this type of distillation. It’s fundamentally a research problem.

This is what I’m sure the Chinese labs are innovating at. There’s an argument that Chinese frontier labs are substantially more efficient than their Western counterparts — this is misleading.

The labs operate under different constraints. The Chinese labs are likely slightly more efficient out of necessity in being lower on resources, but overall the picture of talent access is very similar. The Chinese labs also approach benchmarks differently, making it appear that they’re a bit closer than they really are (and appearing as if they’re potentially surpassing). This is needed to get momentum and brand recognition in the AI market.

The Chinese labs likely innovate greatly on distilling from leading API models, due to their restricted access to GPUs. GPUs could be used to construct synthetic data, but for organizations with more funding than they can spend on research compute (being supply limited), using API-based models is one of the few other options for effectively getting more compute. It’s way easier to figure out getting access to “banned” API models than it is to smuggle tens of thousands of physical GPUs and get them set up.

It’s not only the Chinese labs that operate like this. Synthetic data from a model you don’t own is all arguably distillation. Distillation is a shortcut to more compute for anyone. It’s also a far less risky cost, as having a big cluster for research requires a very large financial commitment, where APIs are pay-as-you-go. For example, in Olmo 3 we used millions of GPU hours on the Frontier supercomputer and Azure credits through NAIRR for synthetic data. We didn’t have the equivalent in GPUs (or really the cash, thank you research credits!).

All together, it’s very fair for Anthropic to be concerned about this. I still wouldn’t say it is a crucial factor in these Chinese labs’ post-training capabilities, especially not one that’ll be easy to measure as a time gap to matching the model they’re distilling from, à la the US-China performance lag.

If we take a step back, there was even a time when Claude Sonnet was the flagship model ahead of Opus (I think this was with Sonnet 3.5). Much of this comes from it being well distilled internally from Opus checkpoints. Fast iteration and high-quality data can go very far, letting student models surpass the teacher. Frontier labs use this to their advantage by having internal-only models for generating synthetic data, but saying that Chinese models could never pass the US frontier due to data distillation is like saying that Claude Sonnet could never beat Opus. It’s unlikely, and it depends a lot on release times, but with AI models making dramatic progress, weirder things like this have already literally happened.

The biggest factor unaddressed here is how distillation from stronger teacher models is harder in an era when reinforcement learning at scale is needed to train the best models. You can spend compute carefully crafting and filtering prompts, but you still need to train the model yourself with substantial, on-policy inference — generation is the majority of the compute cost for RL and it can’t be generations from another model. For this reason, I expected this story to die down a bit. It’s clear from their open research that Chinese labs have excellent RL infrastructure, despite the compute shortages.
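A toy example shows what "on-policy" means here. The sketch below runs REINFORCE on a three-armed bandit, a deliberately tiny stand-in for an RL run on a language model. The reward values, learning rate, and step count are arbitrary assumptions. The point is structural: every update consumes a fresh sample drawn from the current policy, so the generation count grows with the number of training steps and the samples cannot be swapped for another model's outputs.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_on_policy(rewards, steps=300, lr=0.5, seed=0):
    """REINFORCE on a bandit: sample from the current policy, then update it."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    generations = 0
    for _ in range(steps):
        probs = softmax(logits)
        # On-policy step: the sample must come from the policy being trained.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        generations += 1
        reward = rewards[action]
        # Policy-gradient update: gradient of log pi(action) w.r.t. the logits.
        for i in range(len(logits)):
            indicator = 1.0 if i == action else 0.0
            logits[i] += lr * reward * (indicator - probs[i])
    return softmax(logits), generations

# Only the third action is rewarded in this toy setup.
final_probs, generations = train_on_policy([0.0, 0.0, 1.0])
```

One generation per update is trivial for a bandit. For a language model each generation is a full rollout of many tokens, which is the cost the paragraph above says cannot be outsourced to a teacher model.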

The reason I expected it to fade is that distilling models for “competitive purposes” has violated the terms of service for API models for quite some time. Academics and open model builders in the US used to greatly worry about and debate this (and I’ve written about it multiple times in 2022 and 2023). Only later in 2024 did that worry die down in the community (and no action has been taken against any smaller model builders).

This action from Anthropic represents another step ratcheting up the AI geopolitical tension. Kneecapping model distillation will be far harder than restricting the shipments of physical goods like GPUs. In many ways, fully restricting distillation through distributed access methods seems almost impossible, and restricting GPU sales would be far more impactful.

Anthropic and the AI industry should choose their battles. When API endpoints are available for the best models, other entities will use that to train variants of said model. This is a natural evolution of AI models. If AI models are so precious that distillation is an extreme risk, then the models will be restricted to first-party products. Anthropic has a choice to do this with their latest models. The market for API-based model alternatives may be so competitive that some companies go this path — likely in part due to Chinese models undercutting on price — but an API is a fundamental offering that no leading lab will risk walking back from anytime soon.
