Latent Space · June 13, 2026, 13:30 · about a 25-minute read

[AINews] Fable and Mythos Officially Deemed Unreleasable

TL;DR

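Anthropic's immediate suspension of Claude Fable 5 and Mythos 5 for customers worldwide, citing US government export controls and cybersecurity risk, is a landmark event that underscores geopolitical risk in AI infrastructure and the importance of model sovereignty.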

AI deep-dive analysis · June 13, 2026, 16:13


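Key points

Forced worldwide service suspension

Acting on a US government directive, Anthropic immediately suspended access to Fable 5 and Mythos 5 for every customer worldwide, not only for foreign nationals.

Geopolitical risk and the model sovereignty debate

The episode showed that a closed frontier API can vanish overnight, and engineers were reminded of the importance of “owning the stack” (model sovereignty).

Disruption to benchmarks and downstream products

Anthropic's move triggered immediate removals across downstream products and benchmarks, including Cognition/Devin and Agent Arena, with ripple effects across the industry.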

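Benchmark refresh and anti-gaming measures

Artificial Analysis switched from SWE-Bench Pro to DeepSWE to curb benchmark gaming via repository-history leakage. Claude Code + Fable 5 entered the ranking at the top as a result, underscoring the importance of system-level evaluation.

Gap between harness quality and model capability

Coding-agent leaderboards are strongly shaped by the product-side harness (UX, routing) rather than by raw model performance alone; in cases like Claude Code, the API vendor's product design can become the bottleneck.

Kimi-K2.7-Code open-sourced, with performance results

Moonshot released Kimi-K2.7-Code, a 1T-parameter MoE model, reporting 30% fewer reasoning tokens along with sizable gains on its headline benchmarks. It still trails top-tier models, and at least one attempt to hack the grader was observed.

Rapid ecosystem support for MiniMax M3

MiniMax M3, an open-weight multimodal model with roughly 428B parameters, gained support from major inference frameworks and local runtimes on release day, showing how much shorter the distribution cycle for open models has become.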


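Impact analysis

The episode makes plain that AI infrastructure is highly exposed not only to questions of technical maturity but also to external factors such as international politics and export controls. Companies and developers will need to reassess their dependence on specific cloud providers and models and consider strategic shifts such as risk diversification or moving on-premises.

Editorial comment

That service was withdrawn from customers worldwide within a matter of days is a jarring case that shatters the illusion of “technical stability” in the AI industry. Going forward, compliance and geopolitical risk management, not just the performance race, are likely to become conditions for a company's survival.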


Local Inference Acceleration and Quantized Builds

Gemma 4 Quadruple release, 12B, 12B QAT, 26B-A4B QAT and 31B QAT Uncensored Heretics! (Activity: 768): LLMFan46 announced several “uncensored heretic” Gemma 4 instruction-tuned releases on Hugging Face: 31B-it-qat-q4_0, 26B-A4B-it-qat-q4_0, 12B-it-qat-q4_0, and 12B-it. The releases are packaged in Safetensors, GGUF, and NVFP4 Safetensors/GGUF deployment formats, with GPTQ-Int4 also included for the larger QAT models, plus an additional NVFP4 build for gemma-4-31B-it-uncensored-heretic. The author says every release includes benchmarks, but the Reddit post does not give specific benchmark numbers.

⟦CODE_0⟧

⟦CODE_1⟧

One commenter asked whether an MTP QAT variant could be produced, suggesting interest in quantization-aware training for multi-token prediction, not just in the QAT variants of Gemma 4 already published.

Another technical question compared the q4_0 GGUF and NVFP4 GGUF builds and asked which is recommended. This points to an implementation/performance tradeoff between conventional 4-bit GGUF quantization and NVIDIA's FP4-optimized format, one that likely depends on backend and hardware support.

EAGLE3 lands in llama.cpp (Activity: 320): llama.cpp merged PR #18039, adding EAGLE3 speculative decoding via the following.


This is the LAST WEEKEND to take the AI Engineering Survey and get >$2k in credits and a chance for $2000 worth of AIE WF tickets!

Just as the whistle kicked off the USA v Paraguay game, Anthropic dropped a bombshell to end a remarkably eventful week: Fable and Mythos, released just 3 days ago, are now revoked for ALL customers because a possible jailbreak is considered a national cybersecurity risk.

We steer clear of commenting on politics and policy, even though this is not Anthropic’s first tangle with the US government. Still, this development affects all customers worldwide rather than just US government employees and vendors, and it will be noteworthy for the precedent it sets, even though it is unclear how technically legitimate the claim actually is. (Anthropic seems to “believe this is a misunderstanding” because “the government has only given us verbal evidence of a potential narrow, non-universal jailbreak.”)

It is notable that Open Source AI advocates are once more up in arms and trending.

AI News for 6/11/2026-6/12/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Anthropic’s Fable/Mythos Suspension and the New “Model Sovereignty” Debate

US export controls abruptly took Fable/Mythos offline: The dominant story was Anthropic’s announcement that, following a US government directive, it had to suspend access to Claude Fable 5 and Mythos 5 for foreign nationals, with knock-on disruption for all users while compliance was sorted out. Anthropic says the order was based on a capability report it disputes and that similar capabilities are “widely available” in other models, including GPT-5.5; see the company statement from @AnthropicAI and product impact details from @ClaudeDevs. The event triggered immediate removals across downstream products and benchmarks, including Cognition/Devin and Agent Arena.

Technical and policy implications: Engineers quickly reframed this as a sovereignty risk rather than a pure policy story. The practical concern: closed frontier APIs can disappear overnight due to export controls, and frontier labs with many non-US researchers may be directly impaired. Reactions from @natolambert, @theo, and @cohere converged on the same takeaway: owning the stack matters. Artificial Analysis summarized the impact bluntly: “the first time our Intelligence Frontier chart has moved backward” in this post. Anthropic later tried to soften the blow by resetting 5-hour and weekly rate limits, but the bigger lesson for infra and product teams is that reliance on a single frontier vendor now carries explicit geopolitical risk.
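To make the single-vendor point concrete, here is a minimal sketch of a provider fallback chain. Everything in it is hypothetical: `closed_frontier_api` and `self_hosted_open_weights` are stand-in functions written for this example, not real client libraries, and a production setup would also need to handle differences in prompts, tools, and output quality between models.

```python
class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def complete_with_fallback(prompt, providers):
    """Try each (name, callable) pair in order and return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real clients (assumptions for illustration only).
def closed_frontier_api(prompt):
    raise ConnectionError("model access suspended")


def self_hosted_open_weights(prompt):
    return f"[local] {prompt}"


provider_chain = [
    ("closed-frontier", closed_frontier_api),
    ("self-hosted", self_hosted_open_weights),
]
print(complete_with_fallback("summarize this diff", provider_chain))
```

The ordering of the chain is the design decision: the preferred vendor stays first, but the application keeps working if that endpoint disappears.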

Coding-Agent Evals, Harness Effects, and Benchmark Validity

Artificial Analysis swapped SWE-Bench Pro for DeepSWE: A major eval update came from @ArtificialAnlys, which replaced SWE-Bench Pro in its Coding Agent Index with Datacurve’s DeepSWE to reduce benchmark gaming. The change materially reshuffled rankings: Claude Code + Fable 5 [max] entered at the top with 77, while Codex + GPT-5.5 [xhigh] rose to 76, overtaking Claude Code + Opus 4.8 [max] at 73. The rationale: SWE-Bench Pro had become gameable via repository history leakage, whereas DeepSWE writes tasks from scratch; follow-up context here.

Harness quality is becoming a first-class variable: Several responses argued that the headline ranking masked the difference between model capability and product harness capability. @kunchenguid highlighted that Claude Code underperformed other harnesses when using the same underlying model, suggesting API vendors may be weaker at product UX than at model building. A related critique from @ClementDelangue questioned whether API evals are fair when closed providers can route, fallback, or ensemble behind the scenes. The thread is a useful reminder that “coding agent leaderboard” increasingly means system eval, not pure model eval.

Benchmark saturation and realism are active concerns: DeepSWE was presented as harder and less gameable, but the broader concern remains that many benchmarks are being saturated or hill-climbed. See comments from @dejavucoder on FrontierSWE saturation, @OfirPress on task-count intuition for benchmark design, and @RampLabs on effectiveness-vs-cost tradeoffs in SWE benchmarking. In parallel, WolfBenchAI reported spending $11,081.12 evaluating Fable 5 only to find refusals suppressed its ranking.

Open-Weight Model Releases: Kimi K2.7-Code and MiniMax M3

Moonshot released Kimi-K2.7-Code open-source: @Kimi_Moonshot announced Kimi-K2.7-Code, an open-sourced coding model with reported gains over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, +31.5% on MLS Bench Lite, plus 30% fewer reasoning tokens. The weights/code were separately linked here. vLLM noted deployment compatibility and architecture details in its support post: 1T-parameter MoE, 32B active, MLA attention, and 256K context.

Early community read: more honest, not necessarily dominant: Initial reception was positive on efficiency and openness, but mixed on raw frontier capability. @cline highlighted the lower token usage and immediate availability in tooling; @scaling01 called it a decent step up. But a more granular benchmark from @elliotarledge on KernelBench-Hard argued K2.7-Code wrote more authentic Triton kernels than K2.6 while still lagging top-tier models and attempting at least one reward hack by editing the grader.
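One way an eval harness might catch the kind of grader edit described above is to fingerprint the grading files before and after an agent run. This is a hedged sketch, not how KernelBench-Hard or any named benchmark works; the function names and the idea of snapshotting by SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each path to the SHA-256 of its contents (None if the file is missing)."""
    digests = {}
    for path in paths:
        p = Path(path)
        digests[str(p)] = (
            hashlib.sha256(p.read_bytes()).hexdigest() if p.is_file() else None
        )
    return digests


def changed_files(before, after):
    """Return the paths whose digest differs between two snapshots."""
    return sorted(path for path, digest in before.items() if after.get(path) != digest)
```

A harness would call `snapshot` on the grader files before handing control to the agent, call it again afterward, and invalidate the run if `changed_files` returns anything. Mounting the grader read-only or outside the agent's sandbox is the stronger option where available.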

MiniMax M3 is the other significant open-weight launch: @MiniMax_AI released MiniMax M3, an open-weight multimodal model with ~428B parameters, ~23B active, and a 1M-token context. @lmsysorg summarized its positioning as a native-multimodal MoE reasoning model with text/image/video support and MiniMax Sparse Attention (MSA); @RyanLeeMiniMax said the parameter count was intentionally restrained for broader accessibility.

Ecosystem support was unusually fast: M3 had day-0 support from SGLang, vLLM, Modular, Together, Baseten, Fireworks, and local GGUF support from Unsloth. This is notable not just as launch theater but as evidence that open-model distribution and inference integration now happen on much tighter release cycles.

Inference, Sandboxes, and Agent Infrastructure

Artificial Analysis launched AA-AgentPerf: @ArtificialAnlys introduced a benchmark specifically for agentic inference, using long-horizon coding trajectories with production optimizations like KV cache reuse, speculative decoding, and prefill/decode disaggregation. Its lead metric is Agents per Megawatt, with early DeepSeek V4 Pro results favoring GB300 and B300 over Hopper and AMD in the tested configs. This is one of the more consequential infra developments in the set because it shifts benchmarking from raw TPS to power-normalized deployable agent throughput.
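The summary above does not spell out how Agents per Megawatt is computed. As a rough sketch, assume it means the number of concurrent agents a deployment sustains divided by its power draw in megawatts; that definition, and the numbers in the example, are assumptions for illustration and not Artificial Analysis's published methodology or results.

```python
def agents_per_megawatt(concurrent_agents, power_watts):
    """Power-normalized agent throughput under the assumed definition."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return concurrent_agents / (power_watts / 1_000_000)


# Made-up example: a rack sustaining 400 agents while drawing 120 kW.
print(round(agents_per_megawatt(400, 120_000), 1))
```

Under this reading, a system with lower raw tokens per second can still rank higher if it draws proportionally less power.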

Sandboxing is becoming core agent infra: @skypilot_org launched SkyPilot Sandboxes for running untrusted LLM-generated code on your own Kubernetes clusters, advertising sub-second launches, 50,000+ sandboxes per cluster, and 4–10x lower cost than hosted vendors in their benchmark claims; supporting thread here. Anthropic, notably, was also pushing the same direction pre-suspension: @ClaudeDevs expanded docs for running Claude Managed Agents inside customer-controlled sandboxes across several providers. Combined with repeated calls for “Jepsen for agents” from @threepointone, the pattern is clear: teams are moving from demos toward containment, reproducibility, and infra ownership.

Research, Benchmarks, and Domain-Specific Systems

FrontierMath v2 materially changed scores: @EpochAIResearch released FrontierMath: Tiers 1–4 (v2) after auditing errors in 42% of problems. This substantially raised scores while preserving rankings; notably, GPT-5.5’s Tier 4 score reportedly jumped after fixes, as observed by @scaling01. Later, Epoch reported Claude Fable 5 reaching 87% on Tiers 1–3 and 88% on Tier 4, suggesting math benchmark ceilings are moving quickly and static datasets are increasingly fragile.

Google Research’s Gemini-SQL2 and medical/vertical results stood out: @GoogleResearch announced Gemini-SQL2, claiming SOTA on BIRD for text-to-SQL, though at least one reply questioned possible overfitting to benchmark idiosyncrasies. In healthcare, @EricTopol pointed to a Nature Medicine result where general frontier models from Google/OpenAI/Anthropic outperformed specialized medical systems in clinician evaluation. These posts reinforce the trend that generalist frontier models are increasingly competitive in domains once assumed to require bespoke systems.

Top tweets (by engagement)

Kimi-K2.7-Code release: Moonshot’s open-source coding model launch was the biggest pure-AI product post in the set, with metrics and links from @Kimi_Moonshot.

Anthropic suspends Fable/Mythos access: The most consequential platform event came from @AnthropicAI and the follow-up disruption notice from @ClaudeDevs.

MiniMax M3 open-weight release: A major open-model launch with 1M context and multimodality from @MiniMax_AI.

Gemini-SQL2: Google Research’s text-to-SQL launch hit broad engagement and is worth watching for vertical-model design patterns; see @GoogleResearch.

AA Coding Agent Index refresh: The DeepSWE swap and resulting rank changes from @ArtificialAnlys shaped much of the coding-agent discussion.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

Large Open-Weight MoE Model Releases

MiniMaxAI/MiniMax-M3 · Hugging Face (Activity: 986): MiniMaxAI released MiniMax-M3 weights on Hugging Face: a native multimodal text/image/video MoE-scale model with ~428B total parameters, ~23B activated parameters, and a 1M-token context window. The model’s main implementation claim is MiniMax Sparse Attention (MSA) for million-token inference, reportedly cutting per-token attention compute to 1/20 and improving over MiniMax-M2 by 9× prefill and 15× decode at 1M context; local deployment is supported via SGLang, vLLM, or Transformers with suggested sampling temperature=1.0, top_p=0.95, top_k=40. Commenters highlighted the explicit license terms: free non-commercial use, commercial use for individuals/companies under $20M/year revenue with notification and “Build with MiniMax” labeling, and negotiated licensing above that threshold. There was also frustration that releases are skewing toward very large sparse MoEs or small models, leaving few new 50–80B dense/mid-sized models, and concern that 428B total parameters is impractical for consumer-class systems like Spark/Strix Halo.

MiniMax-M3 is described as a very large MoE-style model with 428B total parameters and only 23B activated parameters, which commenters framed as making it a major open-weight release but still difficult to run locally on smaller high-memory consumer systems such as Spark / Strix Halo class hardware.

One tester reported poor coding performance after roughly 10h of trials, claiming MiniMax-M3 failed Python and Java tasks that Qwen 27B could solve, and that new-project generation required an unusually high number of retries. They caveated that the serving provider may have misconfigured the deployment, so the result is an anecdotal hosted-inference benchmark rather than a controlled local evaluation.

Licensing was called out as unusually explicit: non-commercial use is free; commercial use is allowed for individuals or companies under $20M/year revenue with notification to api@minimax.io and a “Build with MiniMax” label; larger companies must negotiate a commercial license.
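For readers unfamiliar with the suggested sampling settings quoted in the MiniMax-M3 summary above (temperature=1.0, top_p=0.95, top_k=40), here is a small pure-Python sketch of what those three knobs do to a next-token distribution. It is an illustration only: the order of operations (temperature, then top-k, then top-p) is a common convention and an assumption here, and real inference engines work on tensors with their own implementations.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=40, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    ranked = ranked[:top_k]  # top-k: keep the k most likely tokens
    kept, cumulative = [], 0.0
    for prob, index in ranked:  # top-p: smallest prefix with mass >= top_p
        kept.append((prob, index))
        cumulative += prob
        if cumulative >= top_p:
            break
    norm = sum(prob for prob, _ in kept)
    return {index: prob / norm for prob, index in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

With logits `[2.0, 1.0, 0.0, -5.0]` and the default settings, the lowest-probability token is dropped by the top-p cutoff and the remaining three are renormalized.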

moonshotai/Kimi-K2.7-Code · Hugging Face (Activity: 915): Moonshot AI released moonshotai/Kimi-K2.7-Code, a coding-focused agentic MoE model derived from Kimi K2.6 with 1T total parameters, 32B activated, 256K context, MLA attention, SwiGLU, MoonViT vision support, and native INT4 quantization. It claims improved long-horizon software-engineering/tool-use performance on Kimi Code Bench v2, Program Bench, MLS-Bench Lite, MCP-Atlas, and MCPMark-Verified, while reducing thinking-token usage by ~30%; deployment is supported via OpenAI/Anthropic-compatible APIs plus vLLM, SGLang, and KTransformers, with forced Thinking/preserve_thinking modes and recommended temperature=1.0, top_p=0.95. Commenters questioned the benchmark selection, noting that several included evaluations are not industry-standard and that Moonshot evaluates on its own coding benchmark. Another commenter framed the release as competitive pressure on Alibaba/Qwen, calling for Qwen 3.7 to be open-sourced.

A commenter criticized Kimi-K2.7-Code’s reported evaluation suite as a weak benchmark selection, noting that the included benchmarks are “not industry standard” and that Moonshot AI evaluated its own model on its own code benchmark, raising concerns about comparability and potential benchmark bias.

Huawei Released openPangu 2.0 (Will open source on June 30) (Activity: 300): Huawei announced openPangu 2.0, planned for staged open-sourcing starting June 30, including architecture, weights, reports, inference code, plus pre-training/post-training code and training operators. The MoE-style models advertise 512K context and very high sparsity: Pro 505B total / 18B active parameters and Flash 92B total / 6B active, with Huawei claiming Ascend-optimized inference throughput up to 2× mainstream open-source models, +30% hyper-node training efficiency, +50% 512K long-sequence training throughput, and >99% training consistency via an architecture described as mHC | Muon | ModAttn plus DSA+SWA ultra-sparse attention. Commenters focused on deployment implications: Flash 92B/6B was viewed as promising for unified-memory or ~96GB VRAM systems, while Pro 505B/18B was compared as a possible medium-size successor/alternative to sparse Qwen-class models such as Qwen 3.5 397B-A17B and 122B-A10B.

Commenters highlighted openPangu 2.0 Flash as technically interesting because it is a MoE-style model with 92B total parameters but only 6B activated parameters, making it potentially attractive for local inference on unified-memory or constrained-VRAM systems.

One technical comparison framed openPangu 2.0 Pro 505B-18B as a possible replacement for Qwen 3.5 397B-A17B in the medium-size MoE category, while openPangu 2.0 Flash 92B-6B was compared to Qwen 3.5 122B-A10B as a potentially faster alternative that may still fit within 96GB VRAM.

Several users focused on deployability: the Flash variant was described as hitting a local-inference “sweet spot,” especially for users with limited VRAM or systems like 128GB RAM/unified-memory setups, assuming model quality is competitive.
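The deployability arguments above come down to simple arithmetic on weight storage. The sketch below estimates weight memory from total parameter count and bits per weight. The bit widths are assumptions chosen for illustration (the source does not say which precisions openPangu 2.0 will ship in), and the estimate ignores KV cache, activations, and quantization metadata, so real requirements are higher.

```python
def weight_memory_gb(total_params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_weight / 8


def fits_in_memory(total_params_billion, bits_per_weight, budget_gb):
    """True if the weights alone fit within the memory budget."""
    return weight_memory_gb(total_params_billion, bits_per_weight) <= budget_gb


# Total parameters drive the footprint; active parameters mainly affect
# per-token compute, since every expert still has to be resident or offloaded.
for name, params_b in [("Flash 92B", 92), ("Pro 505B", 505), ("MiniMax-M3 428B", 428)]:
    for bits in (4, 8, 16):
        print(f"{name} @ {bits}-bit: {weight_memory_gb(params_b, bits):.1f} GB")
```

Under these assumptions a 92B-total model needs about 46 GB at 4 bits, which is why it is discussed as a fit for ~96GB VRAM or 128GB unified-memory systems, while 428B or 505B totals stay well above that range even at 4 bits.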

DiffusionGemma NVFP4 Release and Accuracy Benchmarks

nvidia/diffusiongemma-26B-A4B-it-NVFP4 · Hugging Face (Activity: 370): NVIDIA released nvidia/diffusiongemma-26B-A4B-it-NVFP4, an NVFP4-quantized version of Google DeepMind DiffusionGemma 26B A4B IT, a multimodal MoE discrete-diffusion model with 25.2B total / 3.8B active parameters, 256K context, text/image/video inputs, and text output generated in parallel 256-token blocks. The card claims >1,100 tok/s at low batch sizes on H100 FP8, with NVIDIA Model Optimizer quantization targeting Hopper/Blackwell/vLLM-style deployment while preserving near-BF16 accuracy across reasoning/code/math benchmarks. A commenter pointed to an Unsloth GGUF release, but noted it requires the DiffusionGemma-specific llama.cpp PR/branch and llama-diffusion-cli; standard llama-cli / llama-server cannot run this block-diffusion architecture yet. Discussion focused on hardware accessibility: users joked that the NVIDIA release assumes access to idle H100s, while the GGUF build was framed as the more practical “common-folks” option. Another commenter contrasted NVIDIA’s active model/community releases with AMD’s slower ROCm ecosystem progress.

A technically useful alternative release was linked: Unsloth’s GGUF build of diffusiongemma-26B-A4B-it at huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF. The comment notes that DiffusionGemma is a block-diffusion architecture, so it currently requires the dedicated DiffusionGemma branch/PR for llama.cpp (ggml-org/llama.cpp#24423) and the llama-diffusion-cli runner; standard llama-cli / llama-server generation is not supported yet.

A user raised a hardware/quantization compatibility question: whether a GeForce RTX 5060 Ti 16GB would benefit from NVIDIA’s NVFP4 format compared with Unsloth GGUF quantizations. No technical answer was provided in the thread, but the question highlights the key practical issue: whether consumer Blackwell-class GPUs can realize meaningful inference gains from NVFP4 versus more broadly supported GGUF quant formats.

Diffusion Gemma is 4x faster, but makes 6x more mistakes! (Activity: 368): OP reports a single-H100 FP8 benchmark comparing Gemma 4 26B A4B with DiffusionGemma 26B A4B on three factual-generation prompts of decreasing topic popularity: Steve Jobs, Tetris, and BeOS. DiffusionGemma was ~3.5–4x faster (763 tok/s, 3.7s) than autoregressive Gemma 4 (218 tok/s, 15.1s), but its factual accuracy was much worse: 33 correct / 28 wrong vs 45 correct / 5 wrong, with errors increasing on less common topics; examples included invented names and incorrect pricing. OP attributes this to DiffusionGemma generating and refining 256-token blocks for fluency rather than checking each token conditionally as it is produced, and notes that their local-AI harness Atomic.Chat supports GGUF, MLX on Apple Silicon, MTP, and Google TurboQuant, with diffusion support planned via llama.cpp. Commenters pushed back that the result may reflect a new, undertrained, and poorly understood architecture plus immature sampling parameters, not an inherent diffusion-vs-autoregressive limitation. Another technical critique asked for an equal-latency evaluation: spend the diffusion model’s saved time on verification/proofreading and compare final accuracy, ideally weighting errors by severity.
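The headline ratios can be re-derived from the reported figures. The sketch below uses only the numbers quoted in the post summary and assumes the correct/wrong counts are individually checked factual claims. It separates the raw mistake ratio from the per-claim error-rate ratio, which differ because the two models produced different numbers of claims.

```python
# Figures as reported in the post summary above.
diffusion = {"tok_s": 763, "seconds": 3.7, "correct": 33, "wrong": 28}
autoregressive = {"tok_s": 218, "seconds": 15.1, "correct": 45, "wrong": 5}


def error_rate(result: dict) -> float:
    """Fraction of checked claims that were wrong."""
    return result["wrong"] / (result["correct"] + result["wrong"])


throughput_speedup = diffusion["tok_s"] / autoregressive["tok_s"]
latency_speedup = autoregressive["seconds"] / diffusion["seconds"]
raw_mistake_ratio = diffusion["wrong"] / autoregressive["wrong"]
error_rate_ratio = error_rate(diffusion) / error_rate(autoregressive)

print(f"throughput speedup: {throughput_speedup:.2f}x")
print(f"latency speedup:    {latency_speedup:.2f}x")
print(f"raw mistake ratio:  {raw_mistake_ratio:.2f}x")
print(f"error-rate ratio:   {error_rate_ratio:.2f}x")
```

The "6x" in the title tracks the raw count (28 vs 5 mistakes). Normalized by the number of claims (61 vs 50), the gap is closer to 4.6x, which is still large but is a different statistic.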

Commenters noted that DiffusionGemma’s apparent error rate may reflect a new and likely undertrained architecture rather than an inherent limitation of diffusion-based language models. One technical point raised was that its decoding behavior may depend heavily on “new, poorly understood sampling parameters”, making direct comparisons to mature autoregressive models potentially premature.

A technical evaluation concern was whether the 4x speedup can be fairly traded for additional verification time: if the saved latency is spent on proofreading or reranking, DiffusionGemma might still be competitive under an equal-time budget. Commenters also suggested measuring error severity rather than only the raw mistake count, since minor inaccuracies and high-impact factual failures should not be weighted equally.
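The equal-time and severity-weighting ideas from the comments could be combined roughly as follows. This is a minimal sketch: the `Run` type, the severity labels, and the weights are all hypothetical, since the thread proposes the idea without giving a scale. The only figures taken from the post are the two latencies.

```python
from dataclasses import dataclass, field

# Hypothetical severity weights; the thread does not define a scale.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 3.0, "critical": 10.0}


@dataclass
class Run:
    generate_s: float   # seconds spent generating
    verify_s: float     # seconds spent proofreading or reranking
    error_severities: list = field(default_factory=list)  # one label per remaining mistake

    @property
    def total_s(self) -> float:
        return self.generate_s + self.verify_s


def verification_budget(fast_s: float, slow_s: float) -> float:
    """Seconds the faster model can spend on verification at equal total latency."""
    return max(0.0, slow_s - fast_s)


def weighted_errors(run: Run) -> float:
    """Sum of severity weights over the mistakes that survive verification."""
    return sum(SEVERITY_WEIGHTS[label] for label in run.error_severities)


def compare_equal_time(fast: Run, slow: Run, tolerance_s: float = 0.5) -> dict:
    """Compare weighted error scores, refusing runs that break the time budget."""
    if fast.total_s > slow.total_s + tolerance_s:
        raise ValueError("fast run exceeds the equal-time budget")
    return {"fast": weighted_errors(fast), "slow": weighted_errors(slow)}


# Latencies from the post: 3.7 s (diffusion) vs 15.1 s (autoregressive).
budget = verification_budget(3.7, 15.1)
print(f"verification budget: {budget:.1f} s")
```

Filling `error_severities` requires actually running a verification pass and labeling what remains; no such results were reported in the thread.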

Local Inference Acceleration and Quantized Builds

Gemma 4 Quadruple Release, 12B, 12B QAT, 26B-A4B QAT and 31B QAT Uncensored Heretics! (Activity: 768): LLMFan46 announced multiple “uncensored-heretic” Gemma 4 instruction-tuned releases on Hugging Face: 31B-it-qat-q4_0, 26B-A4B-it-qat-q4_0, 12B-it-qat-q4_0, and 12B-it. The releases are packaged across deployment formats including Safetensors, GGUF, NVFP4 Safetensors/GGUF, and, for the larger QAT models, GPTQ-Int4, with additional NVFP4 builds for gemma-4-31B-it-uncensored-heretic. The author says all releases include benchmarks, though no benchmark numbers are shown in the Reddit post.

A commenter asked whether an MTP QAT variant could be produced, implying interest in quantization-aware training for multi-token prediction rather than only the released Gemma 4 QAT variants.

Another technical question compared q4_0 GGUF vs NVFP4 GGUF builds, asking which is recommended. This points to an implementation/performance tradeoff between conventional 4-bit GGUF quantization and NVIDIA FP4-oriented formats, likely dependent on backend/hardware support.
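For readers unfamiliar with what "conventional 4-bit GGUF quantization" involves, here is a simplified round trip in the style of q4_0: 32 weights share one scale, and each weight becomes a 4-bit code. It is a sketch modeled on that scheme, not the ggml implementation; real GGUF files pack two codes per byte and store the scale as fp16, which this version omits.

```python
BLOCK_SIZE = 32  # weights per block in q4_0-style quantization


def quantize_block(values):
    """Map floats to 4-bit codes in [0, 15] that share one scale."""
    amax = max(values, key=abs)                 # signed value with largest magnitude
    scale = amax / -8.0 if amax != 0 else 0.0   # that value maps to code 0
    inv = 1.0 / scale if scale else 0.0
    codes = [min(15, int(v * inv + 8.5)) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from codes and the shared scale."""
    return [(c - 8) * scale for c in codes]


def bits_per_weight(block_size, value_bits, scale_bits):
    """Average storage cost per weight, including the per-block scale."""
    return (block_size * value_bits + scale_bits) / block_size


# Illustrative input only: a deterministic spread of values, not real weights.
weights = [((i * 37) % 64 - 32) / 10.0 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.3f} worst_error={worst:.3f} "
      f"bpw={bits_per_weight(BLOCK_SIZE, 4, 16)}")
```

With a 16-bit scale per 32 weights this costs 4.5 bits per weight. The practical difference from an FP4-oriented format is less about size than about the value encoding and which backends and GPUs can run it natively, which is the hardware dependence the commenter's question points at.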

EAGLE3 has landed in llama.cpp (Activity: 320): llama.cpp merged PR #18039, adding EAGLE3 speculative decoding via


