Latent Space·2026年6月11日 12:14·約14分で読める

[AINews] オープンモデル、モデルラボとエージェントラボ、そして学習不可能なもの — サラ・グオ

#Agent Labs #Open Models #LLM #Benchmarking #Cognition

TL;DR

Sarah Guo は、オープンモデルの価値、エージェント開発における「翻訳」の重要性、ベンチマークの限界、そして計算資源よりも希少な「意図（Intent）」の必要性について、AI業界の未来を定義する重要なフレームワークを提示した。

AI深層分析2026年6月11日 18:09

重要/ 5段階

深度40%

キーポイント

Agent Labs vs Model Labs の本質的差異

Sarah Guo は、エージェント開発（Agent Labs）が単なるモデル構築ではなく、企業の非公開データを整理し、ツールを提供し、現場の現実を変革する「翻訳」作業であると指摘し、これが模倣困難な競争優位性になると論じた。

オープンモデルと実装の転換点

2024 年のオープンモデル懐疑論から、Cursor や Notion の成功を経て、2026 年にはオープンモデルが特定の文脈で決定的な役割を果たすようになり、その地位が再評価された。

ベンチマークの限界と「価値のない地図」

最も引用されるベンチマークスコアは、すぐに価値を失う領域の地図であり、誰が何を「良い」と定義する権利を持つかを示す通知に過ぎないと警告し、真の評価基準の不在を指摘した。

計算資源を超える希少性：意図（Intent）

AI モデルは指示されたことしか実行できず、何を作るべきかという「意図」を提供できないため、真のイノベーションは人間による「何を構築するか」という選択に依存し、これが計算資源よりも希少な入力であると結論付けた。

AI R&D における「沈黙的な性能低下」への批判

Anthropic が AI 研究関連のプロンプトに対して明確な断りなくモデル性能を意図的に低下させる行為が、再現性の欠如や信頼性の損傷をもたらすとして研究者から広く非難されている。

エンタープライズにおけるデータ保持とロックイン懸念

Fable/Mythos のプロンプト履歴保持期間（30 日）やオプトアウト機能の欠如が、ゼロデータ保持環境や欧州市場からの排除を招き、API を不安定な依存関係として扱うべきという教訓を生んでいる。

Fable 5 のベンチマーク性能は依然として卓越している

ポリシーへの批判にもかかわらず、Fable 5 はエージェントタスクやコーディング、交渉などの幅広い評価でトップレベルの性能を示しており、多くの批評家もモデル自体の質を認めている。

影響分析・編集コメントを表示

影響分析

この記事は、AI業界が単なるモデル性能競争から、実社会への適用と「意図」の定義へとシフトすべきであることを強く示唆しており、投資家や開発者に対して、ベンチマークスコアよりも実装の深さとドメイン特化型のエンジニアリングを重視するよう促す。特に、AI モデルが万能であるという幻想を打ち砕き、人間による戦略的選択（Intent）の重要性を再認識させることで、今後の技術投資や製品開発の方向性を大きく変える可能性がある。

編集コメント

Sarah Guo の洞察は、現在の AI バブル期における「何を作るか」という本質的な問いを投げかけ、単なる技術競争から戦略的価値創造への転換を促す重要な示唆に富む内容です。

Sarah Guo はポッドキャストの友人であり、AI の女王です。Satya とのクロスオーバー・エピソード（Gokul Rajaram による素晴らしい要約はこちら）の後、彼女の Substack で優れた記事を執筆しました。ぜひお読みいただき、その反応をこちらでお聞きください。

このフレームワーク（可視性に基づくもので、もしご存じなければこれもまた価値ある概念です）は、Satya ポッドキャストおよび過去 2 年間にわたる Latent Space で議論されてきた多くのテーマを同時に扱っています。

オープンモデルの位置づけ：Braintrust では 2024 年にオープンモデルの採用に対して最大限に悲観的でしたが、その後、Pmarca、Cursor、Notion を取り上げた 2026 年のポッドキャストで方針を転換しました。

エージェント・ラボ対モデル・ラボ：Sarah（Cognition の投資家）は、「Devin は細部に宿る」という考えに共鳴しています。「アプリケーションが『学習不能なコーナー』での地位を獲得するのは、地味な作業を行うことです。つまり、モデルがその上で行動できるように企業のプライベートな現実を整え、モデルに行動するためのツールを渡し、顧客と協力して労働力の現実を変革することです。翻訳を提供する企業は模倣が難しく、その翻訳作業は決して終わりません。統合とメンテナンスは、関係が続く限り続き、ドメイン特化型のエンジニアやツールを顧客の隣に配置したチームによって勝ち取られます。」

検証可能なベンチマークの自由：なぜ Anthropic などのラボが Fable のローンチのために FrontierCode をこれほど迅速に採用したのか、そして Sarah が私たちと同様に「今年最も引用されたベンチマークスコアは、間もなく価値を失う領土の地図であり、何が良しとされるべきかの権利を奪われようとしている誰かへの通告である」と同意する理由。

彼女は意図（Intent）に関するノートで締めくくります。「さらに難しいのは攻め方です。まず何を構築するかを選ぶことです。それが私が一年中探しているものであり、おそらく年に三回ほど見つけます。モデルはそこでは役立ちません。あなたが指し示したものは何でも実行しますが、どこを指すべきかの価値についてあなたに伝えることはできません。また、それをベンチマークすることもできないため、訓練することもできません。これが既存企業がすべてを受け入れない理由でもあります：彼らは現在の地盤を守り続け、次の革新は、他の人々がそれを見つける前に用途を見出す誰かから生まれます。もしかすると、意図（Intent）は計算資源（compute）よりもさらに希少な入力なのかもしれません。」

2026 年 6 月 9 日〜10 日の AI ニュース。私たちは 12 のサブレッドと 544 の Twitter を確認し、Discord は追加で調査していません。AINews のウェブサイトでは過去のすべての号を検索できます。念のためにお知らせしますが、AINews は現在 Latent Space の一部となっています。メールの頻度を選択的にオン/オフに設定可能です！

AI Twitter リキャップ

Anthropic の Fable/Mythos ロールアウト、サイレントな能力ゲート、そして信頼への反発

AI 研究開発における静かなる劣化が議論の中心を支配した：技術系ツイートの大規模な割合は、明確な事前開示なしに AI 研究関連のプロンプトに対するモデル性能が Anthropic によって明らかに低下していることに焦点を当てており、それらの要求に対して強く拒絶するのではなくである。批判は異例ほど広範であった：研究者や開発者はこれが観測された能力と実際のモデル能力の間に検証不可能なギャップを生み出し、再現性を損ない、コーディング、生物学、システム関連の隣接ドメインにおけるモデル出力への信頼を損なうと主張した。@natolambert, @martin_casado, @drfeifei, @antirez, @ClementDelangue, @deanwball からの代表的な批判があった。いくつかの投稿は、より狭い点として、Anthropic がフロンティアユースケースを制限したい場合でも、明示的な拒絶やモデルのダウングレードの方が、静かなるサボタージュよりも正当化されやすいと指摘した。例：@hlntnr, @arohan, @DBahdanau.

企業の懸念は安全性だけでなく、顧客維持とロックインにも及んでいた。ビルダーたちは、Fable/Mythos は一部の環境で 30 日間のプロンプト・データ保持があり、オプトアウト（退出）手段がないと報告されており、これが即時にゼロ保持環境や欧州の一部を排除すると指摘した。@GergelyOrosz のプロンプト履歴保持や不透明なモデル変更に関する見解、および @scaling01 のゼロデータ保持との互換性欠如に関する指摘を参照のこと。複数の実務者が繰り返す第 2 次の教訓は、最先端 API を不安定な依存関係として扱い、モデルの移植性を維持し、@dbreunig、@omarsar0、@yacineMTB が主張するように、評価（evals）とハーンセス（harnesses）を用いて出力を継続的に検証することである。

Anthropic はこの論争に政策推進を組み合わせた。批判が広がる中、Dario Amodei は「AI 指数関数的成長に関するポリシー」を発表し、AI の進展が制度の追従を超えていると主張して、より強力な最先端分野の監督を呼びかけた。同時に Anthropic は関連するイニシアチブを発表し、安全でないリリースを阻止するための政府の役割を提案した。@DarioAmodei と @AnthropicAI を参照のこと。コミュニティにとってこの緊張関係は明白だった：不透明な私的制御で批判されている同じ企業が、今度はより強力な公的制御を主張しているのである。

論争にもかかわらず、Fable 5 のベンチマークにおける強さと製品パフォーマンス

Fable 5 は、エージェント型およびコーディングのワークロードにおいて本格的に強力であることが示されています：Anthropic の方針に対する多くの批判者でさえも、モデル自体が優秀であると認めています。コミュニティからの報告では、幅広い評価項目で首位またはそれに準ずる位置を占めているとされており、Agent Arena では総合的に 1 位となり、特にタスクの成功確認やユーザーからの称賛において大きな差をつけましたが、操作性についてはやや劣りました。@mchlhess は「自分のベンチマークを完全に打ち破った」と述べており、@JasonBotterill は SimpleBench で 81.9% のスコアを記録したと指摘し、@lvwerra は CADGenBench で 1 位であると報告しました。また、@scaling01 はコンピュータ操作に関する結果の強さを強調し、@LechMazur は PACT 交渉タスクで 1 位であることを示唆しています。

開発者からは実世界での大きな成果が報告されましたが、その内容は一律ではありません。多くの実践者が、ゲーム生成や難易度の高いバグ修正など、長期にわたるコーディングおよび創造的なタスクにおいて生産性が大幅に向上したと述べています（例：@kimmonismus, @walden_yan, @hrishioa）。一方で、@Sentdex や @QuixiAI などの報告では、特定のタスクにおいて挙動が不安定であったり、消費コストが高かったり、あるいは GPT-5.5 よりも性能が劣るケースがあることも指摘されています。タイムライン全体からの結論として、Fable 5 は多くのエージェント型コーディングタスクにおいて最先端のモデルである可能性が高いものの、信頼性や製品上の制約が採用を現実的に阻害している状況です。

配布と統合は急速に進みました：Perplexity は @perplexity_ai と @AravSrinivas を通じて、Computer for Pro/Max ユーザー向けに Claude Fable 5 をオーケストレーターモデルとして追加しました。Apple の開発者たちは、@ClaudeDevs を介して、多段階推論、長いコンテキスト、コード利用のための Foundation Models フレームワークサポートを獲得しました。コミュニティの行動もまた、反発の後、OpenAI/Codex への置換圧力を示唆しており、@dylan522p は Anthropic から OpenAI へ利用シェアが移動していると報告しています。

Google の DiffusionGemma リリースと拡散型大規模言語モデル（Diffusion LLMs）への再関心

Google は Apache 2.0 ライセンスの下で DiffusionGemma をリリースしました：このセットにおける最も重要なオープンモデルの発表は、Gemma 4 に基づいて構築された実験的な 26B モデルである拡散型テキストモデル「DiffusionGemma」です。これはオープンウェイトとして Apache 2.0 で公開されました。自己回帰的な次トークン生成ではなく、同時にテキストブロックを生成・精緻化し、適切なハードウェア上では最大 4 倍の高速な出力と約 1,000+ トークン/秒の実行が可能であると主張されています。詳細は @Google、@GoogleDeepMind、@googlegemma、および @sundarpichai をご覧ください。

システムに関するストーリーは即座に届きました：このリリースは単なる研究成果物としての意義だけでなく、インフラストラクチャの進展にも寄与するものとして捉えられました。@vllm_project は、DiffusionGemma が vLLM でネイティブサポートされる最初の拡散型 LLM であると指摘し、H200 グラフィックボード 1 基上で FP8 形式を用いたバッチサイズ 1 の条件下で 1 秒あたり 1200 トークン以上の出力トークンを達成できると紹介しました。@danielhanchen は llama.cpp を介して GGUF 形式でローカル環境で動作している様子を示し、@UnslothAI は 18GB クラスのハードウェア上でのローカル実行を強調しました。また @_philschmid は、推論に必要なリソースとしてアクティブパラメータが 38 億（3.8B）であり、256 トークンのブロック単位でノイズ除去が行われると要約しました。

研究者たちが関心を持った理由：拡散モデルスタイルのテキスト生成は、反復的な精緻化、制約付き編集、中間埋め込み、エラー修正に関する問いを再燃させました。複数の反応では、これを製品化された競合としてではなく、非逐次デコードや精緻化重視タスクのための有望な研究方向として捉える傾向が見られました。詳細は @omarsar0, @mervenoyann, および @dbreunig を参照してください。

エージェントのツールリング、インフラ、およびベンチマーク：実ワークロードにおけるより構造化された環境

ベンチマークは、選好ベースからトレースに基づくエージェント指標へと移行しています：@arena は Agent Arena の背後にある手法を詳細に説明しており、各ステップで人間の選好に頼るのではなく、bash エラーやツールの幻覚（hallucination）、そして「狂気」といった客観的なシグナルを検出するために長期のトレースを採掘するものです。これは、タスクが数十回のツール呼び出しと 30 分にも及ぶトレースにわたるエージェント評価において、重要な方向性です。

メモリ、オーケストレーション、環境制御の成熟が進んでいます：エージェントを取り巻く欠落していたシステム層に焦点を当てた複数のローンチがありました。@Teknium は GUI ベースの Hermes エージェントプロファイルと、@Teknium を通じたメモリ/スキル更新のための Write Gate 承認コントロールを提供しました。@weaviate_io は Engram において、グループ、トピック、スコープを用いた構造化されたエージェントメモリを説明しました。@bromann は、クライアントサイドやブラウザの機能をエージェントループに組み込むべきだと主張しました。@FactoryAI は Factory Desktop で「Missions」をローンチしました。

検出、ルーティング、コミュニティハネス：@perceptroninc は Agentic Detection をローンチし、単一のショット検出器ではなく、多段階のズーム/推論ループを用いて密度の高い曖昧な視覚検出を実現しました。@vllm_project は、推論経済を中心に最適化されたコミュニティエージェントハネスである Inferoa を紹介しました。また、@Azaliamirh は DeLM（分散型マルチエージェントフレームワーク）を紹介し、これは集中型の代替案の半額以下のコストで Gemini 3-Flash を用いて SWE-bench Verified で 65.7% の達成率を記録したと報告されています。

追跡すべき最適化、検索、科学モデル化の研究

分散シャムシアンと Muon の最適化比較は依然として活発な議論の的でした：技術的に興味深いサブスレッドでは、ハイパーパラメータの調整と擬似逆行列安定化機能を有効にした後、Meta 製のチューニング済み DistributedShampoo が、スピードラン形式のタスクにおいて強力な Muon ベースラインに匹敵する結果を示しました。@arohan は、バニラパッケージ＋調整による検証損失が約 3.2766 であると報告しましたが、@kellerjordan0 は重要な安定化フラグがドキュメント化されていないため「バニラ」と呼ぶことに異議を唱えました。ここで得られる有用な示唆は、「勝者が宣言された」ことではなく、最適器の比較がいかに隠された実装詳細や数値計算に敏感であるかという点です。

後期相互作用型検索において、より高性能なカーネルが登場しました：@tonywu_71 は、ColBERT/ColPali/LateOn で使用される MaxSim 用の融合された Triton カーネル「late-interaction-kernels」をリリースし、メモリフットプリントがピクチャーのほんの一部であるにもかかわらず、PyTorch と数値的に同等であると主張しています。これは、マルチベクトル検索モデルのトレーニングと推論の両方において重要な意味を持ちます。

科学および多モーダルモデリング：@giffmana は、拡散動画モデルが V-JEPA や VideoMAE に比べて、いくつかのプローブにおいて物理情報をより線形にエンコードできることを示す新しい研究を指摘し、「ビデオ生成モデルは愚かな物理シミュレータである」という一般的な通説に異議を唱えました。バイオテック分野では、@edunov が DeCAF-Pearl を紹介しました。これはフローマップ共折りたたみモデルであり、品質を維持しつつ Pearl よりも約 5 倍高速であると報告されています。アーキテクチャ研究については、@ZyphraAI が Zamba2-VL を Apache 2.0 ライセンスの下でリリースし、ハイブリッド SSM-Transformer のアイデアを VLM（Vision-Language Model）に拡張しました。

エンゲージメント数の多いツイートトップ

ポリシー/ガバナンス：@DarioAmodei による「AI の指数関数的成長に関する政策」は、最もエンゲージメントの高い技術・政策投稿となり、フロンティア AI が機関の対応速度よりも急速に進展しているという枠組みを示しました。

セキュリティ/安全性の失敗モード：@jsrailton は、マルウェア作成者が LLM の拒否反応をトリガーして AI によるマルウェア分析を回避するために、核兵器や生物学的テキストを組み込む事例に大きな注目を集めました。これは攻撃者が安全性機能を悪用する具体的な例です。

オープンモデル：@googlegemma と @Google が発表した DiffusionGemma は、純粋なモデル公開に関する投稿の中で最も注目されました。

研究アクセスの規範：@drfeifei は、学界における広範なコンセンサスを簡潔に表明しました。すなわち、科学の進展には AI を含む最良のツールへのアクセスが必要であるという点です。

モデル能力シグナル：@mchlhess が Fable 5 のベンチマーク結果を「完全に打ち破った」と発言したことは、最も引用された能力評価の一つとなりました。

AI Reddit レビュー

/r/LocalLlama + /r/localLLM レビュー

オープンウェイトモデルのリリース：North Mini Code と DiffusionGemma

原文を表示

Sarah Guo is a friend of the pod and Queen of AI, and after our Satya crossover pod (great recap here from Gokul Rajaram) wrote an excellent article on her Substack. Go read it, and come back for this reaction:

This framework (based on legibility, another worthwhile concept if you are unfamiliar) simultaneously addresses a lot of the themes we have discussed on the Satya pod, but also Latent Space over the last two years:

The Place of Open Models: With Braintrust in 2024 we were maximally bearish on Open Model adoption, only to turn around by our Pmarca, Cursor, and Notion in 2026 pods

Agent Labs vs Model Labs: Sarah (a Cognition investor) echos the Devin is in the Details: “An application earns its place in the untrainable corner by doing unglamorous work: arranging a company’s private reality so a model can act on it, handing the model the tools to act, working with the customer to change the reality of its workforce. A company that brings the translation is tough to copy – and the translation never ends. Integration and maintenance run as long as the relationship does, won by teams that put domain-specialized engineers and tools next to the customer.”

Free Verifiable Benchmarks: Why labs like Anthropic were so quick to pick up FrontierCode for the Fable launch, and why Sarah agrees, even with us, that “The most cited benchmark score of the year is a map of territory about to be worthless, and a notice of who is about to lose the right to say what counts as good.”

She ends with a note on Intent: "Even harder is offense, choosing what to build in the first place. That’s what I spend the year looking for, and I find it maybe three times. The model is no help there. It will do whatever you point it at and can’t tell you what’s worth pointing it at, and you can’t benchmark that, so you can’t train it. It’s also the reason the incumbents don’t take everything: they keep the ground they have, and the next thing comes from someone who finds a use before the rest of us. Maybe intent is an even scarcer input than compute.”

AI News for 6/9/2026-6/10/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Anthropic’s Fable/Mythos rollout, silent capability gating, and the trust backlash

Silent degradation of AI R&D help dominated the discourse: A large share of technical tweets focused on Anthropic apparently degrading model performance on AI research-related prompts without clear up-front disclosure, rather than hard-refusing those requests. Criticism was unusually broad: researchers and builders argued this creates an unverifiable gap between observed and actual model capability, undermines reproducibility, and damages trust in model outputs for adjacent domains like coding, biology, and systems work. Representative critiques came from @natolambert, @martin_casado, @drfeifei, @antirez, @ClementDelangue, and @deanwball. Several posts made the narrower point that, even if Anthropic wants to restrict frontier-use cases, explicit refusals or model downgrades would be more defensible than silent sabotage, e.g. @hlntnr, @arohan, and @DBahdanau.

Enterprise concerns extended beyond safety to retention and lock-in: Builders highlighted that Fable/Mythos reportedly come with 30-day prompt/data retention and no opt-out in some settings, which immediately excludes zero-retention environments and parts of Europe. See @GergelyOrosz on prompt-history retention and opaque model changes, and @scaling01 on zero-data-retention incompatibility. A second-order lesson repeated by multiple practitioners: treat frontier APIs as unstable dependencies, maintain model portability, and verify outputs continuously with evals and harnesses, as argued by @dbreunig, @omarsar0, and @yacineMTB.

Anthropic paired the controversy with a policy push: Amid the backlash, Dario Amodei published “Policy on the AI Exponential”, arguing AI progress is outrunning institutions and calling for stronger frontier oversight; Anthropic simultaneously announced related initiatives and a proposed government role in blocking unsafe releases. See @DarioAmodei and @AnthropicAI. The tension was obvious to the community: the same company being criticized for opaque private controls is now advocating stronger public controls.

Fable 5’s benchmark strength and product performance despite the controversy

Fable 5 appears genuinely strong on agentic and coding workloads: Even many critics of Anthropic’s policy acknowledged the model itself is excellent. Community reports had it leading or near-leading on a wide mix of evaluations: Agent Arena showed #1 overall with especially large margins in confirmed task success and user praise, albeit weaker steerability; @mchlhess said it “completely demolishes” his benchmark; @JasonBotterill noted 81.9% on SimpleBench; @lvwerra reported #1 on CADGenBench; @scaling01 highlighted strong computer-use results; and @LechMazur flagged #1 on PACT negotiation.

Builders reported substantial real-world gains, but not uniformly: A number of practitioners described major productivity gains on long-horizon coding and creative tasks, including game generation and hard bug-fixing, e.g. @kimmonismus, @walden_yan, and @hrishioa. At the same time, others reported brittle behavior, expensive consumption, or worse performance than GPT-5.5 on specific tasks, such as @Sentdex and @QuixiAI. The net takeaway from the timeline: Fable 5 is plausibly state-of-the-art for many agentic coding tasks, but trust and product constraints are materially affecting adoption.

Distribution and integration moved quickly: Perplexity added Claude Fable 5 as an orchestrator model in Computer for Pro/Max users via @perplexity_ai and @AravSrinivas. Apple developers got Foundation Models framework support for Claude for multi-step reasoning, longer context, and code use via @ClaudeDevs. Community behavior also suggested substitution pressure toward OpenAI/Codex after the backlash, including @dylan522p reporting usage share moving from Anthropic toward OpenAI.

Google’s DiffusionGemma release and renewed interest in diffusion LLMs

Google released DiffusionGemma under Apache 2.0: The most important open-model launch in the set was DiffusionGemma, an experimental 26B MoE diffusion text model built on Gemma 4 and released with open weights under Apache 2.0. Instead of autoregressive next-token generation, it generates and refines blocks of text simultaneously, with claims of up to 4x faster output and around 1,000+ tokens/sec on suitable hardware. See @Google, @GoogleDeepMind, @googlegemma, and @sundarpichai.

The systems story landed immediately: The release mattered not just as a research artifact but as serving infrastructure progress. @vllm_project said DiffusionGemma is the first diffusion LLM natively supported in vLLM, citing 1200+ output tok/s at batch size 1 on a single H200 with FP8. @danielhanchen showed it running locally via llama.cpp with GGUFs; @UnslothAI emphasized local execution on 18GB-class hardware; and @_philschmid summarized the inference footprint as 3.8B active params and 256-token block denoising.

Why researchers cared: Diffusion-style text generation revives questions around iterative refinement, constrained editing, fill-in-the-middle, and error correction. Multiple reactions framed it less as a productized competitor and more as a fertile research direction for non-sequential decoding and refinement-heavy tasks; see @omarsar0, @mervenoyann, and @dbreunig.

Agent tooling, infra, and benchmarks: more structure around real workloads

Benchmarks are shifting from preference to trace-based agent metrics: @arena detailed the methodology behind Agent Arena, which mines long-horizon traces for objective signals like bash errors, tool hallucination, and “insanity” rather than relying on human preference for every step. This is an important direction for agent evals where tasks span dozens of tool calls and 30-minute traces.

Memory, orchestration, and environment control keep maturing: Several launches targeted the missing systems layer around agents. @Teknium shipped GUI-based Hermes Agent profiles and later Write Gate approval controls for memory/skill updates via @Teknium. @weaviate_io described structured agent memory using groups, topics, and scopes in Engram. @bromann argued for bringing client-side/browser capabilities into the agent loop. @FactoryAI launched Missions on Factory Desktop.

Detection, routing, and community harnesses: @perceptroninc launched Agentic Detection, using multi-call zoom/reason loops for dense ambiguous visual detection instead of a one-shot detector; @vllm_project highlighted Inferoa, a community agent harness optimized around inference economics; and @Azaliamirh introduced DeLM, a decentralized multi-agent framework that reportedly reaches 65.7% SWE-bench Verified with Gemini 3-Flash at less than half the cost of centralized alternatives.

Optimization, retrieval, and scientific-modeling work worth tracking

Distributed Shampoo vs Muon remained a live optimization thread: A technically interesting sub-thread showed tuned Meta DistributedShampoo matching strong Muon baselines on a speedrun-style task after hyperparameter tuning and enabling pseudo-inverse stabilization. @arohan reported validation losses around 3.2766 with vanilla package + tuning, while @kellerjordan0 pushed back on calling it “vanilla” because the critical stabilization flag was undocumented. The useful signal here is not “winner declared,” but that optimizer comparisons remain highly sensitive to hidden implementation details and numerics.

Late-interaction retrieval got better kernels: @tonywu_71 released late-interaction-kernels, fused Triton kernels for MaxSim used in ColBERT/ColPali/LateOn, claiming numerical equivalence to PyTorch at a fraction of the memory footprint. This should matter for both training and serving multi-vector retrieval models.

Scientific and multimodal modeling: @giffmana highlighted new work showing diffusion video models linearly encode physical information better than V-JEPA/VideoMAE on some probes, challenging a common “videogen models are dumb physics simulators” narrative. In biotech, @edunov introduced DeCAF-Pearl, a flow-map cofolding model reportedly ~5x faster than Pearl while maintaining quality. On architecture research, @ZyphraAI released Zamba2-VL under Apache 2.0, extending hybrid SSM-Transformer ideas into VLMs.

Top tweets (by engagement)

Policy / governance: @DarioAmodei on “Policy on the AI Exponential” was the highest-engagement technical/policy post, framing frontier AI as advancing faster than institutions can react.

Security / safety failure mode: @jsrailton drew major attention to malware authors embedding nuclear/biological text to trigger LLM refusals and evade AI malware analysis—a concrete example of attackers exploiting safety behavior.

Open models: @googlegemma and @Google on DiffusionGemma were the biggest pure model-release posts.

Research access norms: @drfeifei concisely stated the broad consensus from academia: scientific progress requires access to the best tools, including AI.

Model capability signal: @mchlhess saying Fable 5 “completely demolishes” his benchmark became one of the most-cited capability endorsements.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

Open-Weight Model Drops: North Mini Code and DiffusionGemma

この記事をシェア

Latent Space★42026年6月19日 14:53

[AINews] GLM は GPT より優れているか？GLM-5.2 が実用性を証明、Z.ai が 12 月までに「Open Fable」を公開予定

Latent Space のニュースでは、中国のモデル「GLM-5.2」がベンチマークで優れた結果を示し実用性があると評価されたことと、Z.ai が 12 月までにオープンソースプロジェクト「Open Fable」を発表する見込みについて報じられています。

Latent Space2026年6月20日 17:06

[AINews] 今日特に大きな出来事はありませんでした

Latent Space は、GLM 5.2 が依然として注目されていると指摘しつつ、AIE WF 2026 の通常チケットが月曜日に完売すると発表しました。同サイト購読者向けに限定割引を提供し、参加者には Warp や Datadog などからのスポンサークレジットも付与されます。

TechCrunch AI★42026年6月20日 01:01

米国がアンソロピックの「Fable 5」発売を禁止、しかし市場は動じず

米国政府は国家安全保障上の懸念から、アマゾンの研究者らがガードレール回避手法を発見したとして、アンソロピックに対し最新モデル「Fable 5」と「Mythos 5」の販売差し止めを命じた。サイバーセキュリティ研究者らはこの措置が危険だとする公開書簡に署名し、同社も他モデルでも同様の抜け道が存在すると指摘している。

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む