Smol AI News · June 25, 2026 14:44 · about an 18-minute read

Nothing major happened today

#LLM #OpenSource #CodingAgents #GLM-5.2 #Ornith #LowLatencyInference

TL;DR

Z.ai's GLM-5.2 posted record scores on coding and agent benchmarks, and Ornith and Liquid AI announced new open-source models, marking significant progress for the developer ecosystem.

AI Deep Analysis · June 26, 2026 11:02

Importance: / 5-point scale

Depth: 40%

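Key points

GLM-5.2's dramatic performance gains

Z.ai's GLM-5.2 Max scored higher than Opus 4.8 on Code Arena: Frontend and also posted top-tier results on an agent reliability benchmark.

Ornith-1.0: multiple sizes under the MIT license

A family of MIT-licensed, coding-specialized models built on Gemma 4 and Qwen3.5 was released. It uses self-improving RL that also optimizes task scaffolds.

Liquid AI's ultra-small model and infrastructure support

LFM2.5-230M, aimed at low-latency tool use in robotics and e-commerce, was released. vLLM added day-0 support, and SGLang added support as well.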


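Impact analysis

This issue offers strong evidence that open-source models can outperform major closed models in specific task domains, and it widens developers' options for high-quality local inference and custom agent building. In particular, large MIT-licensed models and practical ultra-small models are likely to shape AI adoption strategies at organizations that prioritize cost efficiency and flexibility.

Editor's comment

Despite the "nothing happened" title, open-source model performance and infrastructure support are advancing at a remarkable pace. The GLM-5.2 benchmark results in particular are likely to be a turning point that shapes the developer community's direction over the next several months.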

A quiet day.

AI News for June 24-25, 2026. We checked 12 subreddits, 544 Twitters, and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in or out of email frequencies.

AI Twitter Recap

Open Models, Coding Benchmarks, and the GLM/Ornith/Liquid Wave

GLM-5.2's rapid ascent in coding and agent benchmarks: Multiple posts converged on Z.ai's GLM-5.2 as the day's most important open-model story. On frontend coding, Arena reported that GLM-5.2 Max reached 1595 on Code Arena: Frontend, surpassing Opus 4.8 and narrowing the gap to Claude Fable 5. On agentic reliability, PostTrainBench reported 34.29% for GLM 5.2 Max reasoning, 0.21 points ahead of Opus 4.8 Max (34.08%), with zero failures across 84 runs. Speed also moved: @Yuchenj_UW said Databricks pushed GLM-5.2 to 392 tok/s on Artificial Analysis, up from 201 tok/s on H200 GPUs before further gains on B300 GPUs. The post attributed the result to both hardware and optimizations such as speculative decoding and kernel work.
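To make the reported margins concrete, the short sketch below recomputes the PostTrainBench gap and the throughput ratio from the figures quoted above. The helper names are illustrative only, and the numbers are taken as reported rather than independently verified.

```python
# Figures are the ones quoted in the digest; helper names are illustrative.

def point_gap(a_pct: float, b_pct: float) -> float:
    """Absolute gap in percentage points between two benchmark scores."""
    return round(a_pct - b_pct, 2)


def speedup(new_tps: float, old_tps: float) -> float:
    """Ratio of the new tokens-per-second figure to the old one."""
    return new_tps / old_tps


gap = point_gap(34.29, 34.08)   # GLM 5.2 Max vs. Opus 4.8 Max
ratio = speedup(392.0, 201.0)   # reported tok/s after vs. before
print(f"gap = {gap} points, speedup = {ratio:.2f}x")
```

A gap of about a fifth of a point across 84 runs is small, so the throughput change (roughly a doubling) is the larger of the two effects.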

New coding-specialized open weights: Ornith-1.0 launched as a family of MIT-licensed agentic coding models spanning 9B dense, 31B dense, 35B MoE (Mixture of Experts), and 397B MoE, post-trained on top of Gemma 4 and Qwen3.5. Reported scores include Terminal-Bench 2.1: 77.5, SWE-Bench Verified: 82.4, SWE-Bench Pro: 62.2, and ClawEval: 77.1. The notable training claim is a self-improving RL (reinforcement learning) setup that optimizes not just the solution rollouts but also the task-specific scaffolds that drive them. Meanwhile, Liquid AI shipped LFM2.5-230M, an ultra-small model aimed at low-latency tool use in robotics and e-commerce. vLLM added day-0 support, SGLang added support, and WebGPU work pushed it to about 1400 tok/s locally.

Agents in Production: Computer Use, Long-Horizon Infrastructure, and Internal Adoption

Google pushes computer use into Gemini 3.5 Flash: Google made computer use a first-class built-in capability of Gemini 3.5 Flash across browser, desktop, and mobile. The main launch posts came from @Google, @GoogleDeepMind, and @googledevs. The safety controls highlighted include explicit user confirmation for sensitive actions and automated task stopping. For developers, @_philschmid shared a quickstart showing control of an Android phone via adb, with the same pattern extensible to iOS. This is a meaningful product shift: not just model APIs, but a standardized action interface with human-in-the-loop affordances.
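As a rough illustration of the pattern described above (device actions sent over adb, with user confirmation before sensitive ones), here is a minimal sketch. It is not the quickstart shared by @_philschmid and does not use any Gemini API. The `SENSITIVE` policy and the `run_action` helper are hypothetical, and the only external command assumed is the standard `adb shell input` tool.

```python
import subprocess
from typing import Callable, List, Optional

# Hypothetical policy: input actions that require explicit user confirmation.
SENSITIVE = {"tap", "text"}


def adb_input_command(action: str, *args: str) -> List[str]:
    """Build an `adb shell input <action> <args...>` command line."""
    return ["adb", "shell", "input", action, *args]


def run_action(
    action: str,
    *args: str,
    confirm: Callable[[str], str] = input,
    execute: Callable[..., object] = subprocess.run,
) -> Optional[List[str]]:
    """Run one input action, asking the user first if it is sensitive.

    Returns the executed command, or None if the user declined.
    """
    cmd = adb_input_command(action, *args)
    if action in SENSITIVE:
        answer = confirm(f"Run {' '.join(cmd)}? [y/N] ")
        if answer.strip().lower() != "y":
            return None  # human-in-the-loop veto: nothing is sent to the device
    execute(cmd, check=True)
    return cmd
```

With a device attached, `run_action("tap", "540", "1200")` would prompt before tapping, while `run_action("keyevent", "KEYCODE_HOME")` would run immediately under this assumed policy. Passing `confirm` and `execute` as parameters keeps the confirmation logic testable without a phone.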

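Agent infrastructure is getting more opinionated about persistence and cost: Several startups and products are optimizing specifically for long-running agents rather than for interactive chat latency. Sail launched with $80M raised to provide low-cost inference and sandboxes for agents that run for days or weeks, claiming "10x more intelligence per dollar" for patient workloads. Hyperagent was highlighted for giving each agent its own cloud machine with persistent browser and code execution. LangChain's Fleet framing drew a useful distinction: use general-purpose chat when the work ends with an answer, and use specialized agents when the work has a repeatable shape and durable context.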

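OpenAI's internal use of Codex (note: OpenAI's code generation model) is becoming a leading indicator: OpenAI said agents are changing work "in every department," with Codex used for longer-running, more cross-functional tasks. External commentary from @gdb, @reach_vb, and @eliebakouch emphasized growth in internal token consumption, especially by research teams, and patterns such as skills (note: specific capabilities an agent has) and concurrent agents. The practical takeaway is less "agents are magical" and more that real adoption is emerging where organizations can support review loops (note: cycles of verification and improvement), tooling, and persistent workflows.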

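Evaluation, Reward Hacking, and Synthetic Data as a Frontier Lever

Public benchmarks are increasingly compromised: Cursor's research post argued that recent models, including Opus 4.8 and Composer 2.5, can hack public benchmarks by retrieving solutions from the internet or from git history, and that scores drop sharply under a stricter harness. This aligns with ProgramBench's push toward no-internet settings as a future default for coding evals. The broader theme is that eval environment design is now a first-order variable, not a matter of benchmarking hygiene.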

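Autodata / agentic synthetic data generation is gaining traction: Meta's Autodata paper thread by @jaseweston was one of the more substantive research items. The proposal is to treat data generation as a data-scientist agent loop with creation, analysis, and meta-optimization, converting extra inference compute into better training and evaluation data. Reported gains span computer science, legal, and math tasks, and the meta-optimized harness improved the creation pass rate from 62.1% to 79.6%. Independent amplification came from @iScienceLuvr and @omarsar0. This is one of the clearest examples in the digest of "autoresearch" moving from slogan to concrete loop design.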
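The create, analyze, and meta-optimize loop can be pictured with a toy sketch. This is not Meta's Autodata method: the creator, the checker, and the notion of a "config" below are hypothetical stand-ins chosen only to show the loop's shape, where a measured pass rate is used to pick a better creation setup.

```python
from typing import Callable, Iterable, List, Tuple


def pass_rate(samples: List[str], passes: Callable[[str], bool]) -> float:
    """Analysis step: fraction of created samples that pass the check."""
    return sum(1 for s in samples if passes(s)) / len(samples)


def meta_optimize(
    configs: Iterable[int],
    create: Callable[[int], List[str]],
    passes: Callable[[str], bool],
) -> Tuple[float, int]:
    """Meta-optimization step: try each creation config, keep the best."""
    return max((pass_rate(create(c), passes), c) for c in configs)


# Toy stand-ins (hypothetical): a config is a minimum word count, the creator
# pads seed questions to that length, and the checker wants 4+ word questions.
SEEDS = ["Why?", "What is RL?", "How does a MoE route tokens?"]


def create(min_words: int) -> List[str]:
    """Creation step: produce one sample per seed under the given config."""
    out = []
    for seed in SEEDS:
        words = seed.rstrip("?").split()
        while len(words) < min_words:
            words.append("exactly")
        out.append(" ".join(words) + "?")
    return out


def passes(sample: str) -> bool:
    """Quality check used by the analysis step."""
    return sample.endswith("?") and len(sample.split()) >= 4


best_rate, best_config = meta_optimize([1, 2, 4], create, passes)
print(best_rate, best_config)
```

In a real system each step would be an agent or model call and far more expensive, which is where the extra inference compute goes.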

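Data curation is now also a test-time-compute lever: Datology argued that curation can make models 35x more efficient at answer generation by inducing concision without hurting task performance. @pratyushmaini framed this explicitly as a third axis beyond quality and training efficiency. This is notable because it links pretraining and post-training data choices directly to serving cost and user-perceived latency, not just benchmark quality.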

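Open Ecosystem Economics: Hugging Face, Data Releases, and Agent Toolchains

Hugging Face crossed a major business milestone without abandoning its open positioning: Clement Delangue announced a $100M annual run rate, while saying HF still keeps the platform free and open for 97% of users and manages hundreds of petabytes of models and datasets. For infrastructure and platform watchers, this is one of the clearest proofs that open model distribution, hosting, and community workflows can support a durable business. It also puts downstream adoption stories in context, such as Gemma 4 hitting 200M downloads in 2.5 months.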

Useful open corpora and data plumbing continue to expand: Common Crawl released its June 2026 archive: 2.10B web pages collected from 40.8M hosts, 354 TiB uncompressed, plus updated web graphs. Domain-specific data also landed via Telco-Common-Corpus, a 10B-token, fully open telecom corpus. For embodied and robotics data, Chris Paxton estimated that currently available open datasets may already add up to roughly 10k robot-hours, enough for "basically anyone" to attempt a decent robot foundation model.
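For a sense of scale, the archive figures above imply an average uncompressed page size and an average number of pages per host. The sketch below derives both from the quoted numbers only; the variable names are illustrative.

```python
TIB = 2**40  # bytes in one tebibyte

pages = 2.10e9                   # web pages in the June 2026 archive
hosts = 40.8e6                   # hosts the pages were collected from
uncompressed_bytes = 354 * TIB   # 354 TiB uncompressed

avg_kib_per_page = uncompressed_bytes / pages / 1024
pages_per_host = pages / hosts

print(f"average page size: {avg_kib_per_page:.0f} KiB")
print(f"average pages per host: {pages_per_host:.1f}")
```

These are plain averages, so they say nothing about how skewed the distribution of page sizes or per-host page counts is.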

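Tooling around local and open deployment keeps improving: The day also included Qdrant EDGE + LiteRT for fully on-device RAG (Retrieval-Augmented Generation), Hugging Face's "run your own models locally" stream, GGUF UI support for MTP heads, and developer-facing improvements such as LangChain's deployment cookbook. These are not isolated features. They are all pieces of the same trend toward portable agent stacks and better local inference ergonomics.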

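Policy, Access Control, and the Distillation Fight

Fable 5 was not back; it was likely a UI artifact: What briefly looked like a reappearance of Claude Fable 5 turned into a case study in rumor propagation and access opacity. Speculation came from @kimmonismus, but the corrections from the Anthropic side were explicit: @sammcallister said they were serving exactly 0 traffic to Fable 5, and @TheAmolAvasare said there was no Fable or Mythos traffic and that it was likely just a UI bug or trolling. A later correction post reflected that.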

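The distillation dispute escalated into policy theater: Discussion of Anthropic's claims that Alibaba improperly used millions of Claude exchanges spilled into technical and geopolitical commentary. Andrew Curran posted Dario Amodei's letter, while a number of commenters debated whether the issue is benchmark-leading synthetic post-training, API leakage, intermediary reselling, or political positioning. The most concrete policy signal was The Information's report that the U.S. government asked OpenAI to stagger GPT-5.6 preview access customer by customer, which suggests an emerging de facto review regime for frontier launches.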

Top Tweets (by engagement)

OpenAI internal agent adoption: OpenAI on Codex transforming work across departments.

Hugging Face economics: Clement Delangue on Hugging Face (HF) surpassing $100M in annual recurring revenue (ARR).

Benchmark integrity: Cursor on models hacking public benchmarks.

Open coding models: Ornith-1.0 launch.

Google agent productization: Gemini 3.5 Flash computer use launch.

Multi-agent systems behavior: Thom Wolf on 100+ agents collaborating to optimize Gemma 4 inference speed by 5x.

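AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Specialized Open Model Releases

NVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16, an unusual diffusion-based language model built from the Nemotron 3 Nano 30B-A3B backbone. (Activity: 459): NVIDIA released Nemotron-TwoTower-30B-A3B-Base-BF16, a diffusion-style large language model (LLM) derived from the Nemotron 3 Nano 30B-A3B backbone. The model combines a frozen autoregressive (AR) context tower with a diffusion denoiser tower that fills token blocks in parallel. NVIDIA claims the default mask-diffusion configuration preserves 98.7% of the AR baseline's aggregate benchmark score while achieving 2.42x wall-clock generation throughput. The only technically relevant comment asked whether its quality retention relative to baseline is stronger than DiffusionGemma's; the rest of the top comments were jokes or off-topic model requests.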

One commenter noted that Nemotron-TwoTower-30B-A3B-Base-BF16 appears to retain more accuracy relative to its original Nemotron backbone than DiffusionGemma does relative to its base model, though the thread did not provide concrete benchmark names or numeric scores.
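One rough way to see why filling token blocks in parallel can raise throughput is to count sequential forward passes. The sketch below compares token-by-token decoding with block-wise denoising. The block size and number of denoising steps are assumed values for illustration, not NVIDIA's configuration, and pass counts are not the same as wall-clock throughput because the passes differ in cost.

```python
import math


def ar_passes(n_tokens: int) -> int:
    """Autoregressive decoding: one sequential forward pass per token."""
    return n_tokens


def block_diffusion_passes(n_tokens: int, block_size: int, denoise_steps: int) -> int:
    """Block-wise denoising: each block takes a fixed number of passes,
    and every pass updates all tokens in the block at once."""
    return math.ceil(n_tokens / block_size) * denoise_steps


n = 1024
ar = ar_passes(n)
bd = block_diffusion_passes(n, block_size=32, denoise_steps=8)  # assumed values
print(f"AR passes: {ar}, block-diffusion passes: {bd}, ratio: {ar / bd:.1f}x")
```

Under these assumed settings the pass count drops 4x, while the reported end-to-end gain is 2.42x, which is consistent with each denoising pass costing more than a single AR step.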

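Qwen-AgentWorld-35B-A3B: a 3B-active MoE trained to simulate MCP, terminal, SWE, Android, web, and OS environments (Activity: 315): Qwen released Qwen-AgentWorld-35B-A3B, a sparse MoE with 35B total parameters and about 3B active parameters per token, positioned as a language world model rather than a chat or instruction-following agent. It is trained to simulate environment responses in agent loops by predicting the next observation or state after an action, across MCP/tool calling, search, terminal, SWE, Android, web, and OS-GUI interaction domains. This could enable offline agent training and evaluation, synthetic trajectory generation, and mocked tool workflows. The only substantive technical comment highlighted its possible use in evals by mocking action outputs, for example predicting the terminal output of ls -la. Other top comments were mostly jokes or skepticism about whether the dataset simply swapped the user and assistant roles, or whether the model was just prompted with "You are an MCP server now."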

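One commenter interprets the model as learning environment transition dynamics: given a user or tool command such as ls -la, it predicts the corresponding terminal output. They suggest this could be useful not only for agent training but also for mocking tool and environment actions in evaluations, potentially reducing the need to execute real sandboxed actions.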

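Another technical reading is that Qwen-AgentWorld-35B-A3B may have been trained on simulated "world" traces (MCP, terminal, SWE, Android, web, and OS interactions) and then evaluated for downstream improvements in agent performance. The commenter argues that if this interpretation is correct, the model is better viewed as an improved agentic model than as a mere simulator, and asks for empirical checks from people running agent benchmarks.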
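The idea of mocking environment actions in evals can be sketched as an environment whose observations come from a predictor instead of a real shell. The interface below is hypothetical and is not Qwen's API: `canned_predictor` is a lookup-table stand-in for where a world-model call would go.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interface: (history of (action, observation), action) -> observation.
Predictor = Callable[[List[Tuple[str, str]], str], str]


@dataclass
class MockedTerminal:
    """Environment whose observations are predicted, not executed."""

    predict: Predictor
    history: List[Tuple[str, str]] = field(default_factory=list)

    def step(self, action: str) -> str:
        """Ask the predictor for the next observation and record the turn."""
        observation = self.predict(self.history, action)
        self.history.append((action, observation))
        return observation


def canned_predictor(history: List[Tuple[str, str]], action: str) -> str:
    """Stand-in for a world model: a lookup table instead of an LLM call."""
    table = {"ls -la": "total 0\ndrwxr-xr-x  2 user user 64 Jan  1 00:00 ."}
    return table.get(action, f"sh: command not found: {action.split()[0]}")


env = MockedTerminal(predict=canned_predictor)
first = env.step("ls -la")
second = env.step("frobnicate --now")
print(first)
print(second)
```

Because the agent under test only sees `step`, the predictor can be swapped for a real sandbox later, which is also how one would check whether predicted observations diverge from real ones.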

Unlimited-OCR is now on ModelScope! A 3.3B-parameter multilingual OCR model for one-shot parsing across single images, multi-page documents, and PDFs. License: MIT (Activity: 1123): Baidu's Unlimited-OCR was announced on ModelScope as an MIT-licensed, 3.3B-parameter multilingual OCR and document-parsing model intended for one-shot full-document parsing across single images, multi-page documents, and PDFs. It supports up to 32K output tokens for long OCR sequences. The project advertises "base" and "gundam" image modes, plus Transformers inference and SGLang serving with OpenAI-compatible streaming APIs. The code is on GitHub and the announcement is on X. Commenters mainly asked for missing technical comparisons and details: whether the model is related to or missing PaddleOCR, how it performs against PaddleOCR-VL-1.6, how many pages fit within the 32K output limit, and what exactly "gundam mode" means.

Commenters asked for direct benchmarks against PaddleOCR-VL-1.6, specifically how Unlimited-OCR compares in OCR quality and performance, and how many document pages can realistically fit into the model's 32k context window for multi-page and PDF parsing.

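A technical ambiguity was raised around the term "gundam mode." Multiple users asked what it means, suggesting the release materials may contain unclear terminology or an undocumented inference or parsing mode.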

One commenter linked the model card on Hugging Face, baidu/Unlimited-OCR, while another noted "missing paddle?" alongside an image, possibly pointing to an inconsistency or a missing reference or dependency related to PaddleOCR.
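The thread did not answer how many pages fit in the 32K output limit, but the question reduces to a simple budget calculation once a tokens-per-page figure is assumed. In the sketch below, reading "32K" as 32 × 1024 tokens and the per-page token counts are both assumptions for illustration, not measurements of Unlimited-OCR.

```python
OUTPUT_BUDGET = 32 * 1024  # assumption: "32K" read as 32,768 output tokens


def pages_within_budget(tokens_per_page: int, budget: int = OUTPUT_BUDGET) -> int:
    """Whole pages whose transcription fits in the output budget."""
    if tokens_per_page <= 0:
        raise ValueError("tokens_per_page must be positive")
    return budget // tokens_per_page


# Hypothetical densities: a sparse page vs. a dense, table-heavy page.
for tokens_per_page in (800, 1500):
    print(tokens_per_page, "tokens/page ->", pages_within_budget(tokens_per_page), "pages")
```

The real answer depends on the tokenizer, the language, and the layout markup the model emits, so measured token counts from sample documents would need to replace the assumed values.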

Ornith-1.0 released on Hugging Face (Activity: 391): DeepReinforce-AI released the

Original text

a quiet day.

AI News for 6/24/2026-6/25/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Open Models, Coding Benchmarks, and the GLM/Ornith/Liquid Wave

GLM-5.2’s rapid ascent in coding and agent benchmarks: Multiple posts converged on Z.ai’s GLM-5.2 as the day’s most important open-model story. On frontend coding, Arena reported that GLM-5.2 Max reached 1595 on Code Arena: Frontend, surpassing Opus 4.8 and narrowing the gap to Claude Fable 5. On agentic reliability, PostTrainBench noted 34.29% for GLM 5.2 Max reasoning, narrowly ahead of Opus 4.8 Max at 34.08%, with zero failed runs across 84 runs. The speed side also moved: @Yuchenj_UW said Databricks pushed GLM-5.2 to 392 tok/s on Artificial Analysis, up from 201 tok/s on H200s before further gains on B300s, attributing results to both hardware and optimizations such as speculative decoding and kernels.

New coding-specialized open weights: Ornith-1.0 launched as a family of MIT-licensed agentic coding models spanning 9B dense, 31B dense, 35B MoE, and 397B MoE, post-trained on top of Gemma 4 and Qwen3.5. Reported scores include Terminal-Bench 2.1: 77.5, SWE-Bench Verified: 82.4, SWE-Bench Pro: 62.2, and ClawEval: 77.1. The notable training claim is a self-improving RL setup that optimizes not just solution rollouts but the task-specific scaffolds driving those rollouts. Meanwhile, Liquid AI shipped LFM2.5-230M, an ultra-small model aimed at low-latency tool use in robotics/e-commerce; vLLM added day-0 support, SGLang added support, and WebGPU work pushed it to ~1400 tok/s locally.

Agents in Production: Computer Use, Long-Horizon Infrastructure, and Internal Adoption

Google pushes computer use into Gemini 3.5 Flash: Google made computer use a first-class built-in capability in Gemini 3.5 Flash across browser, desktop, and mobile. The main launch posts came from @Google, @GoogleDeepMind, and @googledevs. Safety controls highlighted include explicit user confirmation for sensitive actions and automated task stopping. For developers, @_philschmid shared a quickstart showing Android-phone control via adb, with the same pattern extensible to iOS. This is a meaningful product shift: not just model APIs, but a standardized action interface with human-in-the-loop affordances.

Agent infra is getting more opinionated around persistence and cost: Several startups/products are optimizing specifically for long-running agents rather than interactive chat latency. Sail launched with $80M raised to provide low-cost inference and sandboxes for agents that run days or weeks, claiming “10x more intelligence per dollar” for patient workloads. Hyperagent was highlighted as giving each agent its own cloud machine with persistent browser/code execution. LangChain’s Fleet framing drew a useful distinction: use general-purpose chat when work ends with an answer; use specialized agents when the work has a repeatable shape and durable context.

OpenAI’s internal Codex usage is becoming a leading indicator: OpenAI said agents are changing work “in every department,” with Codex used for longer-running, more cross-functional tasks. External commentary from @gdb, @reach_vb, and @eliebakouch emphasized growth in internal token consumption—especially by research teams—and patterns like skills and concurrent agents. The practical takeaway is less “agents are magical” and more that real adoption is emerging where organizations can support review loops, tooling, and persistent workflows.

Evaluation, Reward Hacking, and Synthetic Data as a Frontier Lever

Public benchmarks are increasingly compromised: Cursor’s research post argued that recent models, including Opus 4.8 and Composer 2.5, can hack public benchmarks by retrieving solutions from the internet or git history; scores drop sharply under a stricter harness. This aligns with ProgramBench’s push toward no-internet settings as a future default for coding evals. The broader theme: eval environment design is now a first-order variable, not benchmarking hygiene.

Autodata / agentic synthetic data generation is gaining traction: Meta’s Autodata paper thread by @jaseweston was one of the more substantive research items. The proposal is to treat data generation as a data scientist agent loop with creation, analysis, and meta-optimization, converting extra inference compute into better train/eval data. Reported gains span computer science, legal, and math tasks, and the meta-optimized harness improved creation pass rate from 62.1% to 79.6%. Independent amplification came from @iScienceLuvr and @omarsar0. This is one of the clearest examples in the digest of “autoresearch” moving from slogan to concrete loop design.

Data curation is now also a test-time-compute lever: Datology argued that curation can make models 35x more efficient at answer generation by inducing concision without hurting task performance; @pratyushmaini framed this explicitly as a third axis beyond quality and training efficiency. This is notable because it links pretraining/posttraining data choices directly to serving cost and user-perceived latency, not just benchmark quality.

Open Ecosystem Economics: Hugging Face, Data Releases, and Agent Toolchains

Hugging Face crossed a major business milestone without abandoning its open positioning: Clement Delangue announced $100M annual run-rate, while saying HF still keeps the platform free/open for 97% of users and manages hundreds of petabytes of models and datasets. For infra/platform watchers, this is one of the clearest proofs that open model distribution, hosting, and community workflows can support a durable business. It also contextualizes downstream adoption stories like Gemma 4 hitting 200M downloads in 2.5 months.

Useful open corpora and data plumbing continue to expand: Common Crawl released its June 2026 archive: 2.10B web pages, 354 TiB uncompressed, from 40.8M hosts, plus updated web graphs. Domain-specific data also landed via Telco-Common-Corpus, a 10B-token, fully open telecom corpus. For embodied/robotics data, Chris Paxton estimated that currently available open datasets may already sum to roughly 10k robot-hours, enough for “basically anyone” to attempt a decent robot foundation model.

Tooling around local/open deployment keeps improving: The day also included Qdrant EDGE + LiteRT for fully on-device RAG, Hugging Face’s “run your own models locally” stream, GGUF UI support for MTP heads, and developer-facing improvements like LangChain’s deployment cookbook. These aren’t isolated features; they’re all pieces of the same trend toward portable agent stacks and local inference ergonomics.

Policy, Access Control, and the Distillation Fight

Fable 5 was not back; it was likely a UI artifact: What briefly looked like a reappearance of Claude Fable 5 turned into a case study in rumor propagation and access opacity. Speculation came from @kimmonismus, but Anthropic-side corrections were explicit: @sammcallister said they were serving exactly 0 traffic to Fable 5, and @TheAmolAvasare said there was no Fable/Mythos traffic, likely just a UI bug or trolling. A later correction post reflected that.

The distillation dispute escalated into policy theater: Discussion around Anthropic’s claims about millions of Claude exchanges allegedly used by Alibaba spilled into technical and geopolitical commentary. Andrew Curran posted Dario Amodei’s letter, while a number of commenters debated whether the issue is benchmark-leading synthetic posttraining, API leakage, intermediary reselling, or political positioning. The most concrete policy-development signal was that The Information reported the U.S. government asked OpenAI to stagger GPT-5.6 preview access customer-by-customer, suggesting an emerging de facto review regime for frontier launches.

Top Tweets (by engagement)

OpenAI internal agent adoption: OpenAI on Codex transforming work across departments.

Hugging Face economics: Clement Delangue on HF surpassing $100M ARR.

Benchmark integrity: Cursor on models hacking public benchmarks.

Open coding models: Ornith-1.0 launch.

Google agent productization: Gemini 3.5 Flash computer use launch.

Multi-agent systems behavior: Thom Wolf on 100+ agents collaborating to optimize Gemma 4 inference speed 5x.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Specialized Open Model Releases

NVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16, an unusual diffusion-based language model built from the Nemotron 3 Nano 30B-A3B backbone. (Activity: 459): NVIDIA released Nemotron-TwoTower-30B-A3B-Base-BF16, a diffusion-style LLM derived from the Nemotron 3 Nano 30B-A3B backbone. The model combines a frozen autoregressive context tower with a diffusion denoiser tower that fills token blocks in parallel; NVIDIA claims the default mask-diffusion configuration preserves 98.7% of the AR baseline’s aggregate benchmark score while achieving 2.42× wall-clock generation throughput. The only technically relevant comment questioned whether its quality-retention vs. baseline is stronger than DiffusionGemma; the rest of the top comments were jokes or off-topic model requests.

A commenter noted that Nemotron-TwoTower-30B-A3B-Base-BF16 appears to retain more accuracy relative to its original Nemotron backbone than DiffusionGemma does relative to its base model, though the thread did not provide concrete benchmark names or numeric scores.

Qwen-AgentWorld-35B-A3B: a 3B-active MoE trained to simulate MCP, terminal, SWE, Android, web and OS environments (Activity: 315): Qwen released Qwen-AgentWorld-35B-A3B, a sparse MoE with 35B total parameters and ~3B active parameters/token, positioned as a language world model rather than a chat/instruction agent. It is trained to simulate environment responses for agent loops—predicting the next observation/state after actions across MCP/tool calling, search, terminal, SWE, Android, web, and OS-GUI interaction domains—potentially enabling offline agent training/evaluation, synthetic trajectories, and mocked tool workflows. The only substantive technical comment highlighted its possible use for evals by mocking action outputs, e.g. predicting terminal output for ls -la. Other top comments were mostly jokes/skepticism about whether the dataset simply swapped user/assistant roles or prompted the model as “You are an MCP server now.”

One commenter interprets the model as learning environment transition dynamics: given a user/tool command like ls -la, it predicts the corresponding terminal output. They suggest this could be useful not only for agent training but also for mocking tool/environment actions in evaluations, potentially reducing the need to execute real sandboxed actions.

Another technical reading is that Qwen-AgentWorld-35B-A3B may have been trained on simulated “world” traces—MCP, terminal, SWE, Android, web, and OS interactions—and then evaluated for downstream agent performance improvements. The commenter argues that if this interpretation is correct, the model is better viewed as an improved agentic model rather than merely a simulator, and asks for empirical checks from people running agent benchmarks.

Unlimited-OCR is now on ModelScope! A 3.3B multilingual OCR model for one-shot parsing across single images, multi-page documents, and PDFs. License: MIT (Activity: 1123): Baidu’s Unlimited-OCR is announced on ModelScope as an MIT-licensed 3.3B multilingual OCR/document-parsing model intended for one-shot full-document parsing across single images, multi-page documents, and PDFs, with up to 32K output tokens for long OCR sequences. The project advertises base and “gundam” image modes, plus Transformers inference and SGLang serving with OpenAI-compatible streaming APIs; code is on GitHub and the announcement is on X. Commenters mainly asked for missing technical comparisons/details: whether this is related to or missing PaddleOCR, how it performs against PaddleOCR-VL-1.6, how many pages fit within the 32K output limit, and what exactly “gundam mode” means.

Commenters asked for direct benchmarking against PaddleOCR-VL-1.6, specifically how Unlimited-OCR compares in OCR quality/performance and how many document pages can realistically fit into the model’s 32k context window for multi-page/PDF parsing.

A technical ambiguity was raised around the model/docs mentioning “gundam mode”—multiple users asked what it means, suggesting the release materials may contain unclear terminology or an undocumented inference/parsing mode.

One commenter linked the model card on Hugging Face: baidu/Unlimited-OCR, while another noted “missing paddle?” alongside an image, possibly pointing to an inconsistency or missing reference/dependency related to PaddleOCR.

Ornith-1.0 released on Hugging Face (Activity: 391): DeepReinforce-AI released the


Smol AI News·2026年6月25日 14:44·約18分で読める

今日は何も大きな出来事はありませんでした

#LLM #オープンソース #コーディングエージェント #GLM-5.2 #Ornith #低遅延推論

TL;DR

AI深層分析2026年6月26日 11:02

重要/ 5段階

深度40%

キーポイント

GLM-5.2 の圧倒的なパフォーマンス向上

Z.ai の GLM-5.2 Max が Code Arena Frontend で Opus 4.8 を上回るスコアを記録し、エージェント信頼性ベンチでもトップクラスの結果を出した。

Ornith-1.0 の多サイズ・MIT ライセンス展開

Liquid AI の超小型モデルとインフラ対応

影響分析・編集コメントを表示

影響分析

編集コメント

静かな一日。

AI ツイートリキャップ

オープンモデル、コーディングベンチマーク、および GLM/Ornith/Liquid Wave**

コーディングおよびエージェントベンチマークにおける GLM-5.2 の急上昇：複数の投稿で、Z.ai の GLM-5.2 が当日の最も重要なオープンモデルの話題として一致しました。フロントエンドコーディングにおいては、Arena によると GLM-5.2 Max は Code Arena: Frontend で 1595 を達成し、Opus 4.8 を上回り、Claude Fable 5 との差を縮めました。エージェントの信頼性については、PostTrainBench が GLM 5.2 Max の推論で 34.29% を記録し、0.21 ポイント差で Opus 4.8 Max（34.08%）をわずかに上回ったと報告しました。また、84回の試行において失敗はゼロでした。速度面でも進展があり、@Yuchenj_UW は Databricks が Artificial Analysis で GLM-5.2 を 1 秒あたり 392 トークン（tok/s）に引き上げたことを明かしました。これは H200 グラフィックボード上での 201 tok/s からの上昇であり、さらに B300 グラフィックボード上でさらなる向上が見込まれています。この結果はハードウェアの進化と、推測的デコーディング（speculative decoding）やカーネル最適化などの技術によるものとしています。

新しいコード特化型オープンウェイト：Ornith-1.0 が、9B 密度、31B 密度、35B MoE（Mixture of Experts）、そして 397B MoE を含む MIT ライセンスのエージェント型コーディングモデルファミリーとしてリリースされました。これは Gemma 4 と Qwen3.5 の上にポストトレーニングされたものです。報告されているスコアには、Terminal-Bench 2.1 で 77.5、SWE-Bench Verified で 82.4、SWE-Bench Pro で 62.2、ClawEval で 77.1 が含まれます。注目すべき訓練上の主張は、ロールアウトそのものだけでなく、それらを駆動するタスク固有の足場（scaffolds）も最適化する自己改善型 RL（強化学習）セットアップです。一方、Liquid AI は LFM2.5-230M を出荷しました。これはロボット工学や e コマースにおける低遅延ツール使用を目的とした超小型モデルです。vLLM が day-0 サポートを追加し、SGLang もサポートを追加し、WebGPU による取り組みによりローカル環境で約 1400 トークン/秒に達しました。

プロダクションにおけるエージェント：コンピューター使用、長期ホライズンのインフラストラクチャ、および内部採用

Google はコンピューター使用機能を Gemini 3.5 Flash に組み込みました。Google は、ブラウザ、デスクトップ、モバイルのすべてのプラットフォームで、コンピューター使用機能を Gemini 3.5 Flash の第一級ビルトイン機能として位置づけました。主な発表ポストは @Google、@GoogleDeepMind、@googledevs から発信されました。強調されたセキュリティ制御には、機密操作に対する明示的なユーザー確認と、タスクの自動停止が含まれます。開発者向けには、@_philschmid が adb を介した Android 電話の制御を示すクイックスタートを共有しており、このパターンは iOS にも拡張可能です。これは意味のある製品シフトです。単なるモデル API ではなく、人間が関与する機能（human-in-the-loop affordances）を備えた標準化されたアクションインターフェースへの転換です。

エージェント基盤は、永続性とコストの観点からより明確な方向性を示し始めています：いくつかのスタートアップや製品は、対話型チャットの遅延ではなく、数日または数週間実行される長期稼働型のエージェントに特化した最適化を行っています。Sail は、数日または数週間稼働するエージェント向けに低コストの推論とサンドボックスを提供するために 8000 万ドルを調達し、「患者性のあるワークロードでは 1 ドルあたりの知能が 10 倍向上する」と主張しています。Hyperagent は、各エージェントに専用のクラウドマシン（永続的なブラウザ/コード実行環境）を与える点で注目されました。LangChain の Fleet という枠組みは、有用な区別を示しました：作業が回答で終わる場合は汎用チャットを使用し、反復可能な形状と持続的な文脈を持つ作業の場合は専門化されたエージェントを使用する、というものです。

OpenAI 内部での Codex（注：OpenAI のコード生成モデル）の利用状況は、先行指標となりつつあります。OpenAI は、エージェントが「すべての部署で働き方を変えている」と述べており、Codex はより長期にわたる、かつ多部門横断的なタスクに使用されています。@gdb、@reach_vb、@eliebakouch による外部のコメントは、特に研究チームにおける内部トークン消費量の増加や、スキル（注：エージェントが持つ特定の機能）や並列実行されるエージェントといったパターンを強調しました。実用的な教訓として重要なのは、「エージェントは魔法のようなもの」というよりも、組織がレビューループ（注：検証と改善のサイクル）、ツール、および永続的なワークフローをサポートできる場合にのみ、実際の採用が進んでいるという点です。

評価、報酬ハッキング、そして合成データが最前線のレバーとして

パブリックベンチマークは次第に侵害されつつあります：Cursor の研究投稿では、Opus 4.8 や Composer 2.5 を含む最近のモデルが、インターネットや git の履歴から解答を取得することでパブリックベンチマークをハッキングできると指摘しており、より厳格な評価環境（harness）下ではスコアが急激に低下します。これは、コーディング評価における将来のデフォルトとして「インターネット接続なし」設定へと移行しようとする ProgramBench の動きと一致しています。より広いテーマとしては、評価環境の設計はもはやベンチマークの衛生管理ではなく、一次変数（first-order variable）となったということです。

Autodata / エージェント型合成データ生成が注目されています：Meta の Autodata 論文スレッド（@jaseweston による投稿）は、最も実質的な研究項目の一つでした。提案されているのは、データ生成をデータサイエンティストのエージェントループとして扱い、作成・分析・メタ最適化を行うことで、追加の推論計算リソースをより良質な学習・評価データに変換するというものです。報告された効果はコンピュータサイエンス、法律、数学のタスクにまたがり、メタ最適化された評価環境により、作成のパス率は 62.1% から 79.6% に向上しました。このアプローチへの独立した増幅効果も、@iScienceLuvr や @omarsar0 によって示されました。これは要約において「自己研究（autoresearch）」がスローガンから具体的なループ設計へと移行したことを示す最も明確な例の一つです。

データキュレーションは現在、テスト時の計算リソース活用手段としても機能しています。Datology は、タスクのパフォーマンスを損なうことなく簡潔性を誘導することで、モデルの回答生成効率を 35 倍に高められると主張しました。@pratyushmaini はこれを、品質とトレーニング効率という二つの軸に加えた「第三の軸」として明確に位置づけました。これは注目すべき点です。なぜなら、事前学習・事後学習におけるデータ選択が、単なるベンチマークの質だけでなく、運用コストやユーザーが体感するレイテンシに直接結びつくことを示しているからです。

オープンエコシステム経済：Hugging Face、データ公開、およびエージェントツールチェーン

Hugging Face は、オープンな立場を放棄することなく主要なビジネスマイルストーンを達成しました。Clement Delangue は年間 100M ドルのランニングレートを発表しつつも、プラットフォームの 97% のユーザーに対して無料でオープンなまま維持し、数百ペタバイト規模のモデルとデータセットを管理していると述べました。インフラやプラットフォームに注力する観察者にとって、これはオープンなモデル配布、ホスティング、コミュニティワークフローが持続可能なビジネスを支えうることを示す最も明確な証拠の一つです。また、これは Gemma 4 が 2.5 ヶ月でダウンロード数 2 億回を突破したような、下流での採用事例の文脈も提供しています。

有用なオープンコーパスとデータパイプラインの拡張は継続中：Common Crawl が 2026 年 6 月のアーカイブをリリースしました。これは 4,080 万ホストから収集された 21 億件のウェブページ、圧縮解除済みで 354TiB に及ぶデータに加え、更新されたウェブグラフを含みます。ドメイン固有のデータも、100 億トークン規模の完全オープンな通信キャリア向けコーパスである Telco-Common-Corpus を通じて入手可能になりました。身体性（embodied）やロボット工学に関するデータについては、Chris Paxton 氏によると、現在利用可能なオープンデータセットを合計すると約 1 万時間のロボットの稼働時間に達しており、「基本的に誰でも」 decent なロボット基盤モデルの構築を試みることが十分可能であると推定されています。

ローカル/オープン環境での展開に関するツール類も着実に改善しています。本日は、端末内完結型の RAG（Retrieval-Augmented Generation：検索拡張生成）を実現する Qdrant EDGE + LiteRT、Hugging Face による「ローカルで独自モデルを実行する」ストリーミング配信、MTP ヘッドに対する GGUF UI サポート、そして LangChain の展開クックブックに代表される開発者向けの改善などが含まれました。これらは孤立した機能ではなく、ポータブルなエージェントスタックとローカル推論の使いやすさという同じトレンドを構成する断片です。

ポリシー、アクセス制御、および蒸留（Distillation）をめぐる争い

Fable 5 は復活していませんでした。これはおそらく UI のアーティファクトに過ぎません：Claude Fable 5 の再登場のように一瞬見えた出来事は、噂の伝播とアクセス情報の不透明さを示すケーススタディへと発展しました。憶測は @kimmonismus 氏から発信されましたが、Anthropic 側からの明確な訂正がありました。@sammcallister 氏は Fable 5 へのトラフィックを正確にゼロで提供していると述べ、@TheAmolAvasare 氏は Fable や Mythos に関するトラフィックは存在せず、おそらく UI のバグかいたずらであると指摘しました。後の訂正投稿もこの見解を反映しています。

蒸留をめぐる紛争は政策の劇場へとエスカレートしました：アリババが Claude の数百万回のやり取りを不正に使用したと主張するアンソロピック（Anthropic）に関する議論は、技術的・地政学的な解説へと広がりました。アンドリュー・クララン（Andrew Curran）はダリオ・アモダイ（Dario Amodei）の手紙を投稿し、多くのコメント投稿者がこの問題がベンチマークをリードする合成後学習（posttraining）、API の漏洩、仲介業者による再販売、あるいは政治的なポジション取りのいずれであるかを議論しました。最も具体的な政策開発の兆候は、『ザ・インフォメーション』が米国政府が OpenAI に対し GPT-5.6 プレビューへのアクセスを顧客ごとに段階的に制限するよう要請したと報じたことであり、これは最先端モデルのローンチに対する事実上の審査制度が形成されつつあることを示唆しています。

エンゲージメント上位ツイート

OpenAI 内部エージェントの採用：OpenAI が Codex を活用して部門全体で業務を変革している様子。

Hugging Face の経済性：クレメン・デラング（Clement Delangue）が、Hugging Face（HF）の年間経常収益（ARR）が 1 億ドルを超えたと語る内容。

ベンチマークの整合性：Cursor がモデルによる公的ベンチマークのハッキングについて言及。

オープンソースエージェントモデル：Ornith-1.0 のローンチ。

Google エージェント製品の製品化：Gemini 3.5 Flash のコンピュータ操作機能のローンチ。

マルチエージェントシステムの振る舞い：トム・ウォルフ（Thom Wolf）が、100 以上のエージェントが協力して Gemma 4 の推論速度を 5 倍に最適化していることについて言及。

AI Reddit リキャップ

/r/LocalLlama + /r/localLLM リキャップ

1. 専門特化型オープンモデルのリリース

NVIDIA は、Nemotron 3 Nano 30B-A3B のバックボーンから構築された、珍しい拡散ベースの言語モデル「Nemotron-TwoTower-30B-A3B-Base-BF16」をリリースしました。（アクティビティ数：459）: NVIDIA は、Nemotron 3 Nano 30B-A3B のバックボーンから派生した拡散スタイルの大規模言語モデル（LLM）「Nemotron-TwoTower-30B-A3B-Base-BF16」をリリースしました。このモデルは、固定された自己回帰型コンテキストタワーと、トークンブロックを並列に埋める拡散ノイズ除去タワーを組み合わせたものであり、NVIDIA によれば、デフォルトのマスク拡散設定では、AR ベースラインの集計ベンチマークスコアの 98.7% を維持しつつ、壁時計生成スループットで 2.42 倍を達成しています。技術的に意味のあるコメントは、その品質保持率がベースラインに対して DiffusionGemma よりも優れているかどうかという疑問のみであり、残りの上位コメントはジョークや話題外モデルの要望でした。

Qwen-AgentWorld-35B-A3B：MCP、ターミナル、SWE、Android、Web、OS 環境をシミュレートするように訓練された 3B アクティブな MoE（アクティビティ数：315）：Qwen は Qwen-AgentWorld-35B-A3B をリリースしました。これは総パラメータ数が 350 億、トークンあたり約 30 億のアクティブパラメータを持つスパースな MoE モデルで、チャットや指示実行型エージェントではなく、言語世界モデルとして位置づけられています。このモデルは、MCP/ツール呼び出し、検索、ターミナル、SWE、Android、Web、OS-GUI 操作などのドメインにおいて、アクション後の次の観測値や状態を予測することで、エージェントループにおける環境応答のシミュレーションを目的に訓練されています。これにより、オフラインでのエージェントトレーニング・評価、合成軌道の生成、モックされたツールワークフローの実現が可能になる可能性があります。技術的な観点からの注目すべきコメントは、アクション出力をモックして評価（evals）に活用できる可能性についてのものであり、具体的には「ls -la」コマンドのターミナル出力を予測する例が挙げられました。その他の主要なコメントは、データセットが単にユーザーとアシスタントの役割を入れ替えただけではないか、あるいはモデルに対して「あなたは今から MCP サーバーです」というプロンプトを与えられただけではないかという皮肉や懐疑的な内容がほとんどでした。

もう一つの技術的な読み解き方としては、Qwen-AgentWorld-35B-A3B がシミュレーションされた「世界」のトレース—MCP、ターミナル、SWE、Android、Web、OS 間の相互作用など—を学習データとして用いて訓練され、その後、下流タスクにおけるエージェント性能の向上のために評価された可能性があるという見方があります。このコメント投稿者は、もしこの解釈が正しければ、このモデルは単なるシミュレータではなく、改善されたエージェンティ型モデル（agentic model）と捉えるべきであり、エージェントベンチマークを実行している人々による実証的な検証を求めています。

Unlimited-OCR が now ModelScope に登場しました！これは、単一画像、多ページ文書、PDF 全体にわたるワンショット解析に対応する 3.3B パラメータの多言語 OCR モデルです。ライセンスは MIT です（活動状況：1123）。バaidu の Unlimited-OCR が ModelScope で発表され、MIT ライセンスの下で提供される 3.3B パラメータの多言語 OCR/文書解析モデルとして、単一画像、多ページ文書、PDF を通じたワンショットでの完全文書解析を目的としています。最大 32K トークンの出力トークンをサポートし、長い OCR シーケンスにも対応します。本プロジェクトは、「ベース」モードと「ガンダム（gundam）」画像モードを備え、Transformers による推論および SGLang によるサービス提供、さらに OpenAI と互換性のあるストリーミング API を特徴としています。コードは GitHub に公開されており、発表情報は X で確認できます。コメント欄では主に、技術的な比較や詳細情報の不足について質問が寄せられました。具体的には、本モデルが PaddleOCR と関連があるのか、あるいは欠落しているのか、PaddleOCR-VL-1.6 に対する性能はどうなのか、32K の出力制限内で何ページまで処理可能なのか、そして「ガンダムモード」が具体的に何を指すのかといった点です。

「ガンダムモード」という用語を巡り、技術的な曖昧さが指摘されました。複数のユーザーがその意味を尋ねており、リリース資料に不明確な用語や、文書化されていない推論/パースモードが含まれている可能性を示唆しています。

ある投稿者は Hugging Face 上のモデルカード「baidu/Unlimited-OCR」へのリンクを共有し、別の投稿者は画像と共に「Paddle が欠落？」と指摘しました。これは PaddleOCR に関連する不整合や、参照/依存関係の不足を示唆している可能性があります。

Ornith-1.0 が Hugging Face でリリースされました (アクティビティ: 391): DeepReinforce-AI が

Original text

a quiet day.

AI News for 6/24/2026-6/25/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Open Models, Coding Benchmarks, and the GLM/Ornith/Liquid Wave

GLM-5.2’s rapid ascent in coding and agent benchmarks: Multiple posts converged on Z.ai’s GLM-5.2 as the day’s most important open-model story. On frontend coding, Arena reported that GLM-5.2 Max reached 1595 on Code Arena: Frontend, surpassing Opus 4.8 and narrowing the gap to Claude Fable 5. On agentic reliability, PostTrainBench noted 34.29% for GLM 5.2 Max reasoning, narrowly ahead of Opus 4.8 Max at 34.08%, with zero failed runs across 84 runs. The speed side also moved: @Yuchenj_UW said Databricks pushed GLM-5.2 to 392 tok/s on Artificial Analysis, up from 201 tok/s on H200s before further gains on B300s, attributing results to both hardware and optimizations such as speculative decoding and kernels.
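As a quick sanity check on the figures quoted above, the sketch below computes the implied throughput ratio and the PostTrainBench margin. The 10,000-token response length is a hypothetical value chosen only for illustration, and the timing ignores prefill, batching, and queueing.

```python
# Figures quoted in the recap above.
TOK_S_BEFORE = 201.0   # tok/s reported on H200s
TOK_S_AFTER = 392.0    # tok/s reported on Artificial Analysis
GLM_SCORE = 34.29      # PostTrainBench, GLM 5.2 Max reasoning (%)
OPUS_SCORE = 34.08     # PostTrainBench, Opus 4.8 Max (%)


def speedup(before: float, after: float) -> float:
    """Ratio of the new throughput to the old throughput."""
    return after / before


def generation_seconds(num_tokens: int, tok_per_s: float) -> float:
    """Decode time for num_tokens at a steady throughput (prefill and queueing ignored)."""
    return num_tokens / tok_per_s


if __name__ == "__main__":
    n = 10_000  # hypothetical response length, not from the source
    print(f"speedup: {speedup(TOK_S_BEFORE, TOK_S_AFTER):.2f}x")
    print(f"{n} tokens: {generation_seconds(n, TOK_S_BEFORE):.1f}s "
          f"-> {generation_seconds(n, TOK_S_AFTER):.1f}s")
    print(f"PostTrainBench margin: {GLM_SCORE - OPUS_SCORE:.2f} points")
```

The ratio works out to roughly 1.95x, and the benchmark margin is 0.21 percentage points, so the throughput change is far larger than the reliability gap.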

New coding-specialized open weights: Ornith-1.0 launched as a family of MIT-licensed agentic coding models spanning 9B dense, 31B dense, 35B MoE, and 397B MoE, post-trained on top of Gemma 4 and Qwen3.5. Reported scores include Terminal-Bench 2.1: 77.5, SWE-Bench Verified: 82.4, SWE-Bench Pro: 62.2, and ClawEval: 77.1. The notable training claim is a self-improving RL setup that optimizes not just solution rollouts but the task-specific scaffolds driving those rollouts. Meanwhile, Liquid AI shipped LFM2.5-230M, an ultra-small model aimed at low-latency tool use in robotics/e-commerce; vLLM added day-0 support, SGLang added support, and WebGPU work pushed it to ~1400 tok/s locally.

Agents in Production: Computer Use, Long-Horizon Infrastructure, and Internal Adoption

Google pushes computer use into Gemini 3.5 Flash: Google made computer use a first-class built-in capability in Gemini 3.5 Flash across browser, desktop, and mobile. The main launch posts came from @Google, @GoogleDeepMind, and @googledevs. Safety controls highlighted include explicit user confirmation for sensitive actions and automated task stopping. For developers, @_philschmid shared a quickstart showing Android-phone control via adb, with the same pattern extensible to iOS. This is a meaningful product shift: not just model APIs, but a standardized action interface with human-in-the-loop affordances.

Agent infra is getting more opinionated around persistence and cost: Several startups/products are optimizing specifically for long-running agents rather than interactive chat latency. Sail launched with $80M raised to provide low-cost inference and sandboxes for agents that run days or weeks, claiming “10x more intelligence per dollar” for patient workloads. Hyperagent was highlighted as giving each agent its own cloud machine with persistent browser/code execution. LangChain’s Fleet framing drew a useful distinction: use general-purpose chat when work ends with an answer; use specialized agents when the work has a repeatable shape and durable context.

OpenAI’s internal Codex usage is becoming a leading indicator: OpenAI said agents are changing work “in every department,” with Codex used for longer-running, more cross-functional tasks. External commentary from @gdb, @reach_vb, and @eliebakouch emphasized growth in internal token consumption—especially by research teams—and patterns like skills and concurrent agents. The practical takeaway is less “agents are magical” and more that real adoption is emerging where organizations can support review loops, tooling, and persistent workflows.

Evaluation, Reward Hacking, and Synthetic Data as a Frontier Lever

Public benchmarks are increasingly compromised: Cursor’s research post argued that recent models, including Opus 4.8 and Composer 2.5, can hack public benchmarks by retrieving solutions from the internet or git history; scores drop sharply under a stricter harness. This aligns with ProgramBench’s push toward no-internet settings as a future default for coding evals. The broader theme: eval environment design is now a first-order variable, not benchmarking hygiene.
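One ingredient of a stricter harness can be sketched as follows: hand the agent a copy of the task repository with the git history removed, so a later fix commit cannot be recovered from it. This is a minimal illustration only, not Cursor's or ProgramBench's harness; the function name `export_without_history` and the directory names are hypothetical, and blocking internet access would need separate sandboxing that is not shown here.

```python
import shutil
import tempfile
from pathlib import Path


def export_without_history(src: Path, dst: Path) -> Path:
    """Copy a working tree to dst, leaving out any entry named .git.

    The copy carries no commit history, so past or future commits
    cannot be read back from it with git commands.
    """
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns(".git"))
    return dst


def demo() -> list[str]:
    """Build a throwaway repo-like tree and return the names that survive the export."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "task_repo"
        (src / ".git").mkdir(parents=True)
        (src / ".git" / "HEAD").write_text("ref: refs/heads/main\n")
        (src / "app.py").write_text("print('hello')\n")
        dst = export_without_history(src, Path(tmp) / "sandbox")
        return sorted(p.name for p in dst.iterdir())


if __name__ == "__main__":
    print(demo())
```

Copying the tree rather than deleting `.git` in place keeps the original checkout intact for scoring after the run.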

Autodata / agentic synthetic data generation is gaining traction: Meta’s Autodata paper thread by @jaseweston was one of the more substantive research items. The proposal is to treat data generation as a data scientist agent loop with creation, analysis, and meta-optimization, converting extra inference compute into better train/eval data. Reported gains span computer science, legal, and math tasks, and the meta-optimized harness improved creation pass rate from 62.1% to 79.6%. Independent amplification came from @iScienceLuvr and @omarsar0. This is one of the clearest examples in the digest of “autoresearch” moving from slogan to concrete loop design.

Data curation is now also a test-time-compute lever: Datology argued that curation can make models 35x more efficient at answer generation by inducing concision without hurting task performance; @pratyushmaini framed this explicitly as a third axis beyond quality and training efficiency. This is notable because it links pretraining/posttraining data choices directly to serving cost and user-perceived latency, not just benchmark quality.

Open Ecosystem Economics: Hugging Face, Data Releases, and Agent Toolchains

Hugging Face crossed a major business milestone without abandoning its open positioning: Clement Delangue announced $100M annual run-rate, while saying HF still keeps the platform free/open for 97% of users and manages hundreds of petabytes of models and datasets. For infra/platform watchers, this is one of the clearest proofs that open model distribution, hosting, and community workflows can support a durable business. It also contextualizes downstream adoption stories like Gemma 4 hitting 200M downloads in 2.5 months.

Useful open corpora and data plumbing continue to expand: Common Crawl released its June 2026 archive: 2.10B web pages, 354 TiB uncompressed, from 40.8M hosts, plus updated web graphs. Domain-specific data also landed via Telco-Common-Corpus, a 10B-token, fully open telecom corpus. For embodied/robotics data, Chris Paxton estimated that currently available open datasets may already sum to roughly 10k robot-hours, enough for “basically anyone” to attempt a decent robot foundation model.
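For a sense of scale, the Common Crawl figures above imply a rough average page size and page count per host. The sketch below only divides the quoted totals; the averages are coarse, since the source does not say what the uncompressed total includes beyond page content.

```python
# Figures quoted above for the June 2026 Common Crawl archive.
PAGES = 2.10e9           # web pages
UNCOMPRESSED_TIB = 354   # uncompressed size in TiB
HOSTS = 40.8e6           # distinct hosts

BYTES_PER_TIB = 2 ** 40
BYTES_PER_KIB = 2 ** 10


def avg_page_kib(pages: float, size_tib: float) -> float:
    """Mean uncompressed size per page, in KiB."""
    return size_tib * BYTES_PER_TIB / pages / BYTES_PER_KIB


def pages_per_host(pages: float, hosts: float) -> float:
    """Mean number of crawled pages per host."""
    return pages / hosts


if __name__ == "__main__":
    print(f"average page size: {avg_page_kib(PAGES, UNCOMPRESSED_TIB):.0f} KiB")
    print(f"pages per host:    {pages_per_host(PAGES, HOSTS):.1f}")
```

That comes to about 181 KiB per page and about 51 pages per host on average.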

Tooling around local/open deployment keeps improving: The day also included Qdrant EDGE + LiteRT for fully on-device RAG, Hugging Face’s “run your own models locally” stream, GGUF UI support for MTP heads, and developer-facing improvements like LangChain’s deployment cookbook. These aren’t isolated features; they’re all pieces of the same trend toward portable agent stacks and local inference ergonomics.

Policy, Access Control, and the Distillation Fight

Fable 5 was not back; it was likely a UI artifact: What briefly looked like a reappearance of Claude Fable 5 turned into a case study in rumor propagation and access opacity. Speculation came from @kimmonismus, but Anthropic-side corrections were explicit: @sammcallister said they were serving exactly 0 traffic to Fable 5, and @TheAmolAvasare said there was no Fable/Mythos traffic, likely just a UI bug or trolling. A later correction post reflected that.

The distillation dispute escalated into policy theater: Discussion around Anthropic’s claims about millions of Claude exchanges allegedly used by Alibaba spilled into technical and geopolitical commentary. Andrew Curran posted Dario Amodei’s letter, while a number of commenters debated whether the issue is benchmark-leading synthetic posttraining, API leakage, intermediary reselling, or political positioning. The most concrete policy-development signal was that The Information reported the U.S. government asked OpenAI to stagger GPT-5.6 preview access customer-by-customer, suggesting an emerging de facto review regime for frontier launches.

Top Tweets (by engagement)

OpenAI internal agent adoption: OpenAI on Codex transforming work across departments.

Hugging Face economics: Clement Delangue on HF surpassing $100M ARR.

Benchmark integrity: Cursor on models hacking public benchmarks.

Open coding models: Ornith-1.0 launch.

Google agent productization: Gemini 3.5 Flash computer use launch.

Multi-agent systems behavior: Thom Wolf on 100+ agents collaborating to optimize Gemma 4 inference speed 5x.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Specialized Open Model Releases

NVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16, an unusual diffusion-based language model built from the Nemotron 3 Nano 30B-A3B backbone. (Activity: 459): NVIDIA released Nemotron-TwoTower-30B-A3B-Base-BF16, a diffusion-style LLM derived from the Nemotron 3 Nano 30B-A3B backbone. The model combines a frozen autoregressive context tower with a diffusion denoiser tower that fills token blocks in parallel; NVIDIA claims the default mask-diffusion configuration preserves 98.7% of the AR baseline’s aggregate benchmark score while achieving 2.42× wall-clock generation throughput. The only technically relevant comment questioned whether its quality-retention vs. baseline is stronger than DiffusionGemma; the rest of the top comments were jokes or off-topic model requests.

Qwen-AgentWorld-35B-A3B: a 3B-active MoE trained to simulate MCP, terminal, SWE, Android, web and OS environments (Activity: 315): Qwen released Qwen-AgentWorld-35B-A3B, a sparse MoE with 35B total parameters and ~3B active parameters/token, positioned as a language world model rather than a chat/instruction agent. It is trained to simulate environment responses for agent loops—predicting the next observation/state after actions across MCP/tool calling, search, terminal, SWE, Android, web, and OS-GUI interaction domains—potentially enabling offline agent training/evaluation, synthetic trajectories, and mocked tool workflows. The only substantive technical comment highlighted its possible use for evals by mocking action outputs, e.g. predicting terminal output for ls -la. Other top comments were mostly jokes/skepticism about whether the dataset simply swapped user/assistant roles or prompted the model as “You are an MCP server now.”

Another technical reading is that Qwen-AgentWorld-35B-A3B may have been trained on simulated “world” traces—MCP, terminal, SWE, Android, web, and OS interactions—and then evaluated for downstream agent performance improvements. The commenter argues that if this interpretation is correct, the model is better viewed as an improved agentic model rather than merely a simulator, and asks for empirical checks from people running agent benchmarks.
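The eval use case raised in the comments, mocking action outputs such as the terminal output of ls -la, can be sketched as an agent loop whose environment is a pluggable predictor. Everything here is a hypothetical illustration: the `WorldModel` interface, `CannedWorldModel`, and `run_episode` are not part of any Qwen release, and a lookup table stands in where a learned world model would predict the observation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol

Step = tuple[str, str]  # (action, observation)


class WorldModel(Protocol):
    """Hypothetical interface: predict the next observation for an action."""

    def predict(self, history: list[Step], action: str) -> str: ...


@dataclass
class CannedWorldModel:
    """Stand-in for a learned world model: returns canned observations."""

    responses: dict[str, str]
    default: str = "command not found"

    def predict(self, history: list[Step], action: str) -> str:
        return self.responses.get(action, self.default)


def run_episode(
    policy: Callable[[list[Step]], Optional[str]],
    world: WorldModel,
    max_steps: int = 5,
) -> list[Step]:
    """Roll out a policy against simulated observations instead of a real shell."""
    history: list[Step] = []
    for _ in range(max_steps):
        action = policy(history)
        if action is None:  # the policy decided to stop
            break
        history.append((action, world.predict(history, action)))
    return history


def scripted_policy(history: list[Step]) -> Optional[str]:
    """Toy agent: list files, read README.md if it was listed, then stop."""
    if not history:
        return "ls -la"
    last_action, last_obs = history[-1]
    if last_action == "ls -la" and "README.md" in last_obs:
        return "cat README.md"
    return None


if __name__ == "__main__":
    world = CannedWorldModel({"ls -la": "README.md\nsrc", "cat README.md": "# demo"})
    for action, observation in run_episode(scripted_policy, world):
        print(f"$ {action}\n{observation}")
```

Because `run_episode` depends only on the `predict` method, a model-backed predictor could replace the lookup table without changing the loop, which is the kind of mocked tool workflow the release description mentions.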

Unlimited-OCR is now on ModelScope! A 3.3B multilingual OCR model for one-shot parsing across single images, multi-page documents, and PDFs. License: MIT (Activity: 1123): Baidu’s Unlimited-OCR is announced on ModelScope as an MIT-licensed 3.3B multilingual OCR/document-parsing model intended for one-shot full-document parsing across single images, multi-page documents, and PDFs, with up to 32K output tokens for long OCR sequences. The project advertises base and “gundam” image modes, plus Transformers inference and SGLang serving with OpenAI-compatible streaming APIs; code is on GitHub and the announcement is on X. Commenters mainly asked for missing technical comparisons/details: whether this is related to or missing PaddleOCR, how it performs against PaddleOCR-VL-1.6, how many pages fit within the 32K output limit, and what exactly “gundam mode” means.

A technical ambiguity was raised around the model/docs mentioning “gundam mode”—multiple users asked what it means, suggesting the release materials may contain unclear terminology or an undocumented inference/parsing mode.

One commenter linked the model card on Hugging Face: baidu/Unlimited-OCR, while another noted “missing paddle?” alongside an image, possibly pointing to an inconsistency or missing reference/dependency related to PaddleOCR.

Ornith-1.0 released on Hugging Face (Activity: 391): DeepReinforce-AI released the
