Smol AI News·2026年5月11日 14:44·約17分

本日は特に目立った出来事なし

#インタラクションモデル #マルチモーダル AI #視覚的プロアクティブ性 #Thinking Machines #リアルタイム処理

TL;DR

Thinking Machines が、従来のターンベース型 LLM に依存せず、聴覚・視覚・思考・行動を同時並行で処理する「ネイティブ双方向インタラクションモデル」を発表し、人間と AI の帯域幅問題を解決する新たなパラダイムシフトを示した。

AI深層分析2026年5月12日 14:03

重要/ 5段階

深度40%

キーポイント

ネイティブ双方向インタラクションモデルの登場

Thinking Machines は、音声やツール使用を後付けするのではなく、ゼロからリアルタイム対話のために訓練された「インタラクションモデル」を発表した。

連続時間意識と並列処理の実現

このモデルは、聴く・話す・見る・考える・検索する・反応することを同時に実行可能であり、「今思考中」「今検索中」といった明確な境界線をなくした。

視覚的プロアクティブ性の付与

現在のシステムには欠けていた「姿勢が悪いと指摘する」「腕立て伏せを数える」ような、環境を監視し自発的に介入する能力が実装された。

専用システムの不要化（ゼロショット）

John Schulman 氏らによると、この新しい型シグネチャにより、以前は特殊なシステムが必要だったタスクが、特別な設定なしで処理可能になる。

影響分析・編集コメントを表示

影響分析

この発表は、AI が単なる情報処理ツールから、人間と常時接続し環境を認識して能動的に行動するパートナーへと進化することを示唆しています。特に視覚的プロアクティブ性の実装は、ロボット工学やリアルタイム監視システムなどへの応用可能性を大きく広げ、業界全体の開発パラダイムを「対話型」から「共存・共働型」へシフトさせる重大な転換点となります。

編集コメント

「ターンベース」という長年の制約を打破し、人間と AI の関係性を再定義する画期的なアプローチです。特に視覚的なプロアクティブ性は、実社会での AI 活用における最大の課題の一つである「受動的すぎる」状態を解決する鍵となるでしょう。

静かな一日。

**2026年5月9日〜11日のAIニュース。12のサブレッド、544件のツイート、およびDiscordはさらに確認されませんでした。AINews のウェブサイトでは過去のすべての号を検索できます。念のため、AINews は現在 Latent Space のセクションの一部となっています。メールの頻度を選択的に設定（購読または解除）することができます！

AI ツイートリキャップ

Thinking Machines のネイティブ相互作用モデルとターンベース型AIを超えた転換**

フルデュプレックス型マルチモーダル相互作用を第一級のモデル機能として：本日の最も明確な技術的テーマは、Thinking Machines が発表した「インタラクションモデル」のプレビューでした。これは、音声やターン制の切り替え、ツール利用などを既存のターンベース LLM に重ねるのではなく、リアルタイム相互作用のためにゼロから訓練されたモデルと説明されています。これに伴う技術記事および @johnschulman2、@soumithchintala、@cHHillee によるチームの見解は、これを人間↔AI の帯域幅の問題として捉えています：モデルは同時に聴き、話し、見、考え、検索し、反応できるべきです。デモでは、連続時間における意識の維持、割り込み処理、同時発話、視覚的な能動性、そして明示的な「今考えている／今検索している」という境界線なしでのバックグラウンドツール利用が強調されました。チームメンバーはまた、以前は専用システムを必要としていた多くのタスクが、タイプシグネチャが実質的に連続するオーディオ＋ビデオ＋テキスト → オーディオ＋テキストとなることでゼロショットで実行可能になる点も指摘しました (@johnschulman2)。

技術的になぜ重要か：複数の反応が同じ点を指摘した。これは「別のチャットボットのデモ」ではなく、インターフェースの前提条件そのものの変化である。@liliyu_lili は、現在のシステムに欠けているプリミティブとして視覚的な能動性（「姿勢が悪くなったら教えて」「腕立て伏せを数えて」といった機能）を挙げた。@rown はこれを、視覚的に能動的な最初の一般向け動画＋音声モデルと呼んだ。@kimmonismus と@giffmana の両者は、生来的な相互操作性こそが、単なるベンチマークの数値以上の深い革新であると強調した。この発表はまた、@swyx が指摘するように、「リアルタイム」マルチモーダルシステムに対する基準を暗黙的に引き上げる結果となった。実装の詳細については、@eliebakouch によって SGLang を使用しているスタックであることが明らかになった。

OpenAI のエンタープライズおよびセキュリティへの取り組み：デプロイメント会社と Daybreak

OpenAI は下流のサービスやデプロイメントへと進出している。OpenAI は、最先端モデルを実際のワークフローに展開する企業を支援するために構築された、過半数を保有する子会社「OpenAI Deployment Company」を発表した。重要な運用上の詳細は、Tomoro の買収を通じて 150 名のフォワードデプロイエンジニアおよびデプロイメントスペシャリストが加わることである。@gdb は、これに 19 社のパートナーから初期投資として 40 億ドルが投入されたことを引用している。複数の観察者はこれを、OpenAI が Palantir や Microsoft のようなフィールドエンジニアリングモデルを採用したと解釈している。@kimmonismus は、OpenAI が AI エコノミーにおけるデプロイメント層を支配したいと考えていると論じ、@matvelloso は、技術スタッフを顧客の運用現場に密着させるという歴史的なエンタープライズ成功のパターンへとこれを結びつけた。

⟦CODE_0⟧

技術的になぜ重要か：複数の反応が同じ点を指摘した。これは「別のチャットボットのデモ」ではなく、インターフェースの前提条件そのものの変化である。@liliyu_lili は、現在のシステムに欠けているプリミティブとして視覚的な能動性（「姿勢が悪くなったら教えて」「腕立て伏せを数えて」といった機能）を挙げた。@rown はこれを、視覚的に能動的な最初の一般向け動画＋音声モデルと呼んだ。@kimmonismus と@giffmana の両者は、生来的な相互操作性こそが、単なるベンチマークの数値以上の深い革新であると強調した。この発表はまた、@swyx が指摘するように、「リアルタイム」マルチモーダルシステムに対する基準を暗黙的に引き上げる結果となった。実装の詳細については、@eliebakouch によって SGLang を使用しているスタックであることが明らかになった。

OpenAI のエンタープライズおよびセキュリティへの取り組み：デプロイメント会社と Daybreak

OpenAI は下流のサービスやデプロイメントへと進出している。OpenAI は、最先端モデルを実際のワークフローに展開する企業を支援するために構築された、過半数を保有する子会社「OpenAI Deployment Company」を発表した。重要な運用上の詳細は、Tomoro の買収を通じて 150 名のフォワードデプロイエンジニアおよびデプロイメントスペシャリストが加わることである。@gdb は、これに 19 社のパートナーから初期投資として 40 億ドルが投入されたことを引用している。複数の観察者はこれを、OpenAI が Palantir や Microsoft のようなフィールドエンジニアリングモデルを採用したと解釈している。@kimmonismus は、OpenAI が AI エコノミーにおけるデプロイメント層を支配したいと考えていると論じ、@matvelloso は、技術スタッフを顧客の運用現場に密着させるという歴史的なエンタープライズ成功のパターンへとこれを結びつけた。

⟦CODE_1⟧

Daybreak: セキュリティ固有のモデル配布、ワークフロー、信頼ティア：OpenAI はまた、防御的なサイバー運用とソフトウェアの継続的なセキュリティを軸とした包括的な取り組み「Daybreak」を発表し、@sama 氏はこれを急速に向上する AI サイバー能力に対する実用的な対応として位置付けています。@TheRundownAI が要約した製品提案は、GPT-5.5、Codex、リポジトリ脅威モデリング、脆弱性発見、パッチ生成、レスポンス自動化を組み合わせたものであり、サイバー向け信頼アクセス（Trusted Access for Cyber）やより専門的な GPT-5.5-Cyber といった差別化されたアクセスティアを含んでいます。これは Anthropic のより制限的なサイバー姿勢とは対照的であり、この緊張関係は @kimmonismus によって捉えられています。安全なエージェントシステムを構築するチームにとって、@lukOlejnik からの別の警告が関連します。「あなたの LLM はセキュリティ境界ではない」— Microsoft Semantic Kernel では、フレームワークがモデル自体の失敗ではなくモデル出力を過信したため、プロンプトインジェクションがホストレベルの RCE（リモートコード実行）に転換された reportedly 報告されています。

エージェントハネス、ローカルファーストツールリング、およびコントロールサーフェス

より優れたエージェント制御プレーンが製品カテゴリとなりつつある：有用なエージェントには自律性が求められる一方で、エンジニアは可逆的で検証可能な制御を依然として望むという不満が繰り返し指摘されています。@itsclelia は aggit という Rust CLI でこれに対処しました。これはローカル/リモート環境向けで S3 ベースのストレージを利用するもので、メインの Git 履歴外でstash/branch/restore のセマンティクスを可能にします。同様の文脈で、@_catwu は複数の Claude Code エージェントを管理するための新しい Claude エージェント用ターミナル制御プレーンを紹介し、@cursor_ai は Cursor を Microsoft Teams に統合しました。これによりエージェントはスレッド全体を読み込み、プルリクエストを作成します。これらはすべて、「エージェントオーケストレーション」がプロンプトの技量だけでなく、具体的な UX パターンへと収束している兆候です。

Deep Agents / Hermes / ローカルエージェントは急速に成熟しています：@masondrxy は、Deep Agents CLI が会話中にコンテキストを失うことなく基盤となるモデルプロバイダーをホットスワップできることを指摘しました。これは多くのエージェントスタックがまだ欠いている非自明なシステム機能です。LangChain もまた、プロバイダー/モデル固有のチューニングのためのハネスプロファイル（ツイート）を強調し、同じ著者による別の価格分析では、高ボリュームのエージェントワークロードにおいて DeepSeek V4 Flash が GPT/Gemini のフラッシュティアオプションよりも劇的に安価になり得ると論じています（ツイート）。ローカル側では、Hugging Face がローカルアプリに Hermes Agent サポートとネイティブのトレース可視化を追加しました。一方、@Teknium は Hermes Agent と CUA を介してあらゆるモデルでのコンピュータ操作をプレビューし、特にローカル/オープンモデルおよびフロンティア API を標的としています。OpenClaw および関連するオープンハネスにおけるローカルモデルの改善のために Hugging Face に参加した @onusoz の動きも、ローカルエージェントの使いやすさが今や戦略的なインフラストラクチャとなったことを示す強力なシグナルです。

ツールを巡る新たな設計思想として：@threepointone は、エージェントは漸近的に検索と実行という 2 つの原初的なツールのみを必要とするようになり、拡張し続ける静的なツールメニューではなく、機能の動的な意味論的発見を求めるようになるだろうと主張しました。これは、巨大なモノリシックプロンプトから構成可能なハネスへと移行するより広範な動きを補完するものです。

ベンチマーク、効率性、およびオープンモデル経済学

コーディングエージェントのベンチマークにおいて、ついにハーンネスとモデルのペアを測定する段階に至りました：Artificial Analysis が SWE-Bench-Pro-Hard-AA、Terminal-Bench v2、SWE-Atlas-QnA にわたる「Coding Agent Index」を発表し、単なるモデルだけでなく、モデルとハーンネス（注：評価環境・実行基盤）の組み合わせを比較しています。その主要な結果は、Cursor CLI での Opus 4.7 がスコア 61 を記録し、GPT-5.5 は Codex/Claude Code でこれに続いています。トップクラスのオープンウェイト設定には GLM-5.1、Kimi K2.6、DeepSeek V4 Pro（Claude Code 環境）が含まれており、依然として競争力がありますが、明確な差があります。また、このベンチマークは、タスクあたりのコスト（30 倍以上のばらつき）、トークン使用量（3 倍以上）、キャッシュヒット率（80–96%）、タスクあたりの処理時間（7 倍以上）における大きな変動も浮き彫りにしました。このベンチマークには、OpenHands のソフトウェアエンジニアリング向けベンチマーク更新発表（ツイート）と、オフィス業務、財務、ターミナル、Web タスクにわたるよりエージェント的なタスクミックスを持つ Claw-Eval が補完しています。Claw-Eval では MiMo-V2.5-Pro が首位を走り、DeepSeek V4 Flash はそのサイズに対して異例の効率性を示しました。

TurboQuant に対する懐疑論が高まっています：複数の投稿が、最近流行している量子化/サービング技術（注：モデル軽量化・推論高速化手法）について、より冷静な見解を示しています。@_EldarKurtic は、TurboQuant の精度、レイテンシ、スループットを網羅した初の包括的研究とされるものを提示し、@vllm_project は Red Hat と vLLM による調査をその出発点としてリンクしました。また、@jbhuang0604 は結論を「実際にはあまりうまく機能しない」と率直に要約しています。これはまさに、独立した再現が重要となるインフラに関する主張の典型例です。

ローカル/オープンモデルは、ハードウェアの限界を超えてハードウェアの制約よりも速く進化し続けています：@ClementDelangue はここで最も強力な高レベルの論拠を示しました。同じトップエンドMacBook Proのメモリ上限において、「実際に実行可能な最も賢いオープンウェイトモデル」は、Llama 3 70B時代の能力から、DeepSeek V4 Flash mixed-Q2 GGUF時代の能力へと、約24ヶ月で約4.7倍向上し、これは10.7ヶ月ごとに倍増するペースであり、ムーアの法則よりも速いです。GGUFアップロードの急成長に関する@victormustarのデータや、Qwen 3.6、Gemma 4、DeepSeek派生モデルが現在、非自明なエージェントタスクのためにローカルで利用可能であるという繰り返されるコミュニティの観察が、これを裏付けています。

研究ハイライト：MoEモジュラリティ、拡散/バイトモデル、およびエージェントダイナミクス

アーキテクチャと評価：AllenAIのEMOは、@TheTuringPostによって、ドキュメントレベルのルーティングが共有エキスパートプールを誘発する、よりモジュラーなMixture-of-Experts（専門家混合）設計として注目されました。特筆すべきは、同様のプルーニング条件下で標準的なMoEでは10〜15%の性能低下があるのに対し、EMOではエキスパートの25%のみを保持しても約1%の性能低下にとどまると報告されている点です（フォローアップ）。生成評価においては、@qberthetがFIDに代わるより高速でサンプル効率的な代替案として、MIND（Monge Inception Distance）を紹介しました。

Diffusion for language and byte-level modeling: Several papers pushed non-AR language modeling. @LucaAmb reported continuous bitstream diffusion nearly matching autoregressive models under their evaluation setup; @JulieKallini introduced Fast BLT, using diffusion for parallel byte decoding to make byte-level LMs less inference-bound; @sriniiyer88 framed it as combining block byte-diffusion with self-speculative decoding. Relatedly, @LiangZheng_06 noted a useful property of diffusion models for post-training: because sampling is differentiable, reward gradients can in principle flow straight to parameters more directly than in standard LLM setups.

Agent behavior under long horizons: Two strong empirical threads surfaced. First, "The Memory Curse" claims long histories degrade cooperation in multi-round social dilemmas because models become more history-following and risk-minimizing, with explicit CoT sometimes amplifying the problem. Second, PwC work summarized by @dair_ai argues that the value of clarification is highly time-dependent: goal clarification loses most of its value after ~10% of execution, while input clarification remains useful longer. Together these suggest that long-horizon agent quality is constrained as much by memory/control policy as by raw model IQ.

スケーリングと自己改善：WilliamBarrHeld 氏によって要約された Marin の Delphi スケーリング作業では、小規模な事前学習モデルから 250 億/6,000 億トークンの実行へと外挿する際に予測誤差が 0.2% であると主張されています。一方、omarsar0 氏は AutoTTS を紹介しました。これは LLM がテスト時のスケーリング制御空間自体を検索する仕組みで、発見コスト約 39.9 ドルで手動設計された戦略を上回る成果を報告しています。

エンゲージメント上位のツイート

OpenAI の企業向け・サービス展開：OpenAI が Deployment Company を設立し Tomoro を買収。150 名の FDE（フルタイムエンジニア）を採用。

OpenAI のセキュリティ製品化：Daybreak の発表と @sama による枠組みの提示。

Thinking Machines の相互作用モデル：Mira Murati 氏のローンチツイートおよび技術プレビュースレッド。

Artificial Analysis Coding Agent Index：ベンチマークの立ち上げと主要な発見結果。

エージェントツールリング/開発者ワークフロー：どのモデルでも使用可能な Hermes Agent のコンピュータ操作機能、Microsoft Teams 内の Cursor、Codex OpenAI Developers プラグイン。

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 ローカル推論の進展

Unsloth における MTP（Activity: 620）：画像（リンク）は、Unsloth の Hugging Face プロフィールを示しており、新たに公開された MTP を維持する GGUF ビルドとして unsloth/Qwen3.6-27B-GGUF-MTP と unsloth/Qwen3.6-35B-A3B-GGUF-MTP がリストされています。この投稿の技術的な意義は、これらの GGUF ファイルが MTP / 次トークン予測層を保持している点ですが、ユーザーは標準的な llama.cpp のサポートに頼るのではなく、特定の llama.cpp MTP プルリクエスト（PR）をビルドする必要があります。あるコメントでは、27B GGUF でランタイム/アサート失敗が発生したと報告されています：GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0")。これは、メタデータの解析、モデル変換、または PR の互換性に関する未解決の問題が残っている可能性を示唆しています。コメントには、upstream llama.cpp の MTP サポートへの期待が反映されており、ユーザーは GitHub リポジトリを繰り返し確認し、「箱から出してすぐに」MTP がサポートされているかどうかを尋ねています。

新しい 27B GGUF モデルをコンパイルしたあるユーザーが、qwen35_mtp.cpp でランタイムアサートに遭遇しました：GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0")。これは、GGUF/モデルのメタデータまたは変換パスに、Qwen3.5 MTP の予測的/次トークン予測層に必要な nextn_predict_layers が欠落している可能性を示唆しています。

ある技術スレッドでは、GGUF における MTP（Multi-Token Prediction）サポートがローカル推論において重要であると指摘されており、特に 35B A3B バリアントについては、文脈長処理の改善に関連しているというコメントがあります。別の投稿者は、これが llama.cpp が今や「そのまま」MTP をサポートすることを意味するのかと問いかけ、そのサポートがマージされて安定版として利用可能なのか、それとも PR やフォークでのみ利用可能な状態なのかについて不確実性を示唆しています。

ある投稿者は、ik_llama の MTP 機能が現在、llama.cpp の PR よりも高速であると主張し、さらに Hadamard ベースの量子化（quants）をサポートしていると付け加えています。これは「turboquants」に似たものとして説明されており、ローカル MTP 推論バックエンドを比較するユーザーにとって潜在的に関連性の高い実装・パフォーマンス上の区別となります。

Qwen 3.6 35B A3B の熱狂は本物です!!! (アクティビティ: 586): この投稿は、いくつかの小型/ローカル長文脈オープンウェイトモデル—Qwen 3.6 35B A3B、Qwen 3.6 27B、Gemma 4 26B A4B、そしてNemotron 3 Nano—が学術論文と対応する研究コードを与えられ、実装の詳細を論文にマッピングするよう求められたという定性的なコード理解評価について報告しています。著者の詳細なノートは、この GitHub README にあります。主な主張は、gated delta netやハイブリッド M などの新しい長文脈メカニズムが鍵となる点です。

⟦CODE_0⟧

原文を表示

a quiet day.

AI News for 5/9/2026-5/11/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Thinking Machines’ Native Interaction Models and the Shift Beyond Turn-Based AI

Full-duplex multimodal interaction as a first-class model capability: The day’s clearest technical theme was Thinking Machines’ preview of “interaction models”, described as models trained from scratch for real-time interaction rather than layering speech, turn-taking, and tool use onto a turn-based LLM. The accompanying technical post and team commentary from @johnschulman2, @soumithchintala, and @cHHillee frame this as a human↔AI bandwidth problem: models should be able to listen, speak, watch, think, search, and react concurrently. Demos emphasized continuous-time awareness, interruption handling, simultaneous speech, visual proactivity, and background tool use without explicit “now I’m thinking / now I’m searching” boundaries. Team members also highlighted that many tasks that previously needed special-purpose systems become zero-shot once the type signature is effectively continuous audio+video+text → audio+text (@johnschulman2).

Why it matters technically: Several reactions converged on the same point: this is not “another chatbot demo” but a change in interface assumptions. @liliyu_lili pointed to visual proactivity (“tell me when I start slouching”, “count my pushups”) as a missing primitive in current systems; @rown called it the first general video+speech model that is visually proactive; @kimmonismus and @giffmana both emphasized that native interactivity is the deeper innovation than raw benchmark claims. This launch also implicitly raises the bar for “realtime” multimodal systems, as noted by @swyx. One implementation detail surfaced via @eliebakouch: the stack is using SGLang.

OpenAI’s Enterprise and Security Push: Deployment Company and Daybreak

OpenAI is moving down-stack into services and deployment: OpenAI announced the OpenAI Deployment Company, a majority-owned unit built to help enterprises deploy frontier models into real workflows. The key operating detail is 150 Forward Deployed Engineers and Deployment Specialists coming in via the acquisition of Tomoro, with @gdb citing $4B of initial investment from 19 partners. Multiple observers read this as OpenAI adopting a Palantir-/Microsoft-style field-engineering model: @kimmonismus argued OpenAI wants to own the deployment layer of the AI economy, while @matvelloso connected it to the historical enterprise success pattern of embedding technical staff close to customer operations.

Daybreak: security-specific model distribution, workflow, and trust tiers: OpenAI also launched Daybreak, an umbrella effort around defensive cyber operations and continuously securing software, with @sama positioning it as a practical response to rapidly improving AI cyber capability. The product pitch, summarized by @TheRundownAI, combines GPT-5.5, Codex, repository threat modeling, vuln discovery, patch generation, and response automation, with differentiated access tiers including Trusted Access for Cyber and a more specialized GPT-5.5-Cyber. This stands in contrast to Anthropic’s more restrictive cyber posture, a tension captured by @kimmonismus. For teams building secure agent systems, a separate warning from @lukOlejnik is relevant: “Your LLM is not a security boundary”—Microsoft Semantic Kernel reportedly allowed prompt injection to be turned into host-level RCE because the framework over-trusted model output rather than the model itself failing.

Agent Harnesses, Local-First Tooling, and Control Surfaces

Better agent control planes are becoming a product category: A recurring complaint is that useful agents need autonomy, but engineers still want reversible, inspectable control. @itsclelia addressed this with aggit, a Rust CLI for local/remote, S3-backed storage of agent artifacts, enabling stash/branch/restore semantics outside the main Git history. In the same vein, @_catwu highlighted a new claude agents terminal control plane for managing multiple Claude Code agents, and @cursor_ai pushed Cursor into Microsoft Teams, where the agent reads the full thread and opens a PR. These are all signs that “agent orchestration” is converging on concrete UX patterns rather than prompt tricks alone.

Deep Agents / Hermes / local agents are maturing quickly: @masondrxy noted that Deep Agents CLI can hot-swap underlying model providers mid-conversation without losing context, a nontrivial systems capability that many agent stacks still miss. LangChain also highlighted harness profiles for provider/model-specific tuning (tweet), and separate pricing analysis from the same author argued that DeepSeek V4 Flash can be dramatically cheaper than GPT/Gemini flash-tier options for high-volume agent workloads (tweet). On the local side, Hugging Face added Hermes Agent support in local apps plus native trace visualization, while @Teknium previewed computer use with any model via Hermes Agent and CUA, explicitly targeting local/open models as well as frontier APIs. @onusoz joining Hugging Face to improve local models in OpenClaw and related open harnesses is another strong signal that local agent ergonomics are now strategic infrastructure.

A design thesis emerging around tools: @threepointone argued that agents may asymptotically want just two primitive tools: search and execute, with dynamic semantic discovery of capabilities rather than ever-expanding static tool menus. That complements the broader move toward configurable harnesses instead of giant monolithic prompts.

Benchmarks, Efficiency, and Open-Model Economics

Coding-agent benchmarking is finally measuring harness+model pairs: Artificial Analysis launched a Coding Agent Index spanning SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA, comparing not just models but model+harness combinations. Their topline: Opus 4.7 in Cursor CLI scored 61, with GPT-5.5 in Codex/Claude Code close behind; top open-weight setups included GLM-5.1, Kimi K2.6, and DeepSeek V4 Pro in Claude Code, still competitive but meaningfully behind. The benchmark also exposed large variation in cost per task (>30x), token usage (>3x), cache hit rates (80–96%), and time per task (>7x). That benchmark was complemented by OpenHands’ updated software-engineering benchmark announcement (tweet) and Claw-Eval’s more agentic task mix across office, finance, terminal, and web tasks, where MiMo-V2.5-Pro led and DeepSeek V4 Flash looked unusually efficient for its size.

TurboQuant skepticism is increasing: Multiple posts pointed to a more sober view of the recently popular quantization/serving technique. @_EldarKurtic presented what he described as the first comprehensive study of TurboQuant, covering accuracy, latency, and throughput; @vllm_project linked the Red Hat / vLLM investigation as a starting point; and @jbhuang0604 bluntly summarized the takeaway as “it doesn’t really work well.” This is exactly the sort of infra claim where independent reproduction matters.

Local/open models continue to improve faster than hardware ceilings: @ClementDelangue made the strongest high-level argument here: on the same top-end MacBook Pro memory ceiling, the “smartest open-weight model you can actually run” improved from Llama 3 70B-era capability to DeepSeek V4 Flash mixed-Q2 GGUF-era capability at roughly 4.7x in 24 months, implying a doubling every 10.7 months, faster than Moore’s Law. Supporting datapoints came from @victormustar on the rapid growth of GGUF uploads and from repeated community observations that Qwen 3.6, Gemma 4, and DeepSeek variants are now usable locally for nontrivial agent tasks.

Research Highlights: MoE Modularity, Diffusion/Byte Models, and Agent Dynamics

Architectures and evaluation: AllenAI’s EMO was highlighted by @TheTuringPost as a more modular Mixture-of-Experts design where document-level routing induces shared expert pools; notably, keeping only 25% of experts reportedly costs just ~1% performance versus 10–15% degradation in standard MoEs under similar pruning (follow-up). On generative evaluation, @qberthet introduced MIND (Monge Inception Distance) as a purportedly faster, more sample-efficient replacement for FID.

Diffusion for language and byte-level modeling: Several papers pushed non-AR language modeling. @LucaAmb reported continuous bitstream diffusion nearly matching autoregressive models under their evaluation setup; @JulieKallini introduced Fast BLT, using diffusion for parallel byte decoding to make byte-level LMs less inference-bound; @sriniiyer88 framed it as combining block byte-diffusion with self-speculative decoding. Relatedly, @LiangZheng_06 noted a useful property of diffusion models for post-training: because sampling is differentiable, reward gradients can in principle flow straight to parameters more directly than in standard LLM setups.

Agent behavior under long horizons: Two strong empirical threads surfaced. First, “The Memory Curse” claims long histories degrade cooperation in multi-round social dilemmas because models become more history-following and risk-minimizing, with explicit CoT sometimes amplifying the problem. Second, PwC work summarized by @dair_ai argues that the value of clarification is highly time-dependent: goal clarification loses most of its value after ~10% of execution, while input clarification remains useful longer. Together these suggest that long-horizon agent quality is constrained as much by memory/control policy as by raw model IQ.

Scaling and self-improvement: Marin’s Delphi scaling work, summarized by @WilliamBarrHeld, claims a 0.2% prediction error when extrapolating from small pretrains to a 25B / 600B token run. Separately, @omarsar0 highlighted AutoTTS, where an LLM searches the test-time scaling controller space itself, reportedly beating hand-designed strategies for about $39.9 of discovery cost.

Top tweets (by engagement)

OpenAI’s enterprise/services move: OpenAI launches the Deployment Company and Tomoro acquisition / 150 FDEs.

OpenAI’s security productization: Daybreak announcement and @sama’s framing.

Thinking Machines’ interaction models: Mira Murati’s launch tweet and the technical preview thread.

Artificial Analysis Coding Agent Index: benchmark launch and topline findings.

Agent tooling / developer workflow: Hermes Agent computer use with any model, Cursor in Microsoft Teams, and Codex OpenAI Developers plugin.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 Local Inference Advances

MTP on Unsloth (Activity: 620): The image (link) shows Unsloth’s Hugging Face profile listing newly published MTP-preserving GGUF builds: unsloth/Qwen3.6-27B-GGUF-MTP and unsloth/Qwen3.6-35B-A3B-GGUF-MTP. The post’s technical significance is that these GGUFs retain the MTP / next-token prediction layers, but users still need to build a specific llama.cpp MTP PR rather than relying on standard llama.cpp support. One commenter reports a runtime/assertion failure with the 27B GGUF: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0"), suggesting either metadata parsing, model conversion, or PR compatibility issues remain unresolved. Comments reflect anticipation for upstream llama.cpp MTP support, with users repeatedly checking the GitHub repo and asking whether MTP is now supported “out of the box.”

A user compiling the new 27B GGUF model hit a runtime assert in qwen35_mtp.cpp: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0"). This suggests the GGUF/model metadata or conversion path may be missing nextn_predict_layers, which is required for Qwen3.5 MTP speculative/next-token prediction layers.

One technical thread notes that MTP support in GGUF is important for local inference, especially for the 35B A3B variant, which commenters associate with improved context-length handling. Another commenter asks whether this means llama.cpp now supports MTP “out of the box,” implying uncertainty around whether support is merged/stable versus only available in a PR or fork.

A commenter claims ik_llama MTP is currently faster than the llama.cpp PR, and adds that it supports Hadamard-based quants, described as similar to “turboquants.” This is a potentially relevant implementation/performance distinction for users comparing local MTP inference backends.

The Qwen 3.6 35B A3B hype is real!!! (Activity: 586): The post reports a qualitative code-understanding eval where several small/local long-context open-weight models—Qwen 3.6 35B A3B, Qwen 3.6 27B, Gemma 4 26B A4B, and Nemotron 3 Nano—were given an academic paper plus corresponding research code and asked to map implementation details back to the paper; the author’s detailed notes are in this GitHub README. The key claim is that newer long-context mechanisms such as gated delta net, hybrid M

この記事をシェア

KDnuggets重要度42026年6月25日 23:00

テキスト、画像、音声、動画を処理する 5 つのオープンソース・オムニ AI モデル

Vercel Blog重要度42026年6月25日 16:00

AI SDK 7 が利用可能に

MarkTechPost重要度42026年6月25日 05:00

Gradium、リアルタイム音声翻訳モデル「stt-translate」と「s2s-translate」を公開し、精度と遅延で競合を上回る

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Smol AI News·2026年5月11日 14:44·約17分

本日は特に目立った出来事なし

#インタラクションモデル #マルチモーダル AI #視覚的プロアクティブ性 #Thinking Machines #リアルタイム処理

TL;DR

AI深層分析2026年5月12日 14:03

重要/ 5段階

深度40%

キーポイント

ネイティブ双方向インタラクションモデルの登場

連続時間意識と並列処理の実現

視覚的プロアクティブ性の付与

現在のシステムには欠けていた「姿勢が悪いと指摘する」「腕立て伏せを数える」ような、環境を監視し自発的に介入する能力が実装された。

専用システムの不要化（ゼロショット）

John Schulman 氏らによると、この新しい型シグネチャにより、以前は特殊なシステムが必要だったタスクが、特別な設定なしで処理可能になる。

影響分析・編集コメントを表示

影響分析

編集コメント

静かな一日。

AI ツイートリキャップ

Thinking Machines のネイティブ相互作用モデルとターンベース型AIを超えた転換**

フルデュプレックス型マルチモーダル相互作用を第一級のモデル機能として：本日の最も明確な技術的テーマは、Thinking Machines が発表した「インタラクションモデル」のプレビューでした。これは、音声やターン制の切り替え、ツール利用などを既存のターンベース LLM に重ねるのではなく、リアルタイム相互作用のためにゼロから訓練されたモデルと説明されています。これに伴う技術記事および @johnschulman2、@soumithchintala、@cHHillee によるチームの見解は、これを人間↔AI の帯域幅の問題として捉えています：モデルは同時に聴き、話し、見、考え、検索し、反応できるべきです。デモでは、連続時間における意識の維持、割り込み処理、同時発話、視覚的な能動性、そして明示的な「今考えている／今検索している」という境界線なしでのバックグラウンドツール利用が強調されました。チームメンバーはまた、以前は専用システムを必要としていた多くのタスクが、タイプシグネチャが実質的に連続するオーディオ＋ビデオ＋テキスト → オーディオ＋テキストとなることでゼロショットで実行可能になる点も指摘しました (@johnschulman2)。

技術的になぜ重要か：複数の反応が同じ点を指摘した。これは「別のチャットボットのデモ」ではなく、インターフェースの前提条件そのものの変化である。@liliyu_lili は、現在のシステムに欠けているプリミティブとして視覚的な能動性（「姿勢が悪くなったら教えて」「腕立て伏せを数えて」といった機能）を挙げた。@rown はこれを、視覚的に能動的な最初の一般向け動画＋音声モデルと呼んだ。@kimmonismus と@giffmana の両者は、生来的な相互操作性こそが、単なるベンチマークの数値以上の深い革新であると強調した。この発表はまた、@swyx が指摘するように、「リアルタイム」マルチモーダルシステムに対する基準を暗黙的に引き上げる結果となった。実装の詳細については、@eliebakouch によって SGLang を使用しているスタックであることが明らかになった。

OpenAI のエンタープライズおよびセキュリティへの取り組み：デプロイメント会社と Daybreak

OpenAI は下流のサービスやデプロイメントへと進出している。OpenAI は、最先端モデルを実際のワークフローに展開する企業を支援するために構築された、過半数を保有する子会社「OpenAI Deployment Company」を発表した。重要な運用上の詳細は、Tomoro の買収を通じて 150 名のフォワードデプロイエンジニアおよびデプロイメントスペシャリストが加わることである。@gdb は、これに 19 社のパートナーから初期投資として 40 億ドルが投入されたことを引用している。複数の観察者はこれを、OpenAI が Palantir や Microsoft のようなフィールドエンジニアリングモデルを採用したと解釈している。@kimmonismus は、OpenAI が AI エコノミーにおけるデプロイメント層を支配したいと考えていると論じ、@matvelloso は、技術スタッフを顧客の運用現場に密着させるという歴史的なエンタープライズ成功のパターンへとこれを結びつけた。

⟦CODE_0⟧

技術的になぜ重要か：複数の反応が同じ点を指摘した。これは「別のチャットボットのデモ」ではなく、インターフェースの前提条件そのものの変化である。@liliyu_lili は、現在のシステムに欠けているプリミティブとして視覚的な能動性（「姿勢が悪くなったら教えて」「腕立て伏せを数えて」といった機能）を挙げた。@rown はこれを、視覚的に能動的な最初の一般向け動画＋音声モデルと呼んだ。@kimmonismus と@giffmana の両者は、生来的な相互操作性こそが、単なるベンチマークの数値以上の深い革新であると強調した。この発表はまた、@swyx が指摘するように、「リアルタイム」マルチモーダルシステムに対する基準を暗黙的に引き上げる結果となった。実装の詳細については、@eliebakouch によって SGLang を使用しているスタックであることが明らかになった。

OpenAI のエンタープライズおよびセキュリティへの取り組み：デプロイメント会社と Daybreak

OpenAI は下流のサービスやデプロイメントへと進出している。OpenAI は、最先端モデルを実際のワークフローに展開する企業を支援するために構築された、過半数を保有する子会社「OpenAI Deployment Company」を発表した。重要な運用上の詳細は、Tomoro の買収を通じて 150 名のフォワードデプロイエンジニアおよびデプロイメントスペシャリストが加わることである。@gdb は、これに 19 社のパートナーから初期投資として 40 億ドルが投入されたことを引用している。複数の観察者はこれを、OpenAI が Palantir や Microsoft のようなフィールドエンジニアリングモデルを採用したと解釈している。@kimmonismus は、OpenAI が AI エコノミーにおけるデプロイメント層を支配したいと考えていると論じ、@matvelloso は、技術スタッフを顧客の運用現場に密着させるという歴史的なエンタープライズ成功のパターンへとこれを結びつけた。

⟦CODE_1⟧

Daybreak: セキュリティ固有のモデル配布、ワークフロー、信頼ティア：OpenAI はまた、防御的なサイバー運用とソフトウェアの継続的なセキュリティを軸とした包括的な取り組み「Daybreak」を発表し、@sama 氏はこれを急速に向上する AI サイバー能力に対する実用的な対応として位置付けています。@TheRundownAI が要約した製品提案は、GPT-5.5、Codex、リポジトリ脅威モデリング、脆弱性発見、パッチ生成、レスポンス自動化を組み合わせたものであり、サイバー向け信頼アクセス（Trusted Access for Cyber）やより専門的な GPT-5.5-Cyber といった差別化されたアクセスティアを含んでいます。これは Anthropic のより制限的なサイバー姿勢とは対照的であり、この緊張関係は @kimmonismus によって捉えられています。安全なエージェントシステムを構築するチームにとって、@lukOlejnik からの別の警告が関連します。「あなたの LLM はセキュリティ境界ではない」— Microsoft Semantic Kernel では、フレームワークがモデル自体の失敗ではなくモデル出力を過信したため、プロンプトインジェクションがホストレベルの RCE（リモートコード実行）に転換された reportedly 報告されています。

エージェントハネス、ローカルファーストツールリング、およびコントロールサーフェス

より優れたエージェント制御プレーンが製品カテゴリとなりつつある：有用なエージェントには自律性が求められる一方で、エンジニアは可逆的で検証可能な制御を依然として望むという不満が繰り返し指摘されています。@itsclelia は aggit という Rust CLI でこれに対処しました。これはローカル/リモート環境向けで S3 ベースのストレージを利用するもので、メインの Git 履歴外でstash/branch/restore のセマンティクスを可能にします。同様の文脈で、@_catwu は複数の Claude Code エージェントを管理するための新しい Claude エージェント用ターミナル制御プレーンを紹介し、@cursor_ai は Cursor を Microsoft Teams に統合しました。これによりエージェントはスレッド全体を読み込み、プルリクエストを作成します。これらはすべて、「エージェントオーケストレーション」がプロンプトの技量だけでなく、具体的な UX パターンへと収束している兆候です。

Deep Agents / Hermes / ローカルエージェントは急速に成熟しています：@masondrxy は、Deep Agents CLI が会話中にコンテキストを失うことなく基盤となるモデルプロバイダーをホットスワップできることを指摘しました。これは多くのエージェントスタックがまだ欠いている非自明なシステム機能です。LangChain もまた、プロバイダー/モデル固有のチューニングのためのハネスプロファイル（ツイート）を強調し、同じ著者による別の価格分析では、高ボリュームのエージェントワークロードにおいて DeepSeek V4 Flash が GPT/Gemini のフラッシュティアオプションよりも劇的に安価になり得ると論じています（ツイート）。ローカル側では、Hugging Face がローカルアプリに Hermes Agent サポートとネイティブのトレース可視化を追加しました。一方、@Teknium は Hermes Agent と CUA を介してあらゆるモデルでのコンピュータ操作をプレビューし、特にローカル/オープンモデルおよびフロンティア API を標的としています。OpenClaw および関連するオープンハネスにおけるローカルモデルの改善のために Hugging Face に参加した @onusoz の動きも、ローカルエージェントの使いやすさが今や戦略的なインフラストラクチャとなったことを示す強力なシグナルです。

ツールを巡る新たな設計思想として：@threepointone は、エージェントは漸近的に検索と実行という 2 つの原初的なツールのみを必要とするようになり、拡張し続ける静的なツールメニューではなく、機能の動的な意味論的発見を求めるようになるだろうと主張しました。これは、巨大なモノリシックプロンプトから構成可能なハネスへと移行するより広範な動きを補完するものです。

ベンチマーク、効率性、およびオープンモデル経済学

コーディングエージェントのベンチマークにおいて、ついにハーンネスとモデルのペアを測定する段階に至りました：Artificial Analysis が SWE-Bench-Pro-Hard-AA、Terminal-Bench v2、SWE-Atlas-QnA にわたる「Coding Agent Index」を発表し、単なるモデルだけでなく、モデルとハーンネス（注：評価環境・実行基盤）の組み合わせを比較しています。その主要な結果は、Cursor CLI での Opus 4.7 がスコア 61 を記録し、GPT-5.5 は Codex/Claude Code でこれに続いています。トップクラスのオープンウェイト設定には GLM-5.1、Kimi K2.6、DeepSeek V4 Pro（Claude Code 環境）が含まれており、依然として競争力がありますが、明確な差があります。また、このベンチマークは、タスクあたりのコスト（30 倍以上のばらつき）、トークン使用量（3 倍以上）、キャッシュヒット率（80–96%）、タスクあたりの処理時間（7 倍以上）における大きな変動も浮き彫りにしました。このベンチマークには、OpenHands のソフトウェアエンジニアリング向けベンチマーク更新発表（ツイート）と、オフィス業務、財務、ターミナル、Web タスクにわたるよりエージェント的なタスクミックスを持つ Claw-Eval が補完しています。Claw-Eval では MiMo-V2.5-Pro が首位を走り、DeepSeek V4 Flash はそのサイズに対して異例の効率性を示しました。

TurboQuant に対する懐疑論が高まっています：複数の投稿が、最近流行している量子化/サービング技術（注：モデル軽量化・推論高速化手法）について、より冷静な見解を示しています。@_EldarKurtic は、TurboQuant の精度、レイテンシ、スループットを網羅した初の包括的研究とされるものを提示し、@vllm_project は Red Hat と vLLM による調査をその出発点としてリンクしました。また、@jbhuang0604 は結論を「実際にはあまりうまく機能しない」と率直に要約しています。これはまさに、独立した再現が重要となるインフラに関する主張の典型例です。

ローカル/オープンモデルは、ハードウェアの限界を超えてハードウェアの制約よりも速く進化し続けています：@ClementDelangue はここで最も強力な高レベルの論拠を示しました。同じトップエンドMacBook Proのメモリ上限において、「実際に実行可能な最も賢いオープンウェイトモデル」は、Llama 3 70B時代の能力から、DeepSeek V4 Flash mixed-Q2 GGUF時代の能力へと、約24ヶ月で約4.7倍向上し、これは10.7ヶ月ごとに倍増するペースであり、ムーアの法則よりも速いです。GGUFアップロードの急成長に関する@victormustarのデータや、Qwen 3.6、Gemma 4、DeepSeek派生モデルが現在、非自明なエージェントタスクのためにローカルで利用可能であるという繰り返されるコミュニティの観察が、これを裏付けています。

研究ハイライト：MoEモジュラリティ、拡散/バイトモデル、およびエージェントダイナミクス

アーキテクチャと評価：AllenAIのEMOは、@TheTuringPostによって、ドキュメントレベルのルーティングが共有エキスパートプールを誘発する、よりモジュラーなMixture-of-Experts（専門家混合）設計として注目されました。特筆すべきは、同様のプルーニング条件下で標準的なMoEでは10〜15%の性能低下があるのに対し、EMOではエキスパートの25%のみを保持しても約1%の性能低下にとどまると報告されている点です（フォローアップ）。生成評価においては、@qberthetがFIDに代わるより高速でサンプル効率的な代替案として、MIND（Monge Inception Distance）を紹介しました。

Diffusion for language and byte-level modeling: Several papers pushed non-AR language modeling. @LucaAmb reported continuous bitstream diffusion nearly matching autoregressive models under their evaluation setup; @JulieKallini introduced Fast BLT, using diffusion for parallel byte decoding to make byte-level LMs less inference-bound; @sriniiyer88 framed it as combining block byte-diffusion with self-speculative decoding. Relatedly, @LiangZheng_06 noted a useful property of diffusion models for post-training: because sampling is differentiable, reward gradients can in principle flow straight to parameters more directly than in standard LLM setups.

Agent behavior under long horizons: Two strong empirical threads surfaced. First, "The Memory Curse" claims long histories degrade cooperation in multi-round social dilemmas because models become more history-following and risk-minimizing, with explicit CoT sometimes amplifying the problem. Second, PwC work summarized by @dair_ai argues that the value of clarification is highly time-dependent: goal clarification loses most of its value after ~10% of execution, while input clarification remains useful longer. Together these suggest that long-horizon agent quality is constrained as much by memory/control policy as by raw model IQ.

スケーリングと自己改善：WilliamBarrHeld 氏によって要約された Marin の Delphi スケーリング作業では、小規模な事前学習モデルから 250 億/6,000 億トークンの実行へと外挿する際に予測誤差が 0.2% であると主張されています。一方、omarsar0 氏は AutoTTS を紹介しました。これは LLM がテスト時のスケーリング制御空間自体を検索する仕組みで、発見コスト約 39.9 ドルで手動設計された戦略を上回る成果を報告しています。

エンゲージメント上位のツイート

OpenAI の企業向け・サービス展開：OpenAI が Deployment Company を設立し Tomoro を買収。150 名の FDE（フルタイムエンジニア）を採用。

OpenAI のセキュリティ製品化：Daybreak の発表と @sama による枠組みの提示。

Thinking Machines の相互作用モデル：Mira Murati 氏のローンチツイートおよび技術プレビュースレッド。

Artificial Analysis Coding Agent Index：ベンチマークの立ち上げと主要な発見結果。

エージェントツールリング/開発者ワークフロー：どのモデルでも使用可能な Hermes Agent のコンピュータ操作機能、Microsoft Teams 内の Cursor、Codex OpenAI Developers プラグイン。

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 ローカル推論の進展

Unsloth における MTP（Activity: 620）：画像（リンク）は、Unsloth の Hugging Face プロフィールを示しており、新たに公開された MTP を維持する GGUF ビルドとして unsloth/Qwen3.6-27B-GGUF-MTP と unsloth/Qwen3.6-35B-A3B-GGUF-MTP がリストされています。この投稿の技術的な意義は、これらの GGUF ファイルが MTP / 次トークン予測層を保持している点ですが、ユーザーは標準的な llama.cpp のサポートに頼るのではなく、特定の llama.cpp MTP プルリクエスト（PR）をビルドする必要があります。あるコメントでは、27B GGUF でランタイム/アサート失敗が発生したと報告されています：GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0")。これは、メタデータの解析、モデル変換、または PR の互換性に関する未解決の問題が残っている可能性を示唆しています。コメントには、upstream llama.cpp の MTP サポートへの期待が反映されており、ユーザーは GitHub リポジトリを繰り返し確認し、「箱から出してすぐに」MTP がサポートされているかどうかを尋ねています。

ある技術スレッドでは、GGUF における MTP（Multi-Token Prediction）サポートがローカル推論において重要であると指摘されており、特に 35B A3B バリアントについては、文脈長処理の改善に関連しているというコメントがあります。別の投稿者は、これが llama.cpp が今や「そのまま」MTP をサポートすることを意味するのかと問いかけ、そのサポートがマージされて安定版として利用可能なのか、それとも PR やフォークでのみ利用可能な状態なのかについて不確実性を示唆しています。

ある投稿者は、ik_llama の MTP 機能が現在、llama.cpp の PR よりも高速であると主張し、さらに Hadamard ベースの量子化（quants）をサポートしていると付け加えています。これは「turboquants」に似たものとして説明されており、ローカル MTP 推論バックエンドを比較するユーザーにとって潜在的に関連性の高い実装・パフォーマンス上の区別となります。

⟦CODE_0⟧

原文を表示

a quiet day.

AI News for 5/9/2026-5/11/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Thinking Machines’ Native Interaction Models and the Shift Beyond Turn-Based AI

Full-duplex multimodal interaction as a first-class model capability: The day’s clearest technical theme was Thinking Machines’ preview of “interaction models”, described as models trained from scratch for real-time interaction rather than layering speech, turn-taking, and tool use onto a turn-based LLM. The accompanying technical post and team commentary from @johnschulman2, @soumithchintala, and @cHHillee frame this as a human↔AI bandwidth problem: models should be able to listen, speak, watch, think, search, and react concurrently. Demos emphasized continuous-time awareness, interruption handling, simultaneous speech, visual proactivity, and background tool use without explicit “now I’m thinking / now I’m searching” boundaries. Team members also highlighted that many tasks that previously needed special-purpose systems become zero-shot once the type signature is effectively continuous audio+video+text → audio+text (@johnschulman2).

Why it matters technically: Several reactions converged on the same point: this is not “another chatbot demo” but a change in interface assumptions. @liliyu_lili pointed to visual proactivity (“tell me when I start slouching”, “count my pushups”) as a missing primitive in current systems; @rown called it the first general video+speech model that is visually proactive; @kimmonismus and @giffmana both emphasized that native interactivity is the deeper innovation than raw benchmark claims. This launch also implicitly raises the bar for “realtime” multimodal systems, as noted by @swyx. One implementation detail surfaced via @eliebakouch: the stack is using SGLang.

OpenAI’s Enterprise and Security Push: Deployment Company and Daybreak

OpenAI is moving down-stack into services and deployment: OpenAI announced the OpenAI Deployment Company, a majority-owned unit built to help enterprises deploy frontier models into real workflows. The key operating detail is 150 Forward Deployed Engineers and Deployment Specialists coming in via the acquisition of Tomoro, with @gdb citing $4B of initial investment from 19 partners. Multiple observers read this as OpenAI adopting a Palantir-/Microsoft-style field-engineering model: @kimmonismus argued OpenAI wants to own the deployment layer of the AI economy, while @matvelloso connected it to the historical enterprise success pattern of embedding technical staff close to customer operations.

Daybreak: security-specific model distribution, workflow, and trust tiers: OpenAI also launched Daybreak, an umbrella effort around defensive cyber operations and continuously securing software, with @sama positioning it as a practical response to rapidly improving AI cyber capability. The product pitch, summarized by @TheRundownAI, combines GPT-5.5, Codex, repository threat modeling, vuln discovery, patch generation, and response automation, with differentiated access tiers including Trusted Access for Cyber and a more specialized GPT-5.5-Cyber. This stands in contrast to Anthropic’s more restrictive cyber posture, a tension captured by @kimmonismus. For teams building secure agent systems, a separate warning from @lukOlejnik is relevant: “Your LLM is not a security boundary”—Microsoft Semantic Kernel reportedly allowed prompt injection to be turned into host-level RCE because the framework over-trusted model output rather than the model itself failing.

Agent Harnesses, Local-First Tooling, and Control Surfaces

Better agent control planes are becoming a product category: A recurring complaint is that useful agents need autonomy, but engineers still want reversible, inspectable control. @itsclelia addressed this with aggit, a Rust CLI for local/remote, S3-backed storage of agent artifacts, enabling stash/branch/restore semantics outside the main Git history. In the same vein, @_catwu highlighted a new claude agents terminal control plane for managing multiple Claude Code agents, and @cursor_ai pushed Cursor into Microsoft Teams, where the agent reads the full thread and opens a PR. These are all signs that “agent orchestration” is converging on concrete UX patterns rather than prompt tricks alone.

Deep Agents / Hermes / local agents are maturing quickly: @masondrxy noted that Deep Agents CLI can hot-swap underlying model providers mid-conversation without losing context, a nontrivial systems capability that many agent stacks still miss. LangChain also highlighted harness profiles for provider/model-specific tuning (tweet), and separate pricing analysis from the same author argued that DeepSeek V4 Flash can be dramatically cheaper than GPT/Gemini flash-tier options for high-volume agent workloads (tweet). On the local side, Hugging Face added Hermes Agent support in local apps plus native trace visualization, while @Teknium previewed computer use with any model via Hermes Agent and CUA, explicitly targeting local/open models as well as frontier APIs. @onusoz joining Hugging Face to improve local models in OpenClaw and related open harnesses is another strong signal that local agent ergonomics are now strategic infrastructure.

A design thesis emerging around tools: @threepointone argued that agents may asymptotically want just two primitive tools: search and execute, with dynamic semantic discovery of capabilities rather than ever-expanding static tool menus. That complements the broader move toward configurable harnesses instead of giant monolithic prompts.

Benchmarks, Efficiency, and Open-Model Economics

Coding-agent benchmarking is finally measuring harness+model pairs: Artificial Analysis launched a Coding Agent Index spanning SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA, comparing not just models but model+harness combinations. Their topline: Opus 4.7 in Cursor CLI scored 61, with GPT-5.5 in Codex/Claude Code close behind; top open-weight setups included GLM-5.1, Kimi K2.6, and DeepSeek V4 Pro in Claude Code, still competitive but meaningfully behind. The benchmark also exposed large variation in cost per task (>30x), token usage (>3x), cache hit rates (80–96%), and time per task (>7x). That benchmark was complemented by OpenHands’ updated software-engineering benchmark announcement (tweet) and Claw-Eval’s more agentic task mix across office, finance, terminal, and web tasks, where MiMo-V2.5-Pro led and DeepSeek V4 Flash looked unusually efficient for its size.

TurboQuant skepticism is increasing: Multiple posts pointed to a more sober view of the recently popular quantization/serving technique. @_EldarKurtic presented what he described as the first comprehensive study of TurboQuant, covering accuracy, latency, and throughput; @vllm_project linked the Red Hat / vLLM investigation as a starting point; and @jbhuang0604 bluntly summarized the takeaway as “it doesn’t really work well.” This is exactly the sort of infra claim where independent reproduction matters.

Local/open models continue to improve faster than hardware ceilings: @ClementDelangue made the strongest high-level argument here: on the same top-end MacBook Pro memory ceiling, the “smartest open-weight model you can actually run” improved from Llama 3 70B-era capability to DeepSeek V4 Flash mixed-Q2 GGUF-era capability at roughly 4.7x in 24 months, implying a doubling every 10.7 months, faster than Moore’s Law. Supporting datapoints came from @victormustar on the rapid growth of GGUF uploads and from repeated community observations that Qwen 3.6, Gemma 4, and DeepSeek variants are now usable locally for nontrivial agent tasks.

Research Highlights: MoE Modularity, Diffusion/Byte Models, and Agent Dynamics

Architectures and evaluation: AllenAI’s EMO was highlighted by @TheTuringPost as a more modular Mixture-of-Experts design where document-level routing induces shared expert pools; notably, keeping only 25% of experts reportedly costs just ~1% performance versus 10–15% degradation in standard MoEs under similar pruning (follow-up). On generative evaluation, @qberthet introduced MIND (Monge Inception Distance) as a purportedly faster, more sample-efficient replacement for FID.

Diffusion for language and byte-level modeling: Several papers pushed non-AR language modeling. @LucaAmb reported continuous bitstream diffusion nearly matching autoregressive models under their evaluation setup; @JulieKallini introduced Fast BLT, using diffusion for parallel byte decoding to make byte-level LMs less inference-bound; @sriniiyer88 framed it as combining block byte-diffusion with self-speculative decoding. Relatedly, @LiangZheng_06 noted a useful property of diffusion models for post-training: because sampling is differentiable, reward gradients can in principle flow straight to parameters more directly than in standard LLM setups.

Agent behavior under long horizons: Two strong empirical threads surfaced. First, “The Memory Curse” claims long histories degrade cooperation in multi-round social dilemmas because models become more history-following and risk-minimizing, with explicit CoT sometimes amplifying the problem. Second, PwC work summarized by @dair_ai argues that the value of clarification is highly time-dependent: goal clarification loses most of its value after ~10% of execution, while input clarification remains useful longer. Together these suggest that long-horizon agent quality is constrained as much by memory/control policy as by raw model IQ.

Scaling and self-improvement: Marin’s Delphi scaling work, summarized by @WilliamBarrHeld, claims a 0.2% prediction error when extrapolating from small pretrains to a 25B / 600B token run. Separately, @omarsar0 highlighted AutoTTS, where an LLM searches the test-time scaling controller space itself, reportedly beating hand-designed strategies for about $39.9 of discovery cost.

Top tweets (by engagement)

OpenAI’s enterprise/services move: OpenAI launches the Deployment Company and Tomoro acquisition / 150 FDEs.

OpenAI’s security productization: Daybreak announcement and @sama’s framing.

Thinking Machines’ interaction models: Mira Murati’s launch tweet and the technical preview thread.

Artificial Analysis Coding Agent Index: benchmark launch and topline findings.

Agent tooling / developer workflow: Hermes Agent computer use with any model, Cursor in Microsoft Teams, and Codex OpenAI Developers plugin.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 Local Inference Advances

MTP on Unsloth (Activity: 620): The image (link) shows Unsloth’s Hugging Face profile listing newly published MTP-preserving GGUF builds: unsloth/Qwen3.6-27B-GGUF-MTP and unsloth/Qwen3.6-35B-A3B-GGUF-MTP. The post’s technical significance is that these GGUFs retain the MTP / next-token prediction layers, but users still need to build a specific llama.cpp MTP PR rather than relying on standard llama.cpp support. One commenter reports a runtime/assertion failure with the 27B GGUF: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0"), suggesting either metadata parsing, model conversion, or PR compatibility issues remain unresolved. Comments reflect anticipation for upstream llama.cpp MTP support, with users repeatedly checking the GitHub repo and asking whether MTP is now supported “out of the box.”

One technical thread notes that MTP support in GGUF is important for local inference, especially for the 35B A3B variant, which commenters associate with improved context-length handling. Another commenter asks whether this means llama.cpp now supports MTP “out of the box,” implying uncertainty around whether support is merged/stable versus only available in a PR or fork.

A commenter claims ik_llama MTP is currently faster than the llama.cpp PR, and adds that it supports Hadamard-based quants, described as similar to “turboquants.” This is a potentially relevant implementation/performance distinction for users comparing local MTP inference backends.

この記事をシェア

KDnuggets重要度42026年6月25日 23:00

テキスト、画像、音声、動画を処理する 5 つのオープンソース・オムニ AI モデル

Vercel Blog重要度42026年6月25日 16:00

AI SDK 7 が利用可能に

MarkTechPost重要度42026年6月25日 05:00

Gradium、リアルタイム音声翻訳モデル「stt-translate」と「s2s-translate」を公開し、精度と遅延で競合を上回る

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

本日は特に目立った出来事なし

キーポイント

影響分析

編集コメント

AI ツイートリキャップ

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 ローカル推論の進展

AI Twitter Recap

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 Local Inference Advances

関連記事

本日は特に目立った出来事なし

キーポイント

影響分析

編集コメント

AI ツイートリキャップ

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 ローカル推論の進展

AI Twitter Recap

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 Local Inference Advances

関連記事