Smol AI News · May 13, 2026, 14:44 · about 18 min read

Not much happened today

#Agent Infrastructure #Observability #State Management #Open Source #LangChain

TL;DR

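Cline, LangChain, Notion, and Cursor accelerated the build-out of next-generation agent infrastructure, making clear a design shift from chat-centric interfaces toward long-running state management and orchestration.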

AI deep analysis · May 14, 2026, 10:03

Importance: / 5

Depth: 40%

Key points

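Deepening and open-sourcing of agent infrastructure

Cline rebuilt its SDK so that it serves as a reusable substrate for custom coding agents, and LangChain shipped a large batch of lifecycle infrastructure, including LangSmith Engine and SmithDB.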

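The arrival of a next-generation observability database

LangChain's SmithDB is a purpose-built database for nested, long-running traces with large payloads. It is built on Apache DataFusion and Vortex and reportedly delivers 12–15× faster access on key workloads.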

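A design shift in agent UX

Across the industry, the emphasis is moving from chat toward long-running state management, streaming, and orchestration. Duet Agent and LangChain's updates propose coordination and memory management for jobs that run for weeks or months.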

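Ecosystem integration and stronger security

Notion released an External Agents API that lets third-party agents share context inside Notion, and Cursor now provides fully configured cloud development environments that include cloned repositories and isolated secrets.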


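Impact analysis

This news marks an important turning point: AI agents have matured from experimental prototypes into robust infrastructure that can handle complex, long-running real-world tasks. In particular, the emergence of technical approaches to state management and observability during long-running execution (SmithDB and state-machine harnesses) helps remove a bottleneck to enterprise-level agent adoption.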

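Editorial comment

Despite the "not much happened" title, this issue shows the technical foundations of agent infrastructure being dramatically rebuilt. The arrival of an observability database designed for long-running tasks, in particular, is a decisive step toward reliability in real-world use.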

A quiet day.

AI News for 5/12/2026-5/13/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Agent Infrastructure, Harnesses, and Developer Platforms

Cline, LangChain, Notion, and Cursor all pushed deeper into agent platform territory: Cline open-sourced a rebuilt Cline SDK and refreshed its CLI with a TUI (terminal user interface), agent teams, scheduled jobs, and connectors, positioning its harness as a reusable substrate for custom coding agents. LangChain shipped a large batch of agent lifecycle infrastructure at Interrupt: LangSmith Engine, SmithDB, Sandboxes, Managed Deep Agents, LLM Gateway, Context Hub, and Deep Agents 0.6. The most technically notable piece is SmithDB, a purpose-built observability database for nested, long-running traces with large payloads, reportedly yielding 12–15× faster access on key workloads; the team says it is built atop Apache DataFusion and Vortex. In parallel, Notion's External Agents API lets third-party agents such as Claude, Codex, Cursor, Decagon, Warp, and Devin operate directly inside Notion as a shared, reviewable context layer rather than another silo. Cursor expanded cloud agents with fully configured development environments including cloned repos, dependencies, version history, rollback, scoped egress, and isolated secrets.

Agent UX is increasingly about long-running state, streaming, and orchestration rather than chat: Several launches converged on the same design direction. Duet Agent proposes a state-machine harness for jobs that last weeks or months, with parent/sub-agent coordination and memory replacing compaction. LangChain's OSS updates added streaming typed projections, checkpoint storage, a code interpreter, harness profiles, and model-specific tuning, all aimed at agent event streams that are richer than plain tokens. Tabracadabra moved from autocomplete to a context-aware assistant in any textbox, while VS Code introduced an Agents window and better multi-project task review. The architectural message across these releases is that production agents increasingly need durable execution, inspectable intermediate state, and tool-native UI surfaces rather than stateless prompt/response loops.
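To make "durable execution" and "inspectable intermediate state" concrete, here is a minimal sketch under stated assumptions. The three-step job, the `TRANSITIONS` table, and every function name are hypothetical; none of them come from Duet Agent, LangChain, or any other product named above. The job is an explicit state machine, and each transition is written to a JSON checkpoint so that a restarted process resumes where the previous one stopped.

```python
import json
import os
import tempfile

# Hypothetical job: each state maps to the state that follows it.
TRANSITIONS = {"plan": "execute", "execute": "review", "review": "done"}


def load_checkpoint(path):
    """Return the saved job state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"state": "plan", "history": []}


def save_checkpoint(path, job):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(job, f)
    os.replace(tmp, path)


def step(job):
    """Run one transition and record it in the inspectable history."""
    current = job["state"]
    job["history"].append(current)
    job["state"] = TRANSITIONS[current]
    return job


def run(path, max_steps=None):
    """Advance the job until it is done or max_steps transitions have run."""
    job = load_checkpoint(path)
    steps = 0
    while job["state"] != "done" and (max_steps is None or steps < max_steps):
        job = step(job)
        save_checkpoint(path, job)
        steps += 1
    return job
```

Calling `run(path, max_steps=1)` and then `run(path)` on the same file simulates a process that stops after one step and a second process that picks up from the checkpoint. A real harness would do actual work inside `step`; here the handlers are omitted to keep the control flow visible.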

Model Training, Architecture, and Data Efficiency

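Pretraining efficiency and architectural experimentation were the strongest research throughline: Nous Research's Token Superposition Training modifies the early phase of pretraining so the model reads and predicts contiguous bags of tokens before reverting to standard next-token prediction; they report a 2–3× wall-clock speedup at matched FLOPs with no inference-time architecture change, validated from 270M to 3B dense models and a 10B-A1B MoE. Jonas Geiping et al. argued that current message-based/chat training overly constrains agents to a single stream and released a multi-stream LLM paper claiming lower latency, cleaner separation of concerns, and more legible parallel reasoning and tool use; the paper and code are both available. δ-mem proposed an external online associative memory attached to a frozen full-attention backbone, with an 8×8 state reportedly improving average score by 1.10× and beating non-δ-mem baselines by 1.15×, with larger gains on memory-heavy benchmarks.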

Post-training/compression and data curation also produced notable results: NVIDIA's Star Elastic claims that one post-training run can derive a family of reasoning model sizes, at 360× lower cost than pretraining a family and 7× better than SOTA compression. Datology's VLM (vision-language model) work, highlighted by Siddharth Joshi and Pratyush Maini, argues that data curation alone can produce major multimodal gains: +11.7 points across 20 public VLM benchmarks at 2B, beating InternVL3.5-2B by roughly 10 points at about 17× less training compute, and near-frontier 4B performance with 3.3× lower response FLOPs than Qwen3-VL-4B. On the open data side, Percy Liang said the next Marin run already has 18T tokens in its mix and is still seeking more pretraining, mid-training, and SFT (supervised fine-tuning) data; a companion token viewer was also shared.

Open evaluation and dataset work is maturing alongside model building: Kevin Li's SWE-ZERO-12M-trajectories is positioned as the largest open agentic trace dataset: 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages. Victor Mustar flagged llama-eval as a step toward more comparable llama.cpp community evals. Meanwhile, Steve Rabinovich and Sayash Kapoor argued that credible agent evaluation requires log analysis, not outcome-only metrics, because stronger agents expose hidden benchmark bugs and reward-hacking paths.

Enterprise AI Pricing, Platform Competition, and Distribution

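Anthropic vs OpenAI competition sharpened around enterprise distribution and developer lock-in: Ramp data cited by Andrew Curran showed Anthropic at 34.4% of businesses vs OpenAI at 32.3% in April, the first apparent lead change in business adoption; The Rundown amplified the same figures. At the same time, Anthropic changed plan economics: ClaudeDevs announced that paid Claude plans will get a dedicated monthly credit for programmatic usage across the Agent SDK, claude -p, GitHub Actions, and third-party SDK apps. Power users immediately read this as a major restriction on subscription-subsidized harnesses, with criticism from Theo, Jeremy Howard, Matt Pocock, and Omar Sanseviero. Anthropic partially offset that backlash with a separate 50% increase in Claude Code weekly limits through July 13, stacked on the previously announced 2× increase to the 5-hour limit.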

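OpenAI responded aggressively with Codex enterprise incentives: OpenAI Devs and Sam Altman offered two months of free Codex usage for enterprise customers switching in the next 30 days. OpenAI also published more technical platform detail, including a Windows sandbox design write-up describing the combination of local users, firewall rules, ACLs (access control lists), write-restricted tokens, DPAPI, and helper executables needed to safely run coding agents with local filesystem and tool access. The competitive dynamic now looks less like "best model wins" and more like subsidy + workflow control + harness compatibility.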

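Enterprise adoption is increasingly tied to runtime/security assurances: Perplexity described a hardware-isolated sandbox architecture with VPC-level separation, short-lived proxy tokens, and scanning of external content before agent actions, with additional details on encryption and auto-deletion. Aravind Srinivas framed this as foundational to Perplexity becoming an enterprise knowledge/research platform. The broader pattern: agent vendors are no longer selling only intelligence; they are selling bounded execution environments.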

Autonomous Science, Cyber Capability, and Robotics

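Recursive self-improvement moved from idea to startup cluster: The largest single meta-theme was the launch of Recursive, founded to build AI that automates science and safely improves itself. Launch posts from Richard Socher, Josh Tobin, Dominik Schmidt, Jenny Zhang, and Shengran Hu suggest a team drawn from open-endedness, AI Scientist, and research automation work. In adjacent work, Adaption's AutoScientist aims to automate the full training-research loop outside frontier labs, with Sarah Hooker arguing that most model training failures are due to research-loop brittleness rather than mere compute scarcity.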

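Cyber capability evaluations continue to steepen: The UK AI Security Institute (AISI) said the length of cyber tasks that frontier models can complete has been doubling every few months, and that recent models are beating prior trends. Anthropic/Glasswing's Logan Graham said Claude Mythos Preview is the first model to solve both AISI end-to-end cyber ranges, including Cooling Tower, and the only one to clear every task under the institute's 2.5M-token cap. XBOW reportedly found "token-for-token, unprecedented precision," and partner usage allegedly surfaced thousands of high/critical vulnerabilities in weeks. Independent commentary from scaling01 claimed a newer Mythos version completed a cyber range 6/10 times vs 3/10 for the preview baseline.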

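Robotics got a concrete long-horizon deployment demo: Figure's Brett Adcock streamed humanoid robots running a full 8-hour autonomous shift on package sorting using Helix-02, with follow-up details that the robots reason from camera pixels, operate around human parity (~3s/package), perform on-device inference, coordinate as a networked fleet, autonomously swap out when the battery runs low, and self-diagnose and fail over to maintenance when needed. This is one of the clearer public demonstrations of multi-robot, long-duration, no-human-in-the-loop orchestration rather than a short benchmark clip.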

Top tweets (by engagement)

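Claude Code pricing and limits: @ClaudeDevs on 50% higher weekly limits, @ClaudeDevs on programmatic credits, and the ensuing developer backlash from @theo made pricing policy the day's most consequential developer story.

Codex enterprise push: @sama offering two free months of Codex usage for switchers and @OpenAIDevs' enterprise call-to-action signaled an unusually direct go-to-market counterpunch.

Figure's 8-hour humanoid shift: @adcock_brett's livestream post drew enormous attention and is one of the few viral posts in the set with clear technical substance.

Cline SDK launch: @cline's SDK release was one of the highest-engagement genuinely technical launches, reflecting demand for open coding-agent harnesses.

Token Superposition Training: @NousResearch's TST post stood out as a rare pretraining-method tweet that broke through widely, likely because the claim (a 2–3× training speedup without changing inference-time architecture) is concrete and economically important.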

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Efficient On-Device LLM Inference

Needle: We Distilled Gemini Tool Calling Into a 26M Model (Activity: 451): Cactus Compute open-sourced Needle, a 26M-parameter single-shot function/tool-calling model using a "Simple Attention Network" architecture (attention plus gating, with no FFNs/MLPs), arguing that tool use is mainly retrieval, slot extraction, and JSON assembly rather than deep reasoning. It was pretrained on 200B tokens over 16 TPU v6e in 27 hours and post-trained on 2B Gemini-synthesized function-calling tokens in 45 minutes. It claims 6000 tok/s prefill and 1200 tok/s decode on consumer devices, and reportedly beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling. Code and weights are MIT-licensed on GitHub and Hugging Face, with architecture notes in the SAN writeup. Commenters framed Needle as potentially useful as a lightweight router that selects tools or dispatches queries, with the right parameters, to larger LLMs, while questioning whether the same no-FFN/cross-attention approach could generalize to summarization. One technical caution noted that the repository apparently includes Python pickle files, which are discouraged due to code-execution/security risks and Python-specific portability issues.

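Several commenters focused on the architectural implication of a 26M distilled tool-calling model as a lightweight router: it could classify and route requests to the appropriate larger LLM, tool, or RAG pipeline with the right parameters, rather than generating full answers itself. One suggested this could be extended into a small post-trained model that consumes structured RAG output and verbalizes it in natural language.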

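A technical point was raised around the claimed "no FFN" result: if external structured knowledge is always supplied via tools/RAG/retrieval, the model may not need FFN layers to store factual knowledge in weights. This implies a possible design pattern where compact attention-heavy models specialize in orchestration or grounding over provided context instead of memorization.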

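One commenter noted that publishing pickle files is increasingly uncommon because of Python-specific dependency coupling and arbitrary-code-execution risks during deserialization. Another highlighted that Gemini has had visible tool-calling quirks, including system-prompt-level patches around tool specificity and avoiding inefficient file operations like cat in favor of dedicated tools such as grep_search, which could matter if Gemini-generated traces were used as distillation data.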
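The deserialization risk the commenter describes is a general property of Python's `pickle` module, and a short sketch shows why. The `Payload` class and the `restricted_loads` helper below are illustrative names, and nothing here reflects what the Needle repository actually ships. The first half shows that loading a pickle can call an arbitrary function; the second half applies the "restricting globals" pattern from the `pickle` documentation by overriding `Unpickler.find_class`.

```python
import io
import pickle


class Payload:
    """An object whose pickle stream asks the loader to call eval()."""

    def __reduce__(self):
        # On load, pickle calls eval("40 + 2"); an attacker could put any expression here.
        return (eval, ("40 + 2",))


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses every global lookup, so no callable can be resolved."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def restricted_loads(data):
    """Load a pickle stream that may contain only plain built-in containers and scalars."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()


malicious = pickle.dumps(Payload())
print(pickle.loads(malicious))  # prints the result of eval, not a Payload instance

print(restricted_loads(pickle.dumps({"weights": [0.1, 0.2]})))  # plain data still loads
try:
    restricted_loads(malicious)
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```

The restricted loader only helps when the file holds plain data. Model checkpoints that pickle custom classes need those classes allowed explicitly, which is one reason weight-only file formats are often preferred for distribution.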

I got a real transformer language model running locally on a stock Game Boy Color! (Activity: 1326): The image shows a stock Game Boy Color running a local transformer demo labeled TINYSTORIES Q8 GBC, matching the post's claim that Andrej Karpathy's TinyStories-260K was converted to INT8/fixed-point and executed directly on-device without a PC, Wi-Fi, link cable, or cloud inference. The project uses GBDK-2020, an MBC5 Game Boy ROM, bank-switched cartridge ROM for weights, cartridge SRAM for the KV cache, and on-device tokenization and prompt entry. The author notes that generation is *extremely slow* and mostly gibberish due to heavy quantization and approximation, but the transformer prefill and autoregressive loop work. Source code: github.com/maddiedreese/gbc-transformer. Comments are mostly impressed rather than technical, framing it as an impractical but compelling proof of concept (e.g. *"Pointless. Therefore, indispensable."*) and expressing interest in porting similar experiments to other platforms.
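The post says the weights were converted to INT8/fixed-point but the digest does not describe the exact scheme, so the following is only a generic sketch of symmetric per-tensor INT8 quantization, not the project's method. The function names and the sample weights are hypothetical. It illustrates where the approximation error mentioned by the author comes from: every weight is rounded to one of 255 integer levels that share a single scale.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]  # hypothetical weights, not taken from the project
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(q, scale, max(errors))
```

With this scheme the per-weight error is bounded by half the scale, so a tensor with a few large outliers loses precision on all of its small weights. That is one plausible contributor to degraded output in very small models, though the post attributes the gibberish to quantization and approximation in general.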


この記事をシェア

MarkTechPost重要度42026年6月30日 17:13

Meta AI、非侵襲型脳からテキストへ変換する「Brain2Qwerty v2」を公開し、タイピング中の文章を61%の単語精度で復元

AWS Machine Learning Blog重要度42026年6月30日 02:25

Amazon Bedrock AgentCore Observability を用いたプロダクションエージェントのデバッグ

LangChain Blog重要度42026年6月30日 01:17

Deep Agents に動的サブエージェントを導入

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Smol AI News·2026年5月13日 14:44·約18分

本日は特に目立った出来事なし

#Agent Infrastructure #Observability #State Management #Open Source #LangChain

TL;DR

AI深層分析2026年5月14日 10:03

重要/ 5段階

深度40%

キーポイント

エージェント基盤の深化とオープンソース化

次世代観測データベースの登場

エージェント UX の設計転換

エコシステム統合とセキュリティ強化

影響分析・編集コメントを表示

影響分析

編集コメント

静かな一日でした。

AI ツイートリキャップ

エージェントインフラ、ハーネス（Harnesses）、および開発者プラットフォーム

Cline、LangChain、Notion、Cursor はすべてエージェントプラットフォームの領域にさらに深く進出しました：Cline は再構築された Cline SDK をオープンソース化し、TUI（ターミナルユーザーインターフェース）を備えた CLI を刷新して、エージェントチーム、スケジュールジョブ、コネクタ機能を追加し、そのハルネスをカスタムコーディングエージェントのための再利用可能な基盤として位置づけました。LangChain は Interrupt でエージェントライフサイクルインフラストラクチャの大量リリースを行いました：LangSmith Engine、SmithDB、サンドボックス、マネージドディープエージェント、LLM ゲートウェイ、コンテキストハブ、そして Deep Agents 0.6 です。技術的に最も注目すべき点は SmithDB で、これはネストされた長時間実行トレースや大規模ペイロードを対象とした専用観測性データベースであり、主要なワークロードで 12～15 倍の高速アクセスを実現したと報告されています。チームによると、これは Apache DataFusion と Vortex を基盤に構築されています。並行して、Notion の External Agents API により、Claude、Codex、Cursor、Decagon、Warp、Devin などのサードパーティ製エージェントが、別のサイロとしてではなく、共有かつレビュー可能なコンテキストレイヤーとして Notion 内で直接動作できるようになりました。Cursor はクラウドエージェントを拡張し、クローンされたリポジトリ、依存関係、バージョン履歴、ロールバック機能、スコープ付きの外部通信制限（egress）、隔離されたシークレットを含む完全に構成された開発環境を提供しています。

エージェント UX は、チャットよりもむしろ長期間実行される状態の管理、ストリーミング、オーケストレーション increasingly 重要になっています：複数のリリースが同じデザイン方向に収束しました。Duet Agent は数週間から数ヶ月続くジョブのためのステートマシンハネスを提案し、親/子エージェント間の調整とメモリ管理により、従来の圧縮（compaction）に代わるアプローチを採用しています。LangChain のオープンソースアップデートでは、ストリーミング型投影（typed projections）、チェックポイント保存機能、コードインタプリタ、ハネスプロファイル、モデル固有のチューニングが追加され、単なるトークンではなくより豊かなエージェントイベントストリームを実現することを目指しています。Tabracadabra は補完機能からあらゆるテキストボックスで文脈を認識するアシスタントへと進化し、VS Code では「Agents」ウィンドウが導入され、複数プロジェクトにわたるタスクレビュー機能が強化されました。これらのリリース全体を通じて示されるアーキテクチャ上のメッセージは、本番環境のエージェントには、状態レスなプロンプト/レスポンスループではなく、永続的な実行機能、検証可能な中間状態、そしてツールネイティブな UI サフェースがますます必要となっているということです。

モデルトレーニング、アーキテクチャ、およびデータ効率性

プリトレーニングの効率化とアーキテクチャの実験が最も強い研究の継続線でした：Nous Research の Token Superposition Training は、プリトレーニングの初期段階を修正し、モデルが標準的な次トークン予測に戻る前に連続するトークンのバッチを読み取り・予測できるようにします。彼らは、FLOPs を同等に保ちながら推論時のアーキテクチャ変更なしで 2～3 倍の壁時計速度向上を報告しており、これは 270M から 3B の密集型モデルおよび 10B-A1B の MoE モデルで検証されています。Jonas Geiping らは、現在のメッセージベース/チャットトレーニングがエージェントを単一のストリームに過度に制約していると主張し、低レイテンシ、関心の明確な分離、より読みやすい並列推論・ツール使用を実現するマルチストリーム LLM の論文を発表しました。論文とコードはこちらでリンクされています。δ-mem は、凍結されたフルアテンションバックボーンに外部オンライン連想メモリを接続する提案を行い、8×8 の状態が平均スコアを 1.10 倍向上させ、非δ-mem ベースラインを 1.15 倍上回ると報告されています。また、メモリー依存度の高いベンチマークではより大きな改善が見られます。

学習後・圧縮およびデータキュレーションにおいても注目すべき成果が生まれています：NVIDIA の Star Elastic は、1 回の学習後実行で推論モデルのサイズファミリーを導出可能であり、そのコストは事前学習によるファミリー構築の 360 分の 1 で、SOTA（State-of-the-Art）圧縮手法よりも 7 倍優れています。Siddharth Joshi と Pratyush Maini が紹介した Datology の VLM（Vision-Language Model：視覚言語モデル）研究では、データキュレーション単体でも主要なマルチモーダル性能の向上が可能であると主張しています。具体的には、2B パラメータ規模で 20 の公開 VLM ベンチマーク全体で +11.7 ポイントの向上を達成し、InternVL3.5-2B を約 10 ポイント上回っています。これは学習計算量が約 17 分の 1 で実現されたものであり、Qwen3-VL-4B と比較して応答 FLOPs（Floating Point Operations：浮動小数点演算）が 3.3 倍少ないにもかかわらず、最前線に近い 4B パラメータ規模の性能を達成しています。オープンデータ側では、Percy Liang が次の Marin ランにはすでに 18T トークンが含まれており、さらに事前学習・中間学習・SFT（Supervised Fine-Tuning：教師あり微調整）データの追加を模索中だと述べました。また、対応するトークンビューアーもここで共有されています。

モデル構築と並行して、オープンな評価およびデータセットの取り組みも成熟しています：Kevin Li の SWE-ZERO-12M-trajectories は、最大のオープンエージェント追跡データセットとして位置づけられています。これは 112B トークン、1200 万本のトラジェクトリ（行動経路）、122K の PR（Pull Request：プルリクエスト）、3K のリポジトリ、16 の言語をカバーしています。Victor Mustar は llama-eval を、より比較可能な llama.cpp コミュニティ評価への一歩として指摘しました。一方、Steve Rabinovich と Sayash Kapoor は、信頼できるエージェント評価にはログ分析が必要であり、結果のみに基づく指標では不十分だと主張しました。その理由は、強力なエージェントほど隠されたベンチマークのバグや報酬ハッキングの経路を露呈させるからです。

エンタープライズ AI の価格設定、プラットフォーム競争、および流通

Anthropic と OpenAI の競争は、エンタープライズ向け展開と開発者ロックインの観点から激化しています：Andrew Curran が引用した Ramp のデータによると、4 月の企業採用において Anthropic は 34.4%、OpenAI は 32.3% で、企業導入における最初の明確な首位交代が見られました。The Rundown も同様の数値を報じました。同時に Anthropic はプラン経済モデルを変更しました：ClaudeDevs は、プログラム利用（Agent SDK、claude-p、GitHub Actions、サードパーティ製 SDK アプリ全体）に対して有料の Claude プランに専用月次クレジットが付与されると発表しました。これは直ちにパワーユーザーによってサブスクリプションで補助されたハッチ（harnesses）に対する重大な制限と解釈され、Theo、Jeremy Howard、Matt Pocock、Omar Sanseviero からの批判を招きました。Anthropic はこれに対し、7 月 13 日までの Claude Code の週間利用制限を 50% 引き上げる措置で部分的に反発を和らげました。これは以前発表された 2 倍の 5 時間利用制限拡大の上に重ねて実施されました。

OpenAI は Codex エンタープライズ向けインセンティブで激しく対抗しました：OpenAI Devs と Sam Altman は、今後 30 日以内に切り替えるエンタープライズ顧客に対し、Codex の 2 ヶ月無料利用を提供すると発表しました。また OpenAI は技術プラットフォームの詳細も公開し、ローカルユーザー、ファイアウォールルール、ACL（アクセス制御リスト）、書き込み制限トークン、DPAPI、およびローカルファイルシステムやツールへのアクセスを持つコーディングエージェントを安全に実行するために必要なヘルパー実行ファイルの組み合わせを記述した Windows サンドボックス設計の解説文書を公開しました。現在の競争動態は「最良モデルが勝つ」という図式よりも、「補助金＋ワークフロー制御＋ハッチ互換性」の構図のように見えます。

エンタープライズ採用は、ランタイム/セキュリティの保証とますます密接に結びついています：Perplexity は、ハードウェア隔離サンドボックスアーキテクチャ、VPC レベルでの分離、エージェントアクション前の外部コンテンツのスキャン、および暗号化と自動削除に関する追加詳細を説明しました。Aravind Srinivas はこれを、Perplexity がエンタープライズ向け知識/研究プラットフォームとなるための基盤として位置づけました。より広範なパターンとして、エージェントベンダーはもはや知能だけを販売しているのではなく、制限された実行環境を販売するようになっています。

自律型科学、サイバー能力、およびロボティクス

再帰的自己改善は、アイデアからスタートアップクラスターへと移行しました：最も大きな単一のメタテーマは、科学の自動化と安全な自己改善を行う AI を構築するために設立された Recursive の立ち上げです。Richard Socher、Josh Tobin、Dominik Schmidt、Jenny Zhang、Shengran Hu からの立ち上げ投稿は、オープンエンドネス、AI Scientist、および研究自動化の取り組みからチームが構成されていることを示唆しています。隣接する取り組みとして、Adaption の AutoScientist は、フロンティアラボの外で完全なトレーニング・研究ループを自動化することを目指しており、Sarah Hooker は、モデルトレーニングの失敗の多くは計算資源の不足というよりも、研究ループの脆さによるものだと主張しています。

サイバー能力評価はさらに厳格化している：英国 AI セキュリティ研究所（AISI）によると、最先端モデルが完了できるサイバータスクの長さは数ヶ月ごとに倍増しており、最近のモデルは過去の傾向を上回っているという。Anthropic/Glasswing の Logan Graham 氏は、Claude Mythos Preview が Cooling Tower を含む AISI のエンドツーエンド型サイバーレンジを両方とも解決した初のモデルであり、同研究所が設定する 250 万トークンの制限内で全タスクをクリアした唯一のモデルであると述べた。XBOW は「トークン単位で前例のない精度」を示し、パートナーによる利用では数週間で数千件の高/重大な脆弱性が発見されたと報じられている。scaling01 からの独立したコメントでは、新しい Mythos バージョンがサイバーレンジを 10 回中 6 回完了したのに対し、プレビューベースラインは 3 回中 1 回のみであったとされている。

ロボティクス分野で具体的な長期展開デモが実現：Figure の Brett Adcock 氏は、Helix-02 を用いたパッケージ仕分けにおいて、人型ロボットが 8 時間にわたる完全自律シフトを実行する様子をストリーミング配信した。追加情報として、これらのロボットはカメラの画素から推論を行い、人間と同等の速度（約 3 秒/個）で動作し、オンデバイスでの推論を実行し、ネットワーク化されたファームとして協調し、バッテリー残量が低下すると自律的に交換し、必要に応じて自己診断を行いメンテナンスへフェイルオーバーするものである。これは、単なる短いベンチマーククリップではなく、複数ロボットによる長期間の無人オーケストレーションを明確に示した公的なデモの一つである。

エンゲージメント上位ツイート

Claude Code の価格設定と制限：@ClaudeDevs が週間の利用限度を 50% 引き上げたと発表し、またプログラムによるクレジット利用についても言及。これに対し @theo から開発者コミュニティに強い反発が巻き起こり、結果として価格政策が当日の最も重要な開発者関連ニュースとなりました。

Codex の企業向け推進：@sama が切り替えユーザーに対して Codex の利用を 2 ヶ月無料提供する方針を示し、@OpenAIDevs も企業向けの呼びかけを行いました。これは極めて直接的な市場参入戦略への対抗措置として注目されました。

Figure の 8 時間連続ヒューマノイド稼働：@adcock_brett が配信したライブストリーミング投稿は大きな注目を集め、明確な技術的実質性を伴う数少ないバイラル投稿の一つとなりました。

Cline SDK のリリース：@cline による SDK の公開は、オープンソースのコーディングエージェント基盤に対する需要を反映し、真に技術的な内容を持つリリースの中で最も高いエンゲージメントを獲得しました。

Token Superposition Training（トークン重畳学習）：@NousResearch が投稿した TST は、推論時のアーキテクチャを変更せずにトレーニング速度を 2～3 倍向上させるという具体的な主張が経済的に重要であるため、広く認知された稀な事前学習手法に関するツイートとして際立っていました。

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Efficient On-Device LLM Inference（効率的なオンデバイス大規模言語モデル推論）

Needle: Gemini のツール呼び出しを 26M パラメータモデルに凝縮 (アクティビティ：451): Cactus Compute が、"Simple Attention Network" アーキテクチャ（FFN/MLP を使わないアテンションとゲート制御の組み合わせ）を採用した、26M パラメータのシングルショット関数/ツール呼び出しモデル「Needle」をオープンソース化しました。同社は、ツールの使用は深い推論ではなく主に検索・スロット抽出・JSON 組み立てであると主張しています。このモデルは 16 基の TPU v6e で 27 時間かけて 200B トークンで事前学習され、Gemini が合成した 2B トークンの関数呼び出しデータで 45 分間ポストトレーニングされました。コンシューマーデバイスにおいてプリフィルは 6000 トークン/秒、デコードは 1200 トークン/秒を達成できると claiming しており、シングルショット関数呼び出しにおいては FunctionGemma-270M、Qwen-0.6B、Granite-350M、LFM2.5-350M を上回ったと報告されています。コードと重みは MIT ライセンスで GitHub および Hugging Face で公開されており、アーキテクチャの詳細は SAN のホワイトペーパーに記載されています。コメント欄では、Needle がパラメータを持つ大規模言語モデル（LLM）に対してツールを選択したりクエリを振り分けたりする軽量ルーターとして有用である可能性が指摘される一方で、同じ FFN なし/クロスアテンションアプローチが要約タスクにも一般化できるかについては疑問の声が上がっています。また、技術的な注意点として、リポジトリに Python の pickle ファイルが含まれている可能性があり、コード実行やセキュリティのリスク、Python 固有の移植性の問題から推奨されないという指摘がありました。

ある技術的な指摘として、「FFN なし」という主張に関する点が挙げられました：外部の構造化知識が常にツール/RAG/検索を通じて提供される場合、モデルは重みに事実知識を保存するために FFN 層を必要としない可能性があります。これは、記憶ではなく提供された文脈に対する調整や grounding に特化した、コンパクトでアテンション重視のモデルという設計パターンが存在する可能性を示唆しています。

ある投稿者は、Python 固有の依存関係のカップリングや、デシリアライズ時の任意コード実行リスクのため、pickle ファイルの公開がますます稀になっていると指摘しました。別の投稿者は、Gemini がツール呼び出しにおいて目立つ挙動の不具合（ツールの特定性に関するシステムプロンプトレベルのパッチや、cat などの非効率的なファイル操作を避け、grep_search などの専用ツールを利用する傾向など）を持っていたことを強調し、もし Gemini 生成のトレースが蒸留データとして使用される場合、これは重要になり得ると述べています。

原文を表示

a quiet day.

AI News for 5/12/2026-5/13/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Agent Infrastructure, Harnesses, and Developer Platforms

Cline, LangChain, Notion, and Cursor all pushed deeper into agent platform territory: Cline open-sourced a rebuilt Cline SDK and refreshed CLI with a TUI, agent teams, scheduled jobs, and connectors, positioning its harness as a reusable substrate for custom coding agents. LangChain shipped a large batch of agent lifecycle infrastructure at Interrupt: LangSmith Engine, SmithDB, Sandboxes, Managed Deep Agents, LLM Gateway, Context Hub, and Deep Agents 0.6. The most technically notable piece is SmithDB, a purpose-built observability database for nested, long-running traces with large payloads, reportedly yielding 12–15× faster access on key workloads; the team says it is built atop Apache DataFusion and Vortex. In parallel, Notion’s External Agents API lets third-party agents such as Claude, Codex, Cursor, Decagon, Warp, and Devin operate directly inside Notion as a shared, reviewable context layer rather than another silo. Cursor expanded cloud agents with fully configured development environments including cloned repos, dependencies, version history, rollback, scoped egress, and isolated secrets.

Agent UX is increasingly about long-running state, streaming, and orchestration rather than chat: Several launches converged on the same design direction. Duet Agent proposes a state-machine harness for jobs that last weeks or months, with parent/sub-agent coordination and memory replacing compaction. LangChain’s OSS updates added streaming typed projections, checkpoint storage, code interpreter, harness profiles, and model-specific tuning, all aimed at richer agent event streams than plain tokens. Tabracadabra moved from autocomplete to a context-aware assistant in any textbox, while VS Code introduced an Agents window and better multi-project task review. The architectural message across these releases is that production agents increasingly need durable execution, inspectable intermediate state, and tool-native UI surfaces rather than stateless prompt/response loops.
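
The durable-execution idea in this paragraph can be illustrated with a small sketch. It is not the API of Duet Agent, LangChain, or any other product named here: the state names, the `Job` class, and the JSON checkpoint file are all hypothetical, chosen only to show a job that persists its state after every transition and resumes from disk instead of replaying a chat history.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field

# Hypothetical job states and transitions; not taken from any real harness.
TRANSITIONS = {
    "planning": "executing",
    "executing": "reviewing",
    "reviewing": "done",
}


@dataclass
class Job:
    job_id: str
    state: str = "planning"
    memory: list = field(default_factory=list)  # inspectable intermediate state


def save_checkpoint(job: Job, path: str) -> None:
    """Write to a temp file, then rename, so a crash never leaves a torn checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(job), f)
    os.replace(tmp, path)


def load_checkpoint(path: str) -> Job:
    with open(path) as f:
        return Job(**json.load(f))


def step(job: Job) -> Job:
    """Advance one transition and record it in the job's memory."""
    if job.state == "done":
        return job
    nxt = TRANSITIONS[job.state]
    job.memory.append(f"{job.state} -> {nxt}")
    job.state = nxt
    return job


def run(path: str, job_id: str, max_steps: int) -> Job:
    """Resume from the checkpoint if one exists, then take up to max_steps steps."""
    job = load_checkpoint(path) if os.path.exists(path) else Job(job_id)
    for _ in range(max_steps):
        if job.state == "done":
            break
        job = step(job)
        save_checkpoint(job, path)
    return job
```

Calling `run` once with `max_steps=1` and again later with a larger budget picks up where the first call stopped, which is the property a stateless prompt/response loop lacks. A real harness would replace the fixed transition table with model and tool calls.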

Model Training, Architecture, and Data Efficiency

Pretraining efficiency and architectural experimentation were the strongest research throughline: Nous Research’s Token Superposition Training modifies the early phase of pretraining so the model reads/predicts contiguous bags of tokens before reverting to standard next-token prediction; they report 2–3× wall-clock speedup at matched FLOPs with no inference-time architecture change, validated from 270M to 3B dense and 10B-A1B MoE. Jonas Geiping et al. argued current message-based/chat training overly constrains agents to a single stream and released a multi-stream LLM paper, with accompanying code, claiming lower latency, cleaner separation of concerns, and more legible parallel reasoning/tool use. δ-mem proposed an external online associative memory attached to a frozen full-attention backbone, with an 8×8 state reportedly improving average score by 1.10× and beating non-δ-mem baselines by 1.15×, with larger gains on memory-heavy benchmarks.

Post-training/compression and data curation also produced notable results: NVIDIA’s Star Elastic claims one post-training run can derive a family of reasoning model sizes, at 360× lower cost than pretraining a family and 7× better than SOTA compression. Datology’s VLM work, highlighted by Siddharth Joshi and Pratyush Maini, argues data curation alone can produce major multimodal gains: +11.7 points across 20 public VLM benchmarks at 2B, beating InternVL3.5-2B by roughly 10 points at about 17× less training compute, and near-frontier 4B performance with 3.3× lower response FLOPs than Qwen3-VL-4B. On the open data side, Percy Liang said the next Marin run already has 18T tokens in its mix and is still seeking more pretraining, mid-training, and SFT data; a companion token viewer was also shared.

Open evaluation and dataset work is maturing alongside model building: Kevin Li’s SWE-ZERO-12M-trajectories is positioned as the largest open agentic trace dataset: 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages. Victor Mustar flagged llama-eval as a step toward more comparable llama.cpp community evals. Meanwhile, Steve Rabinovich and Sayash Kapoor argued credible agent evaluation requires log analysis, not outcome-only metrics, because stronger agents expose hidden benchmark bugs and reward-hacking paths.

Enterprise AI Pricing, Platform Competition, and Distribution

Anthropic vs OpenAI competition sharpened around enterprise distribution and developer lock-in: Ramp data cited by Andrew Curran showed Anthropic at 34.4% of businesses vs OpenAI at 32.3% in April, the first apparent lead change in business adoption; The Rundown amplified the same figures. At the same time, Anthropic changed plan economics: ClaudeDevs announced that paid Claude plans will get a dedicated monthly credit for programmatic usage across the Agent SDK, claude -p, GitHub Actions, and third-party SDK apps. This was immediately read by power users as a major restriction on subscription-subsidized harnesses, with criticism from Theo, Jeremy Howard, Matt Pocock, and Omar Sanseviero. Anthropic partially offset that backlash with a separate 50% increase in Claude Code weekly limits through July 13, stacked on the previously announced 2× 5-hour limit increase.

OpenAI responded aggressively with Codex enterprise incentives: OpenAI Devs and Sam Altman offered two months of free Codex usage for enterprise customers switching in the next 30 days. OpenAI also published more technical platform detail, including a Windows sandbox design write-up describing the combination of local users, firewall rules, ACLs, write-restricted tokens, DPAPI, and helper executables needed to safely run coding agents with local filesystem/tool access. The competitive dynamic now looks less like “best model wins” and more like subsidy + workflow control + harness compatibility.

Enterprise adoption is increasingly tied to runtime/security assurances: Perplexity described a hardware-isolated sandbox architecture with VPC-level separation, short-lived proxy tokens, and scanning of external content before agent actions, with additional details on encryption and auto-deletion. Aravind Srinivas framed this as foundational to Perplexity becoming an enterprise knowledge/research platform. The broader pattern: agent vendors are no longer selling only intelligence; they’re selling bounded execution environments.
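
To make the "short-lived proxy token" idea concrete, here is a minimal sketch of a scoped, expiring credential signed with HMAC. It is an illustration only and not Perplexity's implementation: the token layout, the function names, and the `scope` strings are assumptions made for this example.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def issue_token(secret: bytes, scope: str, ttl_seconds: int,
                now: Optional[float] = None) -> str:
    """Return 'payload.signature', where payload is 'scope|expiry' (hypothetical layout)."""
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    payload = f"{scope}|{expires}".encode()
    sig = hmac.new(secret, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(sig).decode())


def verify_token(secret: bytes, token: str, scope: str,
                 now: Optional[float] = None) -> bool:
    """Accept only an untampered, unexpired token issued for exactly this scope."""
    now = time.time() if now is None else now
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return False
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        return False
    token_scope, _, expires = payload.decode().rpartition("|")
    return token_scope == scope and now < int(expires)
```

The design point is that the sandboxed agent only ever holds a credential that expires quickly and is limited to one scope, so a leaked token has a small blast radius. The long-lived secret stays with the proxy that issues and verifies tokens.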

Autonomous Science, Cyber Capability, and Robotics

Recursive self-improvement moved from idea to startup cluster: The largest single meta-theme was the launch of Recursive, founded to build AI that automates science and safely improves itself. Launch posts from Richard Socher, Josh Tobin, Dominik Schmidt, Jenny Zhang, and Shengran Hu suggest a team drawn from open-endedness, AI Scientist, and research automation work. In adjacent work, Adaption’s AutoScientist aims to automate the full training-research loop outside frontier labs, with Sarah Hooker arguing that most model training failures are due to research-loop brittleness rather than mere compute scarcity.

Cyber capability evaluations continue to steepen: The UK AI Security Institute said the length of cyber tasks frontier models can complete has been doubling every few months, and that recent models are beating prior trends. Anthropic/Glasswing’s Logan Graham said Claude Mythos Preview is the first model to solve both AISI end-to-end cyber ranges, including Cooling Tower, and the only one to clear every task under the institute’s 2.5M-token cap. XBOW reportedly found “token-for-token, unprecedented precision,” and partner usage allegedly surfaced thousands of high/critical vulnerabilities in weeks. Independent commentary from scaling01 claimed a newer Mythos version completed a cyber range 6/10 times vs 3/10 for the preview baseline.

Robotics got a concrete long-horizon deployment demo: Figure’s Brett Adcock streamed humanoid robots running a full 8-hour autonomous shift on package sorting using Helix-02, with follow-up details that the robots reason from camera pixels, operate around human parity (~3s/package), perform on-device inference, coordinate as a networked fleet, autonomously swap for low battery, and self-diagnose/fail over to maintenance when needed. This is one of the clearer public demonstrations of multi-robot, long-duration, no-human-in-the-loop orchestration rather than a short benchmark clip.

Top tweets (by engagement)

Claude Code pricing and limits: @ClaudeDevs on 50% higher weekly limits, @ClaudeDevs on programmatic credits, and the ensuing developer backlash from @theo made pricing policy the day’s most consequential developer story.

Codex enterprise push: @sama offering two free months of Codex usage for switchers and @OpenAIDevs’ enterprise call-to-action signaled an unusually direct go-to-market counterpunch.

Figure’s 8-hour humanoid shift: @adcock_brett’s livestream post drew enormous attention and is one of the few viral posts in the set with clear technical substance.

Cline SDK launch: @cline’s SDK release was one of the highest-engagement genuinely technical launches, reflecting demand for open coding-agent harnesses.

Token Superposition Training: @NousResearch’s TST post stood out as a rare pretraining-method tweet that broke through widely, likely because the claim—2–3× training speedup without changing inference-time architecture—is concrete and economically important.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Efficient On-Device LLM Inference

Needle: We Distilled Gemini Tool Calling Into a 26M Model (Activity: 451): Cactus Compute open-sourced Needle, a 26M-parameter single-shot function/tool-calling model using a “Simple Attention Network” architecture (attention + gating with no FFNs/MLPs), arguing tool use is mainly retrieval/slot extraction/JSON assembly rather than deep reasoning. It was pretrained on 200B tokens over 16 TPU v6e in 27h, post-trained on 2B Gemini-synthesized function-calling tokens in 45m, claims 6000 tok/s prefill and 1200 tok/s decode on consumer devices, and reportedly beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling; code and weights are MIT-licensed on GitHub and Hugging Face, with architecture notes in the SAN writeup. Commenters framed Needle as potentially useful as a lightweight router that selects tools or dispatches parameterized queries to larger LLMs, while questioning whether the same no-FFN/cross-attention approach could generalize to summarization. One technical caution noted the repository apparently includes Python pickle files, which are discouraged due to code-execution/security risks and Python-specific portability issues.

A technical point was raised around the claimed “no FFN” result: if external structured knowledge is always supplied via tools/RAG/retrieval, the model may not need FFN layers to store factual knowledge in weights. This implies a possible design pattern where compact attention-heavy models specialize in orchestration or grounding over provided context instead of memorization.
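
A rough parameter count shows why dropping FFNs matters for a model this small. The sketch below assumes a conventional transformer block (four `d_model × d_model` attention projections and a two-layer FFN with 4× expansion, biases ignored). Needle's actual dimensions and the cost of its gating layers are not given in the post, so `d_model = 512` is purely illustrative.

```python
def attention_params(d_model: int) -> int:
    """Q, K, V, and output projections, each d_model x d_model (biases ignored)."""
    return 4 * d_model * d_model


def ffn_params(d_model: int, expansion: int = 4) -> int:
    """Two linear layers: d_model -> expansion * d_model -> d_model (biases ignored)."""
    return 2 * d_model * (expansion * d_model)


def ffn_share(d_model: int, expansion: int = 4) -> float:
    """Fraction of a conventional block's weights that sit in the FFN."""
    attn = attention_params(d_model)
    ffn = ffn_params(d_model, expansion)
    return ffn / (attn + ffn)


# Illustrative width only; not Needle's real configuration.
d_model = 512
print(attention_params(d_model))  # 1048576
print(ffn_params(d_model))        # 2097152
print(ffn_share(d_model))         # about 0.667
```

Under these assumptions the FFN holds about two thirds of each block's weights regardless of width, so a model that can rely on supplied context instead of memorized facts has a large budget to cut.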

One commenter noted that publishing pickle files is increasingly uncommon because of Python-specific dependency coupling and arbitrary-code-execution risks during deserialization. Another highlighted that Gemini has had visible tool-calling quirks, including system-prompt-level patches around tool specificity and avoiding inefficient file operations like cat in favor of dedicated tools such as grep_search, which could matter if Gemini-generated traces were used as distillation data.
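
The deserialization risk the commenter describes is easy to demonstrate with the standard library. In this sketch the `Exploit` class is a harmless stand-in for a malicious payload (it calls `print`, where an attacker would call something like `os.system`), and `SafeUnpickler` follows the restricted-globals pattern from the Python documentation. Neither is taken from the Needle repository.

```python
import io
import pickle


class Exploit:
    """Stand-in for a malicious object embedded in a downloaded checkpoint."""

    def __reduce__(self):
        # pickle.loads would call print(...) on load; an attacker could
        # substitute any importable callable here.
        return (print, ("arbitrary code ran during unpickling",))


class SafeUnpickler(pickle.Unpickler):
    """Refuses every global lookup, so only plain containers and scalars load."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def safe_loads(data: bytes):
    return SafeUnpickler(io.BytesIO(data)).load()


# Plain data still round-trips.
print(safe_loads(pickle.dumps({"weights": [0.1, 0.2]})))

# The payload is rejected before its callable is ever resolved.
try:
    safe_loads(pickle.dumps(Exploit()))
except pickle.UnpicklingError as err:
    print(err)
```

Restricting globals reduces the risk but does not remove the Python-specific coupling, which is why tensor-only formats such as safetensors are generally preferred for published weights.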
