Latent Space·2026年5月2日 16:21·約25分

AI エンジニア世界博覧会、第 2 回スピーカー募集を開始

#Autoresearch #Agentic Commerce #World Models #Vertical AI #Robotics

TL;DR

Latent Space が主催する AI Engineer World's Fair の第 2 回スピーカー募集を開始し、Autoresearch や Agentic Commerce など新たな技術トレンドと特定業界への応用を重点テーマとして掲げている。

AI深層分析2026年5月2日 18:28

重要/ 5段階

深度40%

キーポイント

新設された主要講演トラック

Autoresearch（自己改善ループ）、Memory（ユーザー利用によるモデルの進化）、World Models（空間知能と敵対的推論）、Tokenmaxxing（AI 採用の効率化）、Agentic Commerce（エージェント間取引）など、最新の技術動向に特化した新トラックが追加される。

特定業界への垂直統合 AI の強化

法務、ヘルスケア、GTM（Go-To-Market）、金融分野における AI 応用を重点的に募集しており、政府や教育分野の投稿も受け付けるが、前者の方がより活発な動向と見なされている。

ロボティクス展示スペースの無料提供

昨年の成功を受け、今年度は優れたロボットデモのためにエキスポフロアのスペースを無料で割り当てており、ヒューマノイドの場合は同伴者の付帯条件がある。

スタートアップ・バトルフィールドの開催

シリーズ A 前のスタートアップがトップ VC やゲスト審査員にピッチできる新イベント「Startup Battlefield」が追加され、資金調達や採用の機会を提供する。

Grok 4.3 のコストパフォーマンス向上と評価の分裂

xAI は Grok 4.3 をリリースし、入力・出力価格を大幅に削減して GDPval-AA で Elo が 321 上昇したが、非ハルシネーション率が低下したことで信頼性への懸念が残っている。

コミュニティ反応と構造的な批判

一部からはトークン効率の向上が評価された一方、中国製オープンソースモデルや Vending-Bench 2 での性能低下を指摘する声もあり、低価格がハードウェア利用率の悪さによるものではないかという批判もある。

DeepSeek V4 Pro とオープンウェイトの進化

記事タイトルおよび導入部で DeepSeek V4 Pro のリリースとビジョン・空間推論能力、そしてオープンウェイトモデルがクローズドモデルとの差を縮めていることが示唆されている。

影響分析・編集コメントを表示

影響分析

このニュースは、AI エンジニアリング分野のトレンドが「モデルの規模拡大」から「自律的な改善」「エージェント間経済」「特定領域への深層適用」へと急速にシフトしていることを示唆しています。特にロボティクスや垂直領域 AI に対する具体的な支援策（無料展示スペースなど）は、これらの分野の実装と普及を加速させる重要なインフラ整備と言えます。

編集コメント

昨年の成功を受け、今年はより具体的な技術トレンド（自己改善ループやエージェント経済）と実装事例（垂直領域 AI）に焦点を当てたプログラム構成となっています。特にロボティクス分野への無料展示スペース提供は、物理世界との統合を目指すスタートアップにとって大きな追い風となるでしょう。

TL;DR: 今年夏開催の AIE World's Fair のスピーカー募集第 2 弾を発表します。こちらから応募してください：https://sessionize.com/aiewf2026/ 特に、Autoresearch（自己研究）、Memory（記憶）、World Models（世界モデル）、Tokenmaxxing（トークン最大化）、Agentic Commerce（エージェント型商取引）、および法律・ヘルスケア・GTM（Go-To-Market 戦略）・金融分野における Vertical AI（垂直特化型 AI）といった、新たに設けたトラックに関連するプロジェクトをお持ちの方は、ぜひご応募ください！

1 月に「Scaling without Slop（ノイズなしのスケーリング）」に関する計画を提示しましたが、コンテンツ枯渇のリスクが一部指摘されるものの、反響は好意的で、AIE の視聴者数は 2025 年のピーク時を少なくとも倍増させる傾向にあり、現在では月間 100 万人以上のユニークな AI エンジニアにリーチしています。

image

今年はモスコーン・ウェスト（Moscone West）で開催する初年度であり、3 年連続で規模を倍増させ、AI エンジニアリングの世界全体をサンフランシスコに集結させるという使命を果たします。ここでは、今年知っておくべき研究やプロダクトエンジニアリングの成果を紹介するだけでなく、採用、資金調達、ビジネス契約の締結も目指しています。販売は順調ですが、例年通り、世界博覧会（World's Fair）に向けて年に一度呼びかけを行い、普段は登壇を考慮しない人々にも参加の機会を広げるように努めています（「興味があるとは知らなかった」という理由で投稿を見送ってしまう方々もいるためです）。

今年、スケジュールに丸一日分の講演を追加します。2025年および欧州版で取り上げた永続的なテーマに加え、私が特に申請（そしてスポンサー！）を募っている新しいトピックをいくつか追加します：

Autoresearch: ハンズオンやモデルトレーニングにおける再帰的自己改善ループ！

Tasteful Tokenmaxxing: 企業リーダーとして、AI エンジニアリングチームの AI ネイティブ化を10倍にし、AI 採用を拡大しつつも、グッドハート（Goodhart）による浪費を防ぐにはどうすればよいでしょうか？

Memory: ユーザーがエージェントやモデルを使用するにつれて、それらはどのように改善されているのでしょうか？

World Models: 空間知能と敵対的推論の問題をどのように解決しているのでしょうか？

Agentic Commerce: エージェントはデータ、API、および他のエージェントに対してどのように支払いを行っているのでしょうか？

Law, Healthcare, GTM（Go-To-Market）、Finance における垂直領域 AI：これらの特定のドメインで AI をどのように適用しているのでしょうか？政府向け AI や教育向け AI の投稿も受け付けていますが、一般的にこれらはあまり急速には進展していないようです。

Robotics: 昨年は、Physical Intelligence、Waymo、Tesla、Nvidia、K-Scale（RIP）などが自律性へのアプローチを発表しました。今年は、優れたロボティクスデモのために無料の展示フロアスペースを割り当てます。（デモエリアの設定は hello@ai.engineer までご連絡ください。ヒューマノイド型ロボットには必ず同席者が必要です。）

Founders: シリーズ A 前の企業を上場予定のトップ VC やゲスト審査員の前でピッチできる、新しいスタートアップバトルフィールドイベントが追加されます。

他にも新しいトラックがありますので、完全な応募フォームでご確認ください（トラックに縛られず、最高の作品を提出していただければ、私たちが適切な場所を見つけます）

image

すでに Wave 1 で応募し、採用された方は、その旨をお知らせするメールが受信トレイに届いているはずです。もし届いていない場合はご安心ください。Wave 2 でも引き続き審査対象となりますので、追加のアクションは不要です。

これは、今年最大の技術系 AI イベントへの応募を募集していることを知らなかった方々すべてに向けたものです。特に、これらの話題について完璧な講演者を知っている方がいれば、その方に届けるためにあなたの協力を必要としています。

こちらから応募し、チケットや渡航の手配を至急行ってください（その週にサンフランシスコで開催されるワールドカップも混雑が激しく、すぐに埋まってしまうためです）。採用された方には費用を全額返金いたします。また、国際ビザ用の招待状が必要な場合は、hello@ai.engineer までご連絡ください。

2026 年 4 月 30 日〜5 月 1 日の AI ニュース。12 のサブレッドと 544 の Twitter（X）投稿を確認しました。Discord はさらに確認していません。AINews のウェブサイトでは、過去のすべての号を検索できます。念のため、AINews は現在 Latent Space の一部となっています。メールの受信頻度を選択・解除することができます！

AI Twitter レビュー

Grok 4.3 のリリース、ベンチマークの差分、そしてオープン vs クローズドの最前線

xAI は、コスト対パフォーマンスが大幅に改善された Grok 4.3 をリリースしましたが、評価結果は賛否両論でした。初期の議論では、@scaling01 による API の即時ローンチの噂が浮上し、その後 Artificial Analysis によって詳細なベンチマーク分析が発表されました。同社のインテリジェンス指数（Intelligence Index）において、Grok 4.3 はスコア 53 を記録し、Grok 4.20 より 4 ポイント上昇しました。入力コストは約 40% 低下し、出力コストも約 60% 低下しています。最大の改善点は GDPval-AA（GDPval-AA）で、Elo レーティングが 321 上昇して 1500 に達し、現実世界におけるエージェントタスクのパフォーマンス強化を示唆しています。また、τ²-Bench Telecom では 98% のスコアを達成し、IFBench でも 81% を維持しました。その代償として、AA-Omniscience（AA-Omniscience）の精度は向上したものの、ハルシネーション（幻覚・誤情報生成）の抑制率は 8 ポイント低下し、能力が強化されたにもかかわらず信頼性への懸念が残っています。Arena はすでに @arena を通じて、テキスト、ビジョン、ドキュメント、コードの各モードで Grok 4.3 の利用を開始しています。

コミュニティの反応は、「意味のある進化」と「依然としてトップオープンモデルに遅れをとっている」の二極化しました。いくつかの投稿では、Grok が批判者が認める以上に急速に進化していると主張する声があり、@teortaxesTex はトークン効率性の向上にも言及しています。一方で、懐疑的な意見も根強くあります。@scaling01 は「Grok-4.3 は依然として中国製オープンソースモデルに劣る」と主張し、Andon Labs は Vending-Bench 2（Vending-Bench 2）で重大な後退が報告されたと発表しました。そこでは Grok が行動するよりも「睡眠」を選ぶ傾向があるとされています。より構造的な批判は価格設定とインフラの経済性に関するものでした。@teortaxesTex は、Grok の低価格が不良なハードウェア利用率によって補助されている可能性を指摘し、キャッシュ経済（cache economics）がモデル品質だけでなく、エージェントの総所有コスト（TCO: Total Cost of Ownership）を決定する要因としてますます重要になっていると論じました。

DeepSeek V4 Pro、ビジョン/空間推論、そしてオープンウェイトの格差縮小

DeepSeek V4 Pro は、このバッチにおいて最も信頼性の高いオープンウェイトのコーディング/エージェントモデルであるように見えます：最も強力な実地レポートは @omarsar0 氏によるもので、同氏は Pi コーディングエージェント内で DeepSeek-V4-Pro をテストし、マルチターン・アジェンティック・コーディングにおいて Codex や Claude Code に実際に匹敵する最初のオープンウェイトモデルであると記述しました。主要なシステム詳細には、1M のコンテキスト（文脈）、ハイブリッド CSA/HCA アテンション設計、KV キャッシュの 10% への削減、および長文コンテキストにおける推論 FLOPs がほぼ 4 倍低下することが含まれます。また、このレポートでは実用的なハーンチの適合性も強調されました：カスタムセットアップは不要で、安定したトレースが可能であり、Fireworks 推論上で多段階のリサーチ/コーディングループが実行可能です。

より広範なベンチマークの状況は、オープンウェイトが現在大幅に近づいていることを確認していますが、最も困難なタスクでは依然として劣位にあります：Artificial Analysis によると、先週リリースされた 3 つの主要なオープンウェイトモデル—Kimi K2.6、MiMo V2.5 Pro、および DeepSeek V4 Pro—は、インテリジェンス指数で 52〜54 のスコアを獲得しており、一方 Gemini 3.1 Pro Preview と Claude Opus 4.7 は 57、GPT-5.5 は 60 です。これらのトップオープンモデルはいずれも、許可されたライセンスを持つトリリオン超の MoE（Mixture of Experts）システムです：Kimi は 1T/32B アクティブ、MiMo は 1T/42B アクティブ、DeepSeek V4 Pro は 1.6T/49B アクティブです。残りの格差は、HLE、CritPt、TerminalBench Hard、および幻覚（ハルシネーション）が深刻な Omniscience に集中しています。

DeepSeek のマルチモーダル方向性は、明示的な空間的グラウンディングを中心に据えているようです。@teortaxesTex 氏による推測では、DeepSeek-Vision が実際の空間推論能力により ARC-AGI-2 において V4-Pro を上回るとされています。後に ZhihuFrontier から一時的に投稿され削除された技術レポートの要約によると、「思考しながら指し示す」ことができるマルチモーダル CoT（Chain of Thought）システムが紹介されました。これは、推論トレース内に直接埋め込まれたボックスやポイントを用いて、数え上げ、迷路解決、経路追跡における「参照ギャップ」を軽減するものです。このスタックには DeepSeek-ViT、CSA 圧縮、V4-Flash（総パラメータ 284B / アクティブ 13B）が採用されていると報じられています。初期テストでもまだ弱点が見られるものの、これは注目に値するアーキテクチャ上の賭けです。視覚的推論を単なるテキスト記述ではなく、明示的なグラウンディング計算へと転換させるというものです。

Codex の急速な製品拡大 vs Claude Code, Devin, およびその他のエージェントランタイム

Codex は、単なるベースモデルの品質だけでなく、製品速度と UX の洗練さにおいて勝利しています：ツイート全体に共通する主要なテーマは、Codex アプリがどれほど迅速に進化しているかという点でした。@gdb や @theo といったユーザーから、他社製品と比較してその使い心地が優れているとの高エンゲージメントの称賛が寄せられました。OpenAI は、@JamesZmSun によると、レスポンシブテスト用のデバイスツールバーを追加し、「バイブテスト」においてブラウザ使用速度を約 30% 向上させました。また、@reach_vb を通じてチャット内に CI ステータスを表示する機能を追加し、OpenAI 経由で設定・プラグイン・エージェントの移行・インポートツールも提供しました。さらに、@OpenAIDevs によって Codex 内で予想外にバイラルとなったペットシステムも実装されました。 whimsical（おとぎ話のような）要素はありますが、ユーザーから繰り返し指摘された点は、OpenAI が単なるモデルエンドポイントではなく、統合された環境を提供しているという点です。

Codex と Claude Code の対比は、ますます UX・速度・品味のトレードオフとして語られるようになっています：@theo は現在のコーディング界隈の最前線を要約し、GPT-5.5 は「より賢く、あなたの詰まりを解消できる」一方で、Opus 4.7 は意図や品味において優れているが、迷走する傾向があると述べました。2 つ目の投稿では、Claude Code は TTFT（First Token Time）や TPS（Tokens Per Second）の観点で明らかに遅く、より多くのツール呼び出しを必要とする一方、GPT/Codex は「高速モード」のような使い方に適しており、より直接的で経済的であると主張しました（ツイート）。しかし、公開されたベンチマーク比較結果は賛否両論です：@scaling01 は、Claude Code のハーンネス（評価枠組み）における PostTrainBench で GPT-5.5 が Opus 4.7 に勝てなかったと指摘し、結果がいかに評価枠組みに依存しているかを浮き彫りにしました。

他のエージェントランタイムも同様のプリミティブに収束しています：Devin は @cognition 経由で「シェル内」ホットキーアクセスを開始しました。Hermes は @Teknium 経由で、完了するまでエージェントを継続させるよう強制するスーパーバイザーモデルを持つ /goal ループを追加しました。@FredKSchott が紹介した Flue は、「Claude Code のようにプログラム可能な」ヘッドレス自律型エージェント向けの TypeScript フレームワークとして位置づけられています。これらの発表に共通するパターンは、競争の場が生のモデル知能からエージェントハッチ設計へと移行していることです：サブエージェント、ブラウザ使用、永続状態、圧縮、スキル、フィードバックループです。

エージェントインフラストラクチャ：検索、メモリ、ヒューマン・イン・ザ・ループ（HITL）、および永続実行

最も強力な研究シグナルは、エージェントシステムがモデルの品質だけでなく、ランタイム設計によってボトルネックに陥っているという点でした。特に有用な 2 つの研究論文が注目されました。まず、@omarsar0 が要約した ReaLM-Retrieve は、推論モデルは事前のみならず推論中にも検索が必要であると主張しています。これは標準的な RAG（Retrieval-Augmented Generation）に対して絶対値で +10.1% の F1 スコア向上を報告し、固定間隔の IRCoT 法と比較して検索呼び出し回数が 47% 減少、1 回の検索あたりのオーバーヘッドが 3.2 倍低減されています。次に、@dair_ai が共有した OCR-Memory は、長期にわたる軌跡を画像としてインデックス付きアンカーと共に保存し、劣化のあるテキスト要約ではなく正確な過去のコンテンツを検索します。これは厳格なコンテキスト制限下で Mind2Web および AppWorld において SOTA（State-of-the-Art）を達成しています。

LangChain/LangGraph は、マルチユーザーおよび人間をループに組み込んだエージェントのための本番環境用プリミティブに注力しました。@sydneyrunkle 氏は、データ分離、委任された認証情報、オペレータの RBAC（ロールベースアクセス制御）という 3 つの具体的なマルチユーザー展開上の懸念点を提示し、それぞれを LangSmith Agent Server の機能にマッピングしました。後の投稿では、人間の返信をツール結果として直接返却できる新しい HITL モードや、重要なアクションや未解決の判断呼び出しに対する永続的な一時停止/再開のセマンティクス（意味論）について取り上げられました（ツイート）。これは、実際の展開における複雑性がどこへ向かっているのかを示す良いスナップショットです：認証境界、永続状態、明示的な介入ポイント。

永続的実行は、あらゆるスタックにおいてファーストクラスのランタイム機能へと進化しています。Cloudflare は @celso 氏を通じて、エージェントプランに永続的実行を追加するための Dynamic Workflows を発表しました。LangChain は、Deep Agents の下層にある低レベルプリミティブとして create_agent を位置づけ、@Vtrivedy10 氏により、ファイルシステム、bash、コンパクション（圧縮）、フック、サブエージェントへの拡張性を備えています。メタ的なポイントは、関連する技術ブログと一致しており、エージェントランタイム自体—サンドボックス化、再生、チェックポイント、オーケストレーション—が隠れた技術的負債となり、差別化の主要な源泉となっているという点です。

ブックマークすべき研究およびシステム論文

再帰的/潜在空間におけるマルチエージェント協調は、テキストベースのエージェント間の会話に代わる真剣な代替手段として台頭しています：@omarsar0 は、エージェントが完全な自然言語のやり取りではなく、共有された潜在的な再帰計算を通じて通信する「再帰的多エージェントシステム」を要約しました。報告されている成果には、9 つのベンチマーク全体で平均精度が 8.3%向上し、エンドツーエンドの速度が 1.2 倍から 2.4 倍に向上し、トークン数が 34.6% から 75.6%削減されたことが含まれます。もしエージェント間通信のコストが支配的になる場合、この研究ラインは極めて重要となります。

Meta FAIR の「自己改善型事前学習」というアイデアは、一連のトレーニング時論文の中でも特に重要なものの一つとなる可能性があります：@omarsar0 は、強力な事後学習モデルが事前学習の接尾辞をより安全で高品質な続行へと書き換え、その後 RL（強化学習）スタイルの事前学習中にモデルのロールアウトを評価する方法を紹介しました。報告されている改善点には、標準的な事前学習と比較して事実性が 36.2%相対的に向上し、安全性が 18.5%向上し、生成品質における勝率が最大で 86.3%に達したことが含まれます。

Microsoft の合成された長期ホライズンのコンピュータ利用世界は、信頼性の高いデータレシピのように見えます：@dair_ai は、現実的なファイルとドキュメントを備えた 1,000 台の合成コンピュータを作成し、平均して 2,000 トーン以上のターンを含む 8 時間のエージェントシミュレーションを実行するシステムについて説明しました。その主張は明快かつ重要です：コンピュータ利用型エージェントにとって、ボトルネックはもはやモデル能力だけでなく、スケーラブルで現実的な経験データにあるのです。

エンゲージメント上位のツイート

OpenAI/Codex の勢い：OpenAI は GPT-5.5 がこれまでの中で最も強力なリリースであると発表し、API 収益は過去のリリースに比べて 2 倍の速度で成長、Codex も 7 日未満で収益が倍増したと述べています。

防衛・政府部門での採用：米国の「戦争省」CTO が、機密ネットワーク上で機能を展開するため、7 つの最先端 AI およびインフラ企業との合意を発表しました。

OpenAI の労働に関するメッセージ転換：サム・アルトマンは、「我々は人々を代替する存在ではなく、人々を補完し高めるためのツールを構築したい」と述べ、雇用と将来の仕事についての追加コメントもここで発表されています。

Codex の採用と満足度：@gdb 氏より「Codex アプリが信じられないほど素晴らしいものになっている」との声があり、さらに Codex のペット機能が予期せぬ形で当日の最大の製品エンゲージメントヒットの一つとなりました。

モデルベンチマークの実態確認：ARC プライズは、ARC-AGI-3 において GPT-5.5 が 0.43%、Opus 4.7 が 0.18% と報告し、失敗モードの分析も含まれています。

AI Reddit レビュー

/r/LocalLlama + /r/localLLM レビュー

Qwen モデルの開発とベンチマーク

PFlash：RTX 3090 で 128K のコンテキスト長において、llama.cpp を用いた場合と比較して 10 倍のプレフィル速度向上を実現（アクティビティ数：339）：本投稿では、量子化された 27B モデルを対象とした長期コンテキストデコーディングのための推測型プレフィル手法「PFlash」を紹介しています。これは C++/CUDA で実装され、RTX 3090 上で従来の llama.cpp と比較して 10 倍の高速化を達成します。この手法は、小さなドラフターモデル（drafter model）を用いてトークンの重要度を評価し、メインモデルが重要なスパンにのみ集中させることで、プレフィル時間を大幅に短縮します。実装には、最近の研究で提案された推測型プレフィルとブロックスパースアテンションの知見を組み合わせ、Python や PyTorch を一切使用せず C++/CUDA のみで完結させているため、RTX 3090 などのコンシューマー向け GPU でも効率的に動作します。リポジトリは GitHub で公開されています。一部のコメントでは、主張されている 10 倍の速度向上に対して懐疑的な意見が示されており、あるユーザーはその圧縮手法により「極めて情報損失が大きい（super lossy）」可能性を指摘しています。また、別のユーザーは RTX 4090 でメモリ不足（out-of-memory）が発生したと報告しており、結果の再現には潜在的な課題があることを示唆しています。

randomfoo2 は、PFlash の革新的なアプローチについて言及しています。これは、より小さな Qwen3-0.6B ドラフターを使用して、FlashPrefill や BSA スタイルのスパースアテンション（sparse attention）で 64K/128K のフルプロンプトを処理し、計算コストを削減するものです。ドラフターはトークンやスパンの重要度を評価し、27B ターゲットモデルがプリフィルするために必要な重要なサブセットのみを保持します。その後、圧縮されたターゲット KV に対して DFlash と DDTree を用いた推測デコーディング（speculative decoding）が行われます。この手法は「極めて損失率が高い」と評されており、速度向上のために精度とのトレードオフが生じる可能性があることを示唆しています。

qwen_next_gguf_when は、PFlash メソッドの実用性について懸念を表明し、DFlash コンポーネントが RTX 4090 でメモリ不足（OOM: Out Of Memory）を起こしやすいと指摘しています。これは、ハードウェアの互換性や効率性に潜在的な制限があることを示唆しており、異なるシステム間での再現可能性やスケーラビリティに影響を与える可能性があります。

Obvious-Ad-2454 は、主張されている 10 倍の速度向上に対して懐疑的な見解を示し、独立した検証なしでは楽観的すぎる可能性があるとしています。このコメントは、特にこのような顕著な改善が報告される場合、機械学習における性能主張を検証するために再現研究が重要であることを強調するものです。

Qwen 3.6 27B と Gemma 4 31B の比較 — パックマンゲーム作成！（アクティビティ：994）：ローカル LLM ゲーム開発コンテストにおいて、MacBook Pro M5 Max（RAM 64GB）上でパックマン風ゲームを作成するタスクで、Gemma 4 31B が Qwen 3.6 27B を上回りました。Gemma は秒間 27 トークンを処理し、6,209 トークンで 3 分 51 秒で完了しました。一方、Qwen は秒間 32 トークンを処理しましたが、33,946 トークンを使って 18 分 04 秒を要しました。Qwen の出力はより創造的で視覚的にスタイリッシュでしたが、Gemma の解決策は短く、明確で論理的であり、ゲームロジック、インタラクション処理、パフォーマンスの安定性において優れていました。このタスクでは、外部ライブラリを使用せず手描き（プロシージャル）グラフィックスを含む完全な HTML ベースのゲームを生成し、requestAnimationFrame とデルタタイム（delta time）を用いて滑らかなゲームプレイと安定したパフォーマンスを実現することが求められました。コメント欄では、「バグなし」というプロンプトの要求に対するユーモアが指摘され、曖昧なプロンプトの実用性が疑問視されました。これらは主にモデルの既存知識をテストするものであり、問題解決能力を試すものではないという意見も示されています。

Qwen 3.6 27B は、単一の HTML ページと必要と判断したあらゆるライブラリやグラフィックスソースを使用してパックマンのクローンを作成するタスクを与えられました。興味深いことに、このモデルは外部ダウンロードや調査を行わず、既存の知識に基づいてゲームをコーディングしました。これは、最小限のプロンプトから機能的なコードを生成するモデルの能力を示していますが、その理解の深さや新しいリソースへの適応力については疑問を投げかけています。

あるユーザーは、Pacman ゲームのGemma 4 31Bバージョンにおけるゴースト敵の動きに不具合があることを指摘しました。これは、特に動的要素である敵AIを扱う際に、モデルがゲームロジックを正確に実装する能力に潜在的な問題がある可能性を示唆しています。これはPacmanのようなゲームにとって極めて重要です。

この議論は、AIモデルのテストにおいて曖昧なプロンプトを使用することの有効性について懸念を提起しています。あるコメント投稿者は、そのようなプロンプトを「ベンチマーク最大化（benchmaxxing）テスト」と表現しました。これは、これらのテストがモデルの問題解決能力や新しいタスクへの適応力を効果的に評価するものではなく、むしろ既存の知識ベースを評価するものであることを示唆しています。

Qwen-Scope：Qwen 3.5 モデル用の公式スパース自己符号化器（SAEs）（アクティビティ数：437）：Qwen チームは、2B から 35B の MoE（Mixture of Experts）モデルにわたる Qwen 3.5 モデル向けのスパース自己符号化器（Sparse Autoencoders: SAEs）のセットである「Qwen-Scope」をリリースしました。このツールは全層にわたる内部特徴をマッピングし、モデルの内部概念の辞書として機能します。これにより、「法的な議論」や「Python コード」といった特定の特徴を精密に操作することが可能になります。主な機能には、特定の機能を抑制するための外科的アビリテーション（Surgical Abliteration）、望ましい概念を活性化させるための特徴ステアリング（Feature Steering）、トークントリガーによる方向性を特定するためのモデルデバッグ（Model Debugging）、および特徴の活性化を検証するためのデータセット分析（Dataset Analysis）が含まれます。このツールは Apache 2.0 ライセンスの下でリリースされていますが、安全性フィルターの除去については注意喚起がなされています。実用的な例としては、ヒートマップを用いて過剰に活性化された機能を特定し、予期しない言語の切り替えを診断するケースがあります。詳細は Qwen-Scope の論文および Hugging Face Space で確認できます。コメント欄では、このリリースの重要性が強調されており、規模において Google の GemmaScope を上回る可能性のある、密なモデル向けの最大級のオープンソース解釈性ツールであるとの指摘があります。また、Qwen 3.6 など将来のバージョンでも同様の機能が組み込まれることへの期待も示されています。

NandaVegg は、密な 27B Qwen モデルに対するスパース自己符号化器（Sparse Autoencoders: SAEs）のリリースの重要性を強調し、これがこれまでのオープンソース解釈性ツールの中で最大規模のものとなる可能性があると指摘しています。これは、9B や 2B といった小規模モデルのみをサポートしていた従来のツールである GemmaScope と対照的であり、モデルの解釈性能力における大幅な進歩を示唆するものです。

robert896r1 は、Qwen 3.6 のリリースまたは現在のツールのコミュニティ主導による適応版の公開を期待しています。

原文を表示

TL;DR: we are announcing Wave 2 Call for Speakers for AIE World’s Fair this summer - apply here: https://sessionize.com/aiewf2026/ ESPECIALLY if you have projects relevant to our new tracks in Autoresearch, Memory, World Models, Tokenmaxxing, Agentic Commerce, and Vertical AI in Law, Healthcare, GTM and Finance!

In January we laid out plans for Scaling without Slop and despite some content exhaustion risk, your reception has been positive, with AIE viewership now trending to at least double 2025’s peak, serving over a million unique AI engineers a month.

This year is our first in Moscone West, doubling for the 3rd year in a row in our mission to bring all of the AI Engineering world to San Francisco to showcase the must-know research and product engineering work of the year, as well as to hire, fundraise, and close business deals. Sales are going well, but traditionally we do one callout a year for the World’s Fair to widen our net for people who might not traditionally think to submit a talk (because they didn’t know we were interested!).

This year we are adding an entire day’s worth of talks to the schedule, so on top of the all the evergreen themes we covered in 2025 and in Europe, we’re adding a few more new ones that I am specifically soliciting applications (and sponsors!) to cover:

Autoresearch: recursive self improvement loops in harnesses and model training!

Tasteful Tokenmaxxing: as a company leader, how do you make your AI Eng teams 10x more AI-Native/scale AI adoption, BUT without Goodharting waste?

Memory: how are your agents/models improving as your users use them?

World Models: how are you solving spatial intelligence and adversarial reasoning?

Agentic Commerce: how are agents paying for data, APIs, and other agents?

Vertical AI in Law, Healthcare, GTM and Finance: how are you applying AI in these specific domains? We are also open to submissions for AI in Government and AI in Education, though generally these seem less fast-moving.

Robotics: last year, Physical Intelligence, Waymo, Tesla, Nvidia, K-Scale (RIP) and others presented their approaches to autonomy; this year WE ARE ALLOCATING FREE EXPO FLOOR SPACE FOR GOOD ROBOTICS DEMOS. (contact hello@ai.engineer to set up your demo area! Humanoids must be accompanied.)

Founders: a new Startup Battlefield event will be added where you can pitch your pre-series A company to our panel of top VCs and guest judges.

There are other new tracks, which you can find in the full application form (don’t constrain yourself to tracks, just submit your best work and we’ll find a place for you)

If you already applied and were accepted in Wave 1, you should receive an email in your inbox informing you so - if not, don’t fret, you’ll still be considered in Wave 2, no further action needed.

This is for everyone else who weren’t aware we are soliciting applications for the biggest technical AI event of the year - especially if you know someone who would be PERFECT to talk about some of these topics we are calling out, then we need your help to reach them.

Apply here - and book your ticket/travel asap (because things are filling up fast for the World Cup also taking place in SF that week) — we will refund successful applicants. (Also contact hello@ai.engineer if you need an invitation letter for international visa).

AI News for 4/30/2026-5/1/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Grok 4.3’s Release, Benchmark Deltas, and the Open-vs-Closed Frontier

xAI shipped Grok 4.3 with materially better cost/performance, but mixed eval reception: Early chatter flagged an imminent API launch from @scaling01, followed by a detailed benchmark breakdown from Artificial Analysis. On their Intelligence Index, Grok 4.3 scores 53, up 4 points over Grok 4.20, with roughly 40% lower input and 60% lower output pricing. The biggest gain was on GDPval-AA, up 321 Elo to 1500, suggesting stronger real-world agentic task performance. It also hit 98% on τ²-Bench Telecom and held 81% on IFBench. The tradeoff: AA-Omniscience accuracy rose while non-hallucination dropped by 8 points, leaving concerns about reliability despite stronger capability. Arena has already added it across text, vision, document, and code modes via @arena.

Community reaction was split between “meaningful iteration” and “still behind top open models”: Several posts argued Grok is improving faster than critics admit, including @teortaxesTex, who noted token-efficiency gains as well, while others were more skeptical. @scaling01 claimed “Grok-4.3 still behind chinese open-source”, and Andon Labs reported a major regression on Vending-Bench 2, where Grok allegedly preferred to “sleep” rather than act. The more structural critique came from pricing and infra economics: @teortaxesTex argued Grok’s low prices may be subsidized by poor hardware utilization and that cache economics, not only model quality, increasingly determine agentic TCO.

DeepSeek V4 Pro, Vision/Spatial Reasoning, and Open-Weights Closing the Gap

DeepSeek V4 Pro appears to be the most credible open-weight coding/agent model in this batch: The strongest hands-on report came from @omarsar0, who tested DeepSeek-V4-Pro inside the Pi coding agent and described it as the first open-weight model that genuinely feels comparable to Codex or Claude Code for multi-turn agentic coding. Key systems details included 1M context, a hybrid CSA/HCA attention design, KV cache reduced to 10%, and nearly 4x lower inference FLOPs at long context. The report also emphasized practical harness fit: no custom setup, stable traces, and viable multi-step research/coding loops on Fireworks inference.

The broader benchmark picture confirms open weights are now much closer, though still behind on hardest tasks: Artificial Analysis noted that the three leading open-weight models released last week—Kimi K2.6, MiMo V2.5 Pro, and DeepSeek V4 Pro—now score 52–54 on the Intelligence Index, versus 57 for Gemini 3.1 Pro Preview and Claude Opus 4.7, and 60 for GPT-5.5. These top open models are all trillion-plus MoE systems with permissive licenses: Kimi at 1T/32B active, MiMo at 1T/42B active, and DeepSeek V4 Pro at 1.6T/49B active. The remaining gap is concentrated in HLE, CritPt, TerminalBench Hard, and hallucination-heavy Omniscience.

DeepSeek’s multimodal direction seems centered on explicit spatial grounding: Speculation about DeepSeek-Vision outperforming V4-Pro on ARC-AGI-2 because of actual spatial reasoning came from @teortaxesTex. A later summary of a briefly posted-and-deleted tech report from ZhihuFrontier described a multimodal CoT system that can “point while thinking” using boxes and points embedded directly into reasoning traces to reduce the “reference gap” in counting, maze solving, and path tracing. The stack reportedly uses DeepSeek-ViT, CSA compression, and V4-Flash (284B total / 13B active). Even if early tests still show weaknesses, it is a notable architectural bet: turning visual reasoning into explicit grounded computation rather than plain text description.

Codex’s Rapid Product Expansion vs Claude Code, Devin, and Other Agent Runtimes

Codex is winning on product velocity and UX polish, not just base model quality: A major theme across tweets was how quickly the Codex app is improving. High-engagement praise came from @gdb, @theo, and others comparing its feel favorably to alternatives. OpenAI added a device toolbar for responsive testing and improved browser-use speed by ~30% in “vibe testing,” per @JamesZmSun. It also added CI status in chat via @reach_vb, migration/import tooling for settings/plugins/agents via OpenAI, and a surprisingly viral pets system in Codex via @OpenAIDevs. While whimsical, the repeated point from users was that OpenAI is shipping a cohesive environment, not just a model endpoint.

Codex vs Claude Code is increasingly framed as UX + speed + taste tradeoffs: @theo summarized the current frontier coding vibe: GPT-5.5 is “smarter and can unblock you,” while Opus 4.7 has better intent/taste but can wander. In a second post, he argued Claude Code feels much slower on TTFT/TPS and requires more tool calls, while GPT/Codex feels more direct and economical for “fast mode” style use (tweet). Still, public benchmark comparisons are mixed: @scaling01 said GPT-5.5 did not beat Opus 4.7 on PostTrainBench in the Claude Code harness, highlighting how much results remain harness-dependent.

Other agent runtimes are converging on similar primitives: Devin launched “inside your shell” hotkey access via @cognition. Hermes added a /goal loop with a supervisor model forcing the agent to continue until completion, via @Teknium. Flue, introduced by @FredKSchott, positions itself as a TypeScript framework for headless autonomous agents, “like Claude Code but programmable.” The common pattern across these launches is that the competitive surface is moving from raw model IQ to agent harness design: subagents, browser-use, durable state, compaction, skills, and feedback loops.

Agent Infrastructure: Retrieval, Memory, HITL, and Durable Execution

The strongest research signal was that agent systems are bottlenecked by runtime design, not just model quality: Two especially useful papers were highlighted. First, ReaLM-Retrieve, summarized by @omarsar0, argues that reasoning models need retrieval during inference rather than only before it. It reports +10.1% absolute F1 over standard RAG and 47% fewer retrieval calls than fixed-interval IRCoT, with 3.2x lower per-retrieval overhead. Second, OCR-Memory, shared by @dair_ai, stores long-horizon trajectories as images with indexed anchors, retrieving exact prior content instead of lossy text summaries; it reports SOTA on Mind2Web and AppWorld under strict context limits.

LangChain/LangGraph pushed hard on production primitives for multi-user and human-in-the-loop agents: @sydneyrunkle outlined three concrete multi-user deployment concerns—data isolation, delegated credentials, and operator RBAC—and mapped each to LangSmith Agent Server features. Later posts covered a new HITL mode where a human reply can be returned directly as a tool result (tweet) and durable pause/resume semantics for consequential actions or unresolved judgment calls (tweet). This is a good snapshot of where real deployment complexity is moving: auth boundaries, persistent state, and explicit intervention points.

Durable execution is becoming a first-class runtime feature across stacks: Cloudflare announced Dynamic Workflows for adding durable execution to agent plans via @celso. LangChain positioned create_agent as the low-level primitive beneath Deep Agents, with extensibility for filesystems, bash, compaction, hooks, and subagents via @Vtrivedy10. The meta-point is consistent with one linked technical blog: the agent runtime itself—sandboxing, replay, checkpointing, orchestration—has become hidden technical debt and a major source of differentiation.

Research and Systems Papers Worth Bookmarking

Recursive / latent-space multi-agent coordination is emerging as a serious alternative to text-only agent chatter: @omarsar0 summarized Recursive Multi-Agent Systems, where agents communicate through shared latent recursive computation instead of full natural-language exchanges. Reported gains: 8.3% average accuracy improvement, 1.2x–2.4x end-to-end speedup, and 34.6%–75.6% token reduction across nine benchmarks. If agent-to-agent communication cost becomes dominant, this line of work matters.

Meta FAIR’s “self-improving pretraining” idea may be one of the more consequential training-time papers in the batch: @omarsar0 highlighted a method where a strong post-trained model rewrites pretraining suffixes toward safer, higher-quality continuations and then judges model rollouts during RL-style pretraining. Reported improvements include 36.2% relative gain in factuality, 18.5% in safety, and up to 86.3% win rate in generation quality over standard pretraining.

Microsoft’s synthetic long-horizon computer-use worlds look like a credible data recipe: @dair_ai described a system that creates 1,000 synthetic computers with realistic files and documents, then runs 8-hour agent simulations averaging 2,000+ turns. The thesis is straightforward and important: for computer-use agents, the bottleneck is no longer only model capability but scalable, realistic experiential data.

Top tweets (by engagement)

OpenAI/Codex momentum: OpenAI says GPT-5.5 is its strongest launch yet, with API revenue growing 2x faster than prior releases and Codex doubling revenue in under seven days.

Defense/government adoption: The U.S. “Department of War” CTO announced agreements with seven frontier AI and infrastructure companies to deploy capabilities on classified networks.

OpenAI messaging pivot on labor: Sam Altman: “we want to build tools to augment and elevate people, not entities to replace them”, with follow-up comments on jobs and future work here.

Codex adoption and delight: “codex app becoming incredible” from @gdb, plus Codex pets unexpectedly becoming one of the day’s biggest product-engagement hits.

Model benchmarking reality check: ARC Prize reports GPT-5.5 at 0.43% and Opus 4.7 at 0.18% on ARC-AGI-3, with analysis of failure modes.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

Qwen Model Developments and Benchmarks

PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090 (Activity: 339): The post introduces PFlash, a speculative prefill technique for long-context decoding on quantized 27B targets using C++/CUDA, achieving a 10x speedup over vanilla llama.cpp on an RTX 3090. This method leverages a small drafter model to score token importance, allowing the main model to focus only on significant spans, thus reducing prefill time significantly. The implementation combines insights from recent papers on speculative prefill and block-sparse attention, and is executed entirely in C++/CUDA without Python or PyTorch, making it efficient for consumer-grade GPUs like the RTX 3090. The repository is available on GitHub. Some commenters express skepticism about the claimed 10x speedup, with one noting the approach as potentially ‘super lossy’ due to its compression method. Another user reports out-of-memory issues on a 4090, indicating potential challenges in replicating the results.

randomfoo2 highlights a novel approach in PFlash that involves using a smaller Qwen3-0.6B drafter to process the full 64K/128K prompt with FlashPrefill/BSA-style sparse attention, which reduces the computational cost. The drafter evaluates token/span importance, retaining only a crucial subset for the 27B target model to prefill, followed by speculative decoding using DFlash+DDTree on the compressed target KV. This method is noted for being ‘super lossy,’ indicating potential trade-offs in accuracy for speed.

qwen_next_gguf_when raises concerns about the practicality of the PFlash method, noting that the DFlash component tends to run out of memory (OOM) on an RTX 4090. This suggests potential limitations in hardware compatibility or efficiency, which could impact the method’s replicability and scalability across different systems.

Obvious-Ad-2454 expresses skepticism about the claimed 10x speedup, suggesting it might be too optimistic without independent verification. This comment underscores the importance of replication studies to validate performance claims in machine learning, especially when such significant improvements are reported.

Qwen 3.6 27B vs Gemma 4 31B - making Packman game! (Activity: 994): In a local LLM gamedev contest, Gemma 4 31B outperformed Qwen 3.6 27B in creating a Pac-Man style game on a MacBook Pro M5 Max with 64GB RAM. Gemma processed 27 tokens/sec and completed the task in 3m 51s with 6,209 tokens, while Qwen processed 32 tokens/sec over 18m 04s with 33,946 tokens. Despite Qwen’s more creative and visually styled output, Gemma’s solution was shorter, clearer, and more logical, excelling in game logic, interaction handling, and performance stability. The task required generating a complete HTML-based game with procedural graphics and no external libraries, focusing on smooth gameplay and stable performance using requestAnimationFrame and delta time for animations. Commenters noted the humor in the prompt’s demand for ‘no bugs’ and questioned the utility of vague prompts, suggesting they primarily test a model’s pre-existing knowledge rather than its problem-solving ability.

Qwen 3.6 27B was tasked with creating a Pacman clone using a single HTML page and any libraries or graphics sources it deemed necessary. Interestingly, the model did not perform any external downloads or research, instead relying on its pre-existing knowledge to code the game. This highlights the model’s ability to generate functional code from minimal prompts, though it raises questions about the depth of its understanding and adaptability to new resources.

A user pointed out that the ghost enemy movement in the Gemma 4 31B version of the Pacman game appears to be malfunctioning. This suggests potential issues with the model’s ability to accurately implement game logic, particularly in handling dynamic elements like enemy AI, which is crucial for a game like Pacman.

The discussion raises concerns about the utility of using vague prompts for testing AI models, as noted by a commenter who described such prompts as “benchmaxxing tests.” This implies that the tests may not effectively evaluate the model’s problem-solving capabilities or its ability to adapt to new tasks, but rather assess its pre-existing knowledge base.

Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models (Activity: 437): The Qwen Team has released Qwen-Scope, a set of Sparse Autoencoders (SAEs) for the Qwen 3.5 models, ranging from 2B to 35B MoE. This tool maps internal features across all layers, functioning as a dictionary of the model’s internal concepts, allowing for precise manipulation of features such as ‘legal talk’ or ‘Python code’. Key functionalities include Surgical Abliteration to suppress specific features, Feature Steering to activate desired concepts, Model Debugging to identify token-triggered directions, and Dataset Analysis to verify feature activation. The tool is released under the Apache 2.0 license but with a caution against removing safety filters. A practical example includes diagnosing unexpected language switches using a heatmap to identify over-activated features. More details can be found in the Qwen-Scope paper and the Hugging Face Space. Commenters highlight the significance of this release, noting it as potentially the largest open-source interpretability tool for dense models, surpassing Google’s GemmaScope in scale. There is anticipation for future iterations, such as Qwen 3.6, to incorporate similar tools.

NandaVegg highlights the significance of the release of Sparse Autoencoders (SAEs) for the dense 27B Qwen model, noting it as potentially the largest open-source interpretability tool to date. This is in contrast to previous tools like GemmaScope, which only supported smaller models such as 9B and 2B, indicating a substantial advancement in model interpretability capabilities.

robert896r1 expresses anticipation for the release of Qwen 3.6 or community-driven adaptations of the current tools for

この記事をシェア

MIT ML News重要度42026年6月26日 22:00

LLM がロボットに曖昧な指示の理解と重要詳細への集中を支援

TLDR AI2026年6月25日 09:00

Perplexity の弁護士向けコンピューター（3 分読み）

TechCrunch AI2026年6月25日 02:16

Facebook がクリエイター向け AI コンパニオンアプリをリリース

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Latent Space·2026年5月2日 16:21·約25分

AI エンジニア世界博覧会、第 2 回スピーカー募集を開始

#Autoresearch #Agentic Commerce #World Models #Vertical AI #Robotics

TL;DR

AI深層分析2026年5月2日 18:28

重要/ 5段階

深度40%

キーポイント

新設された主要講演トラック

特定業界への垂直統合 AI の強化

ロボティクス展示スペースの無料提供

スタートアップ・バトルフィールドの開催

Grok 4.3 のコストパフォーマンス向上と評価の分裂

コミュニティ反応と構造的な批判

DeepSeek V4 Pro とオープンウェイトの進化

影響分析・編集コメントを表示

影響分析

編集コメント

image

Autoresearch: ハンズオンやモデルトレーニングにおける再帰的自己改善ループ！

Memory: ユーザーがエージェントやモデルを使用するにつれて、それらはどのように改善されているのでしょうか？

World Models: 空間知能と敵対的推論の問題をどのように解決しているのでしょうか？

Agentic Commerce: エージェントはデータ、API、および他のエージェントに対してどのように支払いを行っているのでしょうか？

image

AI Twitter レビュー

Grok 4.3 のリリース、ベンチマークの差分、そしてオープン vs クローズドの最前線

DeepSeek V4 Pro、ビジョン/空間推論、そしてオープンウェイトの格差縮小

Codex の急速な製品拡大 vs Claude Code, Devin, およびその他のエージェントランタイム

エージェントインフラストラクチャ：検索、メモリ、ヒューマン・イン・ザ・ループ（HITL）、および永続実行

ブックマークすべき研究およびシステム論文

エンゲージメント上位のツイート

モデルベンチマークの実態確認：ARC プライズは、ARC-AGI-3 において GPT-5.5 が 0.43%、Opus 4.7 が 0.18% と報告し、失敗モードの分析も含まれています。

AI Reddit レビュー

/r/LocalLlama + /r/localLLM レビュー

Qwen モデルの開発とベンチマーク

robert896r1 は、Qwen 3.6 のリリースまたは現在のツールのコミュニティ主導による適応版の公開を期待しています。

原文を表示

Autoresearch: recursive self improvement loops in harnesses and model training!

Tasteful Tokenmaxxing: as a company leader, how do you make your AI Eng teams 10x more AI-Native/scale AI adoption, BUT without Goodharting waste?

Memory: how are your agents/models improving as your users use them?

World Models: how are you solving spatial intelligence and adversarial reasoning?

Agentic Commerce: how are agents paying for data, APIs, and other agents?

Founders: a new Startup Battlefield event will be added where you can pitch your pre-series A company to our panel of top VCs and guest judges.

There are other new tracks, which you can find in the full application form (don’t constrain yourself to tracks, just submit your best work and we’ll find a place for you)

AI Twitter Recap

Grok 4.3’s Release, Benchmark Deltas, and the Open-vs-Closed Frontier

DeepSeek V4 Pro, Vision/Spatial Reasoning, and Open-Weights Closing the Gap

Codex’s Rapid Product Expansion vs Claude Code, Devin, and Other Agent Runtimes

Agent Infrastructure: Retrieval, Memory, HITL, and Durable Execution

Research and Systems Papers Worth Bookmarking

Top tweets (by engagement)

OpenAI/Codex momentum: OpenAI says GPT-5.5 is its strongest launch yet, with API revenue growing 2x faster than prior releases and Codex doubling revenue in under seven days.

Defense/government adoption: The U.S. “Department of War” CTO announced agreements with seven frontier AI and infrastructure companies to deploy capabilities on classified networks.

OpenAI messaging pivot on labor: Sam Altman: “we want to build tools to augment and elevate people, not entities to replace them”, with follow-up comments on jobs and future work here.

Codex adoption and delight: “codex app becoming incredible” from @gdb, plus Codex pets unexpectedly becoming one of the day’s biggest product-engagement hits.

Model benchmarking reality check: ARC Prize reports GPT-5.5 at 0.43% and Opus 4.7 at 0.18% on ARC-AGI-3, with analysis of failure modes.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap