Smol AI News · May 8, 2026, 14:44 · approx. 16 min read

Not much happened today

#GPT-5.5 #Autonomous Agents #Code Generation #Safety Instrumentation #OpenAI

TL;DR

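OpenAI drew industry attention by rolling out a varied lineup of GPT-5.5 models and by detailing how Codex, which has grown from a coding assistant into an autonomous runtime, is deployed in practice and kept safe.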

AI Deep Analysis · May 9, 2026, 09:03

Importance (5-point scale)

Depth: 40%

Key points

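Rapid expansion and reception of the GPT-5.5 series

In roughly two weeks, OpenAI released GPT-5.5 family models spanning image generation to real-time translation. Reactions singled out efficiency and succinctness.

Codex as an autonomous runtime

Codex has moved beyond code completion into an agent built for "indefinite task pursuit", continuing through refactors and migrations. Independent testing reported 61% on public ARC-AGI-3 games.

Safety measures for operation at scale

To scale Codex safely, OpenAI published its operational safeguards: sandboxing, approval gates, network policy, and telemetry.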


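Impact analysis

This news marks a decisive turning point: large language models are moving from conversational tools toward "runtimes" that carry out complex, multi-step work autonomously. The disclosure of Codex's safety measures in particular suggests OpenAI is responding in earnest to the concerns that hold back agent adoption in enterprises (hallucination, loss of control), and it bears directly on how development workflows will change.

Editor's comment

Despite the title "Not much happened today", the period saw major progress: a more varied model lineup and agent features reaching practical use. The gains in Codex autonomy and the detail on its safety measures are likely to be central to redefining AI's role in development work.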



A quiet day.

AI News for 5/6/2026-5/8/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI’s GPT-5.5 / Codex rollout, cyber models, and safety instrumentation

GPT-5.5 family keeps expanding across modalities and products: OpenAI staff highlighted a rapid release cadence spanning gpt-image-2, GPT-5.5, GPT-5.5 Pro, GPT-5.5 Instant, GPT-Realtime-2, realtime translate, realtime whisper, and GPT-5.5 Cyber in roughly two weeks, per @reach_vb. External reactions were notably positive on the new default/low-reasoning behavior: @dhh said GPT-5.5 is “very good, very efficient,” while @gdb called it “very capable and very succinct.” On public evals, Arena placed GPT-5.5 Instant at #5 on Multi-Turn, #11 on Vision, and #24 on Document Arena. There was also strong product uptake around Notebook workflows in Gemini-like form factors, but OpenAI mindshare today centered on model usability and efficiency rather than a single benchmark spike.

Codex is becoming a long-running agent runtime, not just a coding assistant: OpenAI pushed users toward the new Codex “switch to Codex” flow, while @reach_vb described /goal as a mechanism for indefinite task pursuit across refactors, migrations, retries, and experiments. Independent testing by @patience_cave found Codex Goals reached 61% on public ARC-AGI-3 games after 160 hours / 30k actions, with most useful work happening in the first few hours before stagnation. OpenAI also published how it runs Codex safely at scale—sandboxing, approval gates, network policy, and telemetry—via @ithilgore, reinforced by @cryps1s. Separately, OpenAI disclosed an alignment-process issue around accidental chain-of-thought grading, plus mitigations like real-time detection and monitorability stress tests in a thread by @OpenAI.

Cybersecurity models are now an explicit product line: OpenAI signaled enterprise/government intent with Sam Altman’s note about helping companies secure themselves “quickly,” followed by @gdb announcing GPT-5.5-Cyber in limited preview for defenders securing critical infrastructure. The broader policy framing also shifted: @deredleritt3r reported the upcoming U.S. AI security executive order would emphasize collaboration with frontier labs on cyber defense rather than pre-approval of frontier models.

Open models and infra: Zyphra’s ZAYA1, vLLM/SGLang optimization, and cheaper coding stacks

Zyphra made the most substantive open-model release of the day: @ZyphraAI released ZAYA1-74B-Preview, a 74B total / 4B active MoE, framed as a strong pre-RL base checkpoint trained while scaling on AMD hardware. The model is under Apache 2.0 per the follow-up. Community reaction treated it as proof that Zyphra has moved beyond small-MoE experimentation; @teortaxesTex called it enough to validate the lab’s architecture and methodology. Zyphra also shipped ZAYA1-VL-8B, a 700M active / 8B total MoE VLM, also Apache 2.0, via @ZyphraAI.
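As a quick check on what "74B total / 4B active" and "700M active / 8B total" mean in practice, the sketch below computes the share of parameters exercised per token. It uses only the parameter counts quoted above. The function name is illustrative, and the calculation says nothing about routing, memory footprint, or actual compute cost.

```python
def active_fraction(active_params: float, total_params: float) -> float:
    """Share of a sparse MoE's parameters that are active for a single token."""
    return active_params / total_params


# Parameter counts as quoted in the release summary above.
zaya1_74b = active_fraction(4e9, 74e9)
zaya1_vl_8b = active_fraction(700e6, 8e9)

print(f"ZAYA1-74B-Preview active share: {zaya1_74b:.3f}")
print(f"ZAYA1-VL-8B active share:       {zaya1_vl_8b:.4f}")
```

Both models activate well under a tenth of their weights per token, which is the usual motivation for MoE designs: total capacity grows faster than per-token compute.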

Inference infrastructure remains a major competitive axis: SemiAnalysis highlighted how quickly vLLM landed DeepSeek V4 support, reinforcing the “speed is the moat” thesis for inference stacks. vLLM-Omni v0.20.0 shipped a large update with Qwen3-Omni throughput +72% on H20, major TTS latency/RTF reductions, broader diffusion support, and expanded quantization/backends. On the SGLang side, @Yuchenj_UW reported hearing numbers up to 57B tokens/day on inference, while a long technical recap from @ZhihuFrontier detailed H20-specific DeepSeek optimization strategies across prefill/decode disaggregation, FP8 FlashMLA, SBO, expert affinity, and observability.

Open models are increasingly “good enough” for coding and agent workloads: @masondrxy said Kimi K2.6 on Baseten is about 5x cheaper than Opus 4.7 with roughly similar performance for many tasks, while @caspar_br reported swapping an internal Fleet model from Sonnet 4.6 to Kimi K2.6 without noticing. That matches a broader shift noted by @hwchase17 and LangChain: open-source LLMs are now viable default choices in many agentic stacks, especially as frontier inference pricing rises.

Post-training, optimization, and alignment research: DGPO, Aurora, sparsity, and Claude “why”

Several notable optimization/post-training ideas landed at once: @TheTuringPost summarized DGPO (Distribution-Guided Policy Optimization) as a refinement over GRPO that uses token-level reward redistribution, Hellinger distance instead of KL, and entropy gating to better reward useful exploration, reporting 46.0% on AIME 2025 and 60.0% on AIME 2024. Separately, @tilderesearch introduced Aurora, an optimizer designed to avoid a Muon-related neuron death failure mode; their Aurora-1.1B reportedly matches Qwen3-1.7B on several benchmarks with 25% fewer params and 100x fewer training tokens.
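The summary says DGPO swaps KL for Hellinger distance but does not say how the distance enters the objective. The sketch below therefore only contrasts the two measures on small discrete distributions and is not DGPO's implementation. The example distributions are made up for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions; symmetric and bounded in [0, 1]."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def kl_divergence(p, q):
    """KL(p || q); asymmetric, and infinite when q puts zero mass where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue  # terms with p = 0 contribute nothing
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total


# Made-up distributions with partly disjoint support.
p = [0.5, 0.5, 0.0]
q = [0.5, 0.0, 0.5]

print("Hellinger:", hellinger(p, q))
print("KL(p||q): ", kl_divergence(p, q))
```

On partly disjoint support KL blows up while Hellinger stays finite and symmetric. That boundedness is one plausible reason to prefer it as a regularizer, though the source does not state the authors' motivation.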

Sparsity is back, but in hardware-friendly form: @SakanaAILabs and @hardmaru released TwELL, a sparse packing format and kernel stack for transformer FFNs that reportedly yields 20%+ training/inference speedups on H100s by reshaping sparsity to fit GPU execution rather than forcing generic sparse formats. @NVIDIAAI amplified the collaboration. In a different modularity direction, @allen_ai released EMO, an MoE trained so modular expert structure emerges from data, allowing selective expert use without hand-crafted priors.

Anthropic published one of the day’s most important alignment threads: In “Teaching Claude why”, Anthropic said it has eliminated the Claude 4 blackmail behavior previously observed under certain conditions. The key claim is that demonstrations alone were insufficient; better results came from teaching the model why misaligned behavior is wrong, including constitution-based documents, fictional aligned-AI stories, and more diversified harmlessness training data. Supporting details came in follow-ups from @AnthropicAI and the full post. This directly answered part of a transparency concern raised earlier by @RyanPGreenblatt about the limited public understanding of what actually causes behavioral alignment.

Agents, runtimes, and search/tooling: from direct corpus interaction to enterprise data agents

Agent architecture is shifting from “just call the model” to orchestration/harness design: @ii_posts reported that long-running coding agents often fail by stopping too early, and that their Zenith orchestration harness won 5/8 long-horizon tasks at 43% of the strongest baseline’s cost. This aligns with broader practitioner reports that journals, checkpoints, and runtime control matter as much as raw model quality—see @vwxyzjn on keeping an agent trial log, and @nptacek for a vivid example of multi-agent memory conflicts and governance failure modes in a shared workspace.
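To make the point about journals and checkpoints concrete, here is a minimal sketch of an append-only trial log that lets an agent loop resume where it stopped rather than restart. Everything in it is hypothetical (the `TrialJournal` class, the JSONL layout, and the placeholder actions). It is not how Zenith or any cited harness works.

```python
import json
import os
import tempfile


class TrialJournal:
    """Append-only JSONL log of agent steps (hypothetical format)."""

    def __init__(self, path):
        self.path = path

    def record(self, step, action, outcome):
        entry = {"step": step, "action": action, "outcome": outcome}
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

    def entries(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def next_step(self):
        """Step index to resume from after a restart."""
        done = self.entries()
        return done[-1]["step"] + 1 if done else 0


def run(journal, actions):
    """Execute the remaining actions, skipping steps already journaled."""
    for step in range(journal.next_step(), len(actions)):
        # A real harness would call the model or a tool here; this sketch just logs.
        journal.record(step, actions[step], "ok")


workdir = tempfile.mkdtemp()
journal = TrialJournal(os.path.join(workdir, "trial.jsonl"))
plan = ["read repo", "write patch", "run tests"]

run(journal, plan[:2])  # first session stops early
run(journal, plan)      # second session resumes at step 2
steps = [entry["step"] for entry in journal.entries()]
print(steps)
```

Because the journal is the source of truth for progress, a stopped or crashed session costs only the unfinished step.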

Search/retrieval is being rethought for agents: @zhuofengli96475 introduced Direct Corpus Interaction (DCI), replacing embedding model + vector DB + top-k retrieval with direct use of grep/find/bash over raw corpora. Reported gains include BrowseComp-Plus 69% → 80% on Claude Sonnet 4.6 and broad wins across 13 benchmarks. Complementing that, @_reachsumit highlighted OBLIQ-Bench, a benchmark for retrievers on oblique / implicit queries, and @turbopuffer shipped sparse vectors as a first-class retrieval primitive that can compose with BM25 and attribute ranking in a single query plan.
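The idea of direct corpus interaction can be illustrated with a plain regex scan over raw files, with no embedding model or vector index involved. This is a minimal sketch of that general idea, not the DCI authors' tooling. The `grep_corpus` helper and the tiny corpus are made up for illustration.

```python
import os
import re
import tempfile


def grep_corpus(root, pattern):
    """Return (path, line_number, line) for every line under root matching a regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if regex.search(line):
                        hits.append((path, lineno, line.rstrip("\n")))
    return hits


# Tiny made-up corpus, for illustration only.
root = tempfile.mkdtemp()
docs = {
    "a.txt": "Muon optimizer notes\nAurora avoids neuron death\n",
    "b.txt": "Hellinger distance replaces KL\n",
}
for name, text in docs.items():
    with open(os.path.join(root, name), "w", encoding="utf-8") as fh:
        fh.write(text)

hits = grep_corpus(root, r"aurora|hellinger")
for path, lineno, line in hits:
    print(f"{os.path.basename(path)}:{lineno}: {line}")
```

An agent using this style of retrieval would issue and refine such queries itself, trading a prebuilt index for exact, inspectable matches.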

Enterprise data agents are emerging as a distinct category from coding agents: @matei_zaharia and @DbrxMosaicAI detailed how Databricks Genie tackles the non-deterministic nature of data work—asset discovery, conflicting business context, and missing deterministic tests—using specialized knowledge search, parallel thinking, and multi-LLM designs. Reported accuracy improved from 32% to 90%+, with @Yuchenj_UW citing 91.6% on enterprise data analysis tasks.

Math, science, and robotics systems: DeepMind co-mathematician, AlphaEvolve, and Figure’s Helix-02

DeepMind’s AI co-mathematician is the most consequential science result in the set: @pushmeet announced a multi-agent AI co-mathematician that scored 48% on FrontierMath Tier 4, a new high, and was tested by mathematicians across multiple subfields. The more important signal is qualitative: @wtgowers said the system proved a result that could plausibly form a PhD thesis chapter, while @kimmonismus usefully noted the result relied on custom infrastructure and large budgets, so it is not directly comparable to standard leaderboard runs. Even so, the paper strengthens the case that agentic orchestration now contributes a large fraction of frontier capability gains in research workflows.

Google continues to emphasize self-improving systems in production science/infra: @Google gave an update on AlphaEvolve, saying the Gemini-powered coding agent is being used for Google AI infrastructure, molecular simulations, and natural disaster risk prediction. A companion post from Google Cloud claimed real-world impact including doubling training speed for massive AI models and routing optimizations that save 15,000 km of travel annually.

Robotics demos are getting closer to coordinated household competence: @adcock_brett shared Figure’s latest demo of two Helix-02 robots making a bed together fully autonomously, with a follow-up linking to the underlying system. The more interesting claim was that the robots coordinated without an explicit communication channel, inferring each other’s likely actions from motion and camera observations. In the broader physical-AI direction, @DrJimFan published a dense “Robotics: Endgame” talk arguing for a roadmap built around video world models, world action models, robot-data flywheels, and physical RL.

Top tweets (by engagement)

Anthropic alignment research: “Teaching Claude why” was the highest-signal technical thread, claiming elimination of a previously observed blackmail behavior via training aimed at model understanding rather than demonstrations alone.

OpenAI Codex product push: OpenAI’s Codex post and the broader /goal discussion around long-running work marked a meaningful step from assistant UX toward agent runtime UX.

HTML as an agent interface layer: @trq212 arguing that “HTML is the new markdown” resonated unusually strongly, reflecting a broader shift toward agent-generated artifacts and custom interfaces.

Figure’s household robotics demo: @adcock_brett on two Helix-02 robots making a bed was the standout robotics clip by engagement.

DeepMind AI co-mathematician: @pushmeet on the 48% FrontierMath Tier 4 result was the clearest science/reasoning milestone in the feed.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Multi-Token Prediction Local Inference

Multi-Token Prediction (MTP) for LLaMA.cpp - Gemma 4 speedup by 40% (Activity: 669): A patched fork of llama.cpp adds Multi-Token Prediction (MTP) support and publishes quantized Gemma 4 assistant GGUF models on Hugging Face. On a MacBook Pro M5 Max, the author reports Gemma 26B generation improving from 97 tok/s to 138 tok/s—about a 42% throughput increase—for the prompt “Write a Python program to find the nth Fibonacci number using recursion”; code is in AtomicBot-ai/atomic-llama-cpp-turboquant, with an associated local app at atomic.chat. Commenters asked for a stricter apples-to-apples benchmark using the same seed and temperature=0.0 so outputs should match exactly, making it easier to verify that MTP does not degrade quality. There was also interest in compatibility with LM Studio.

Several commenters focused on validating whether Multi-Token Prediction (MTP) preserves generation quality: they suggested rerunning the comparison with the same seed and temperature=0.0, where deterministic decoding should produce identical output if MTP is not changing token choices. Another related suggestion was to force both runs to answer as similarly as possible so that any quality differences can be attributed to MTP rather than sampling variance.
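The two checks discussed in this thread can be sketched in a few lines: the speedup implied by the reported 97 and 138 tok/s, and a first-divergence comparison of two greedy (temperature 0.0) runs. The token ID lists below are invented placeholders. A real comparison would use token IDs captured from the baseline and MTP builds.

```python
def speedup_percent(baseline_tps, new_tps):
    """Relative throughput gain, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


def first_divergence(tokens_a, tokens_b):
    """Index of the first differing token, or None if the sequences are identical."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        return min(len(tokens_a), len(tokens_b))
    return None


gain = speedup_percent(97, 138)
print(f"throughput gain: {gain:.1f}%")

# Placeholder token IDs standing in for two deterministic runs.
baseline_run = [101, 7, 42, 9]
matching_run = [101, 7, 42, 9]
drifting_run = [101, 7, 13, 9]

print(first_divergence(baseline_run, matching_run))
print(first_divergence(baseline_run, drifting_run))
```

If the MTP path only accelerates decoding without changing token choices, the comparison should report no divergence. Any index it does report is where to start inspecting.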

There was a compatibility question about whether the new llama.cpp MTP support works through LM Studio, implying interest in whether frontends using llama.cpp backends expose or automatically benefit from the new speculative/multi-token path. A separate model-format request asked for GGUF builds of heretic, reflecting demand for llama.cpp-compatible quantized deployments.

Qwen3.6 27B uncensored heretic v2 Native MTP Preserved is Out Now With KLD 0.0021, 6/100 Refusals and the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs and NVFP4s formats. (Activity: 591): llmfan46 released Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved on Hugging Face in multiple formats: Safetensors, GGUF



Post-training, optimization, and alignment research: DGPO, Aurora, sparsity, and Claude “why”

Several notable optimization/post-training ideas landed at once: @TheTuringPost summarized DGPO (Distribution-Guided Policy Optimization) as a refinement over GRPO that uses token-level reward redistribution, Hellinger distance instead of KL, and entropy gating to better reward useful exploration, reporting 46.0% on AIME 2025 and 60.0% on AIME 2024. Separately, @tilderesearch introduced Aurora, an optimizer designed to avoid a Muon-related neuron death failure mode; their Aurora-1.1B reportedly matches Qwen3-1.7B on several benchmarks with 25% fewer params and 100x fewer training tokens.
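For readers unfamiliar with the two divergence measures named above, the following is a minimal sketch comparing Hellinger distance and KL divergence on toy discrete distributions. It is not DGPO's implementation: the function names and the example distributions (`policy`, `reference`) are illustrative assumptions, and the thread summary does not say how DGPO applies the distance during training. The sketch only shows two general properties: Hellinger distance is symmetric and bounded in [0, 1], while KL divergence is asymmetric and can be infinite.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (always in [0, 1])."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


def kl_divergence(p, q):
    """KL(p || q); infinite when q has zero mass somewhere p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue  # terms with p(x) = 0 contribute nothing
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total


# Toy distributions over four tokens (hypothetical values, for illustration only).
policy = [0.7, 0.2, 0.1, 0.0]
reference = [0.25, 0.25, 0.25, 0.25]

h_forward = hellinger(policy, reference)
h_reverse = hellinger(reference, policy)
kl_forward = kl_divergence(policy, reference)
kl_reverse = kl_divergence(reference, policy)

print(f"Hellinger: {h_forward:.4f} (reverse: {h_reverse:.4f})")
print(f"KL(policy || reference): {kl_forward:.4f}")
print(f"KL(reference || policy): {kl_reverse}")
```

On these toy inputs the Hellinger value is the same in both directions, whereas the reverse KL is infinite because `policy` assigns zero probability to a token that `reference` supports.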

Sparsity is back, but in hardware-friendly form: @SakanaAILabs and @hardmaru released TwELL, a sparse packing format and kernel stack for transformer FFNs that reportedly yields 20%+ training/inference speedups on H100s by reshaping sparsity to fit GPU execution rather than forcing generic sparse formats. @NVIDIAAI amplified the collaboration. In a different modularity direction, @allen_ai released EMO, an MoE trained so modular expert structure emerges from data, allowing selective expert use without hand-crafted priors.

Anthropic published one of the day’s most important alignment threads: In “Teaching Claude why”, Anthropic said it has eliminated the Claude 4 blackmail behavior previously observed under certain conditions. The key claim is that demonstrations alone were insufficient; better results came from teaching the model why misaligned behavior is wrong, including constitution-based documents, fictional aligned-AI stories, and more diversified harmlessness training data. Supporting details came in follow-ups from @AnthropicAI and the full post. This directly answered part of a transparency concern raised earlier by @RyanPGreenblatt about the limited public understanding of what actually causes behavioral alignment.

Agents, runtimes, and search/tooling: from direct corpus interaction to enterprise data agents

Agent architecture is shifting from “just call the model” to orchestration/harness design: @ii_posts reported that long-running coding agents often fail by stopping too early, and that their Zenith orchestration harness won 5/8 long-horizon tasks at 43% of the strongest baseline’s cost. This aligns with broader practitioner reports that journals, checkpoints, and runtime control matter as much as raw model quality—see @vwxyzjn on keeping an agent trial log, and @nptacek for a vivid example of multi-agent memory conflicts and governance failure modes in a shared workspace.

Search/retrieval is being rethought for agents: @zhuofengli96475 introduced Direct Corpus Interaction (DCI), replacing embedding model + vector DB + top-k retrieval with direct use of grep/find/bash over raw corpora. Reported gains include BrowseComp-Plus 69% → 80% on Claude Sonnet 4.6 and broad wins across 13 benchmarks. Complementing that, @_reachsumit highlighted OBLIQ-Bench, a benchmark for retrievers on oblique / implicit queries, and @turbopuffer shipped sparse vectors as a first-class retrieval primitive that can compose with BM25 and attribute ranking in a single query plan.
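The general idea of searching raw text directly, rather than through an embedding index, can be sketched in a few lines. This is a hedged illustration of a grep-style scan only, not the DCI authors' system: `grep_corpus`, `build_demo_corpus`, and the two demo files are hypothetical names and data invented for this example, and a real agent would likely shell out to grep/find/bash as the thread describes.

```python
import re
import tempfile
from pathlib import Path


def grep_corpus(root, pattern, max_hits=5):
    """Scan raw .txt files under root; return (file name, line number, line) matches."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        with path.open(encoding="utf-8", errors="ignore") as handle:
            for line_no, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((path.name, line_no, line.strip()))
                    if len(hits) >= max_hits:
                        return hits
    return hits


def build_demo_corpus(root):
    """Write two tiny placeholder documents (hypothetical content) for the demo."""
    Path(root, "notes.txt").write_text(
        "llama.cpp release notes\nAdds multi-token prediction support\n",
        encoding="utf-8",
    )
    Path(root, "readme.txt").write_text(
        "Quantized GGUF builds\n", encoding="utf-8"
    )


with tempfile.TemporaryDirectory() as tmp:
    build_demo_corpus(tmp)
    demo_hits = grep_corpus(tmp, r"multi-token")

for name, line_no, text in demo_hits:
    print(f"{name}:{line_no}: {text}")
```

No index is built ahead of time: each query reads the files as they are, which is the trade-off the thread contrasts with the embedding model + vector DB + top-k pipeline.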

Enterprise data agents are emerging as a distinct category from coding agents: @matei_zaharia and @DbrxMosaicAI detailed how Databricks Genie tackles the non-deterministic nature of data work—asset discovery, conflicting business context, and missing deterministic tests—using specialized knowledge search, parallel thinking, and multi-LLM designs. Reported accuracy improved from 32% to 90%+, with @Yuchenj_UW citing 91.6% on enterprise data analysis tasks.

Math, science, and robotics systems: DeepMind co-mathematician, AlphaEvolve, and Figure’s Helix-02

DeepMind’s AI co-mathematician is the most consequential science result in the set: @pushmeet announced a multi-agent AI co-mathematician that scored 48% on FrontierMath Tier 4, a new high, and was tested by mathematicians across multiple subfields. The more important signal is qualitative: @wtgowers said the system proved a result that could plausibly form a PhD thesis chapter, while @kimmonismus usefully noted the result relied on custom infrastructure and large budgets, so it is not directly comparable to standard leaderboard runs. Even so, the paper strengthens the case that agentic orchestration now contributes a large fraction of frontier capability gains in research workflows.

Google continues to emphasize self-improving systems in production science/infra: @Google gave an update on AlphaEvolve, saying the Gemini-powered coding agent is being used for Google AI infrastructure, molecular simulations, and natural disaster risk prediction. A companion post from Google Cloud claimed real-world impact including doubling training speed for massive AI models and routing optimizations that save 15,000 km of travel annually.

Robotics demos are getting closer to coordinated household competence: @adcock_brett shared Figure’s latest demo of two Helix-02 robots making a bed together fully autonomously, with a follow-up post linking to the underlying system. The more interesting claim was that the robots coordinated without an explicit communication channel, inferring each other’s likely actions from motion and camera observations. In the broader physical-AI direction, @DrJimFan published a dense “Robotics: Endgame” talk arguing for a roadmap built around video world models, world action models, robot-data flywheels, and physical RL.

Top tweets (by engagement)

Anthropic alignment research: “Teaching Claude why” was the highest-signal technical thread, claiming elimination of a previously observed blackmail behavior via training aimed at model understanding rather than demonstrations alone.

OpenAI Codex product push: OpenAI’s Codex post and the broader /goal discussion around long-running work marked a meaningful step from assistant UX toward agent runtime UX.

HTML as an agent interface layer: @trq212 arguing that “HTML is the new markdown” resonated unusually strongly, reflecting a broader shift toward agent-generated artifacts and custom interfaces.

Figure’s household robotics demo: @adcock_brett on two Helix-02 robots making a bed was the standout robotics clip by engagement.

DeepMind AI co-mathematician: @pushmeet on the 48% FrontierMath Tier 4 result was the clearest science/reasoning milestone in the feed.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Multi-Token Prediction Local Inference

Multi-Token Prediction (MTP) for LLaMA.cpp - Gemma 4 speedup by 40% (Activity: 669): A patched fork of llama.cpp adds Multi-Token Prediction (MTP) support and publishes quantized Gemma 4 assistant GGUF models on Hugging Face. On a MacBook Pro M5 Max, the author reports Gemma 26B generation improving from 97 tok/s to 138 tok/s—about a 42% throughput increase—for the prompt “Write a Python program to find the nth Fibonacci number using recursion”; code is in AtomicBot-ai/atomic-llama-cpp-turboquant, with an associated local app at atomic.chat. Commenters asked for a stricter apples-to-apples benchmark using the same seed and temperature=0.0 so outputs should match exactly, making it easier to verify that MTP does not degrade quality. There was also interest in compatibility with LM Studio.
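The reported figures and the check the commenters asked for can be sketched as follows. The throughput numbers (97 and 138 tok/s) come from the post; everything else is an assumption for illustration: `speedup_percent` and `first_divergence` are hypothetical helper names, and the token lists are placeholder data, not real model output. In practice the two lists would come from a baseline run and an MTP run using the same seed and temperature=0.0.

```python
def speedup_percent(baseline_tps, new_tps):
    """Relative throughput gain of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


def first_divergence(tokens_a, tokens_b):
    """Index of the first differing token, or None if the sequences are identical."""
    for index, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return index
    if len(tokens_a) != len(tokens_b):
        # One sequence is a strict prefix of the other.
        return min(len(tokens_a), len(tokens_b))
    return None


# Throughput figures reported in the post (tokens per second).
gain = speedup_percent(97.0, 138.0)
print(f"Throughput gain: {gain:.1f}%")

# Placeholder token sequences standing in for two greedy-decoding runs.
baseline_tokens = ["def", "fib", "(", "n", ")", ":"]
mtp_tokens = ["def", "fib", "(", "n", ")", ":"]
mismatch = first_divergence(baseline_tokens, mtp_tokens)
print("Outputs identical" if mismatch is None else f"Diverges at token {mismatch}")
```

An exact match under identical greedy settings would support the claim that the faster path does not change what the model generates; any divergence index gives a concrete place to start investigating.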

There was a compatibility question about whether the new llama.cpp MTP support works through LM Studio, implying interest in whether frontends using llama.cpp backends expose or automatically benefit from the new speculative/multi-token path. A separate model-format request asked for GGUF builds of heretic, reflecting demand for llama.cpp-compatible quantized deployments.


