Smol AI News · April 30, 2026, 14:44 · approx. 19 min read

Not much happened today

#LLM #Cybersecurity #Autonomous Agents #Computer Use #OpenAI

TL;DR

OpenAI's GPT-5.5 showed top-tier performance in cyber-attack simulations, and Codex expanded from coding into general computer use. Together these make AI agents considerably more practical.

AI deep analysis · May 1, 2026, 13:02

Importance (5-point scale)

Depth: 40%

Key points

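GPT-5.5 reaches the top tier of cyber capability

According to the UK AI Security Institute's evaluation, GPT-5.5 completed a multi-step cyber-attack simulation end-to-end and performed roughly on par with Claude Mythos Preview (a cited average pass rate of 71.4% vs 68.6%).

Codex evolves toward general-purpose computer use

OpenAI repositioned Codex as a tool "for everyone, for any task done with a computer," covering documents, slides, and spreadsheets. One post reported that Computer Use runs 42% faster after the update.

Stronger security features and a product launch

Alongside the GPT-5.5 results, OpenAI released Advanced Account Security for ChatGPT, which adds phishing-resistant sign-in and hardened recovery.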

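Impact analysis

This news marks a turning point: AI agents are moving from information-processing tools to systems that can act autonomously in real networks and desktop environments. GPT-5.5's cyber capability in particular will push the security industry to revisit its defensive strategies, and the Codex update is likely to have a near-term effect on the automation of white-collar work.

Editorial comment

Despite the headline "not much happened today," this was a significant day, with an industry-level shift in both the practical capability and the security of AI. GPT-5.5 entering the top tier on cyber-attack simulations is likely to intensify competition in AI security.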

a quiet day.

AI News for 4/29/2026-4/30/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI’s GPT-5.5, Codex expansion, and cyber capability evaluations

GPT-5.5 is now credibly in the top tier for long-horizon cyber tasks: the UK AI Security Institute reported that GPT-5.5 became the second model to complete one of its multi-step cyber-attack simulations end-to-end. Multiple follow-on posts highlighted rough parity with Claude Mythos Preview on this eval: @scaling01 cited a 71.4% average pass rate for GPT-5.5 vs 68.6% for Mythos, while @cryps1s noted that GPT-5.5 solved the TLO chain in 2/10 attempts vs Mythos’ 3/10. @polynoamial emphasized that performance was still improving past 100M tokens of inference budget, suggesting no obvious saturation yet. This materially changes the earlier narrative that Anthropic had a unique lead in offensive cyber automation. OpenAI also paired this moment with a product-side security release: Advanced Account Security for ChatGPT, which adds phishing-resistant sign-in and hardened recovery.
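
One way to see why 2/10 versus 3/10 reads as rough parity rather than a clear lead is to put a confidence interval around each success rate. The sketch below uses the standard Wilson score interval. The 95% level (z = 1.96) and the helper name `wilson_interval` are choices made here for illustration and are not part of the AISI evaluation.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

gpt_lo, gpt_hi = wilson_interval(2, 10)        # GPT-5.5: 2 of 10 TLO attempts
mythos_lo, mythos_hi = wilson_interval(3, 10)  # Mythos: 3 of 10 TLO attempts

# Two intervals overlap when each one starts before the other ends.
overlap = gpt_lo <= mythos_hi and mythos_lo <= gpt_hi
print(f"GPT-5.5: [{gpt_lo:.2f}, {gpt_hi:.2f}]  "
      f"Mythos: [{mythos_lo:.2f}, {mythos_hi:.2f}]  overlap={overlap}")
```

With only ten attempts per model the two intervals overlap substantially, so the TLO result alone does not separate the models.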

Codex is moving beyond coding into general computer work: OpenAI shipped a substantial Codex update framed explicitly as “for everyone, for any task done with a computer,” with the main announcement highlighting role-based onboarding, app connections, and workflows spanning docs, slides, spreadsheets, research, and planning. @ajambrosino summarized the update as dynamic task-specific UI, 20% faster computer/browser use, better slide/sheet handling, and less clunky handoffs, while @AriX called out that Computer Use runs 42% faster after the update. Sam Altman amplified the launch with “big upgrade for codex today! try it for non-coding computer work.” The broader pattern: OpenAI is productizing “computer-use agent” UX, not just model capability.

Benchmark deltas were incremental but economically meaningful: Artificial Analysis reported GPT-5.5 Pro as a slight new SOTA on CritPt over GPT-5.4 Pro, but the interesting point was not raw score—it achieved the bump with ~60% lower cost and token use on that frontier-science eval. That lines up with broader chatter that the GPT-5.5 family is less about a dramatic intelligence discontinuity than about stronger reliability and better efficiency in high-value workflows.

Open-weight model movement: Qwen3.6, Tencent Hy3-preview, Grok 4.3, and Ling 2.6 1T

Qwen3.6 27B looks like the most important open-weight release of the day: Artificial Analysis ranked Qwen3.6 27B as the new open-weights leader under 150B parameters with an Intelligence Index score of 46, ahead of Gemma 4 31B and prior Qwen variants. Key details: Apache 2.0, 262K context, native multimodal input, and BF16 weights small enough to fit on a single H100. The companion 35B A3B MoE scored 43, making it the strongest open model around 3B active parameters. The tradeoff is expensive inference-by-output-token: AA estimates Qwen3.6 27B used ~144M output tokens on the suite and is roughly 21× the cost of Gemma 4 31B to run there. Still, on capability-per-size it appears to be a notable step.

Tencent’s Hy3-preview is competitive but not class-leading: Artificial Analysis described Hy3-preview as a 295B total / 21B active MoE with 256K context and a restricted-commercial-use community license. It scored 42 on AA’s Intelligence Index, trailing recent open peers like Qwen3.6 27B, DeepSeek V4 Flash, and GLM-5.1. The most interesting bright spot was CritPt, where it matched GLM-5.1 at 4.6%, suggesting better-than-average scientific reasoning relative to its overall position.

xAI’s Grok 4.3 improved sharply on agentic benchmarks while getting cheaper: Artificial Analysis measured Grok 4.3 at 53 on the Intelligence Index, up four points from Grok 4.20 v2, with a major jump on GDPval-AA to 1500 Elo. AA also reported approximately 40% lower input price and 60% lower output price than the prior version. The release still trails GPT-5.5 on GDPval-AA by a wide margin, but it looks like a real systems-and-post-training improvement rather than a minor rev.

Ant Group’s Ling 2.6 1T targets cost-efficiency rather than frontier status: Artificial Analysis positioned Ling 2.6 1T as a 1T-parameter non-reasoning model scoring 34, with decent GPQA/HLE numbers and notably low benchmark-run cost at roughly $95. The caveat is reliability: AA reported a 92% hallucination rate on AA-Omniscience.

DeepSeek multimodal/vision work, GUI agents, and training scale speculation

DeepSeek’s multimodal direction appears tightly coupled to computer-use agents: @nrehiew_ highlighted that DeepSeek trains vision into V4-Flash by having the model directly output bounding boxes and point coordinates during reasoning, interpreting this as a computer-use-oriented design rather than generic VLM work. A second post argues that the paper’s “visual primitives” tasks map directly to browser/computer use rather than broad multimodal understanding. That framing matches parallel observations from @teortaxesTex that DeepSeek may be integrating vision weights back into the main V4 line rather than releasing a separate “V4-Flash-Vision”.

The repo disappearance became a story of its own: after release, several observers noted that DeepSeek’s “Thinking with Visual Primitives” repo vanished, including @teortaxesTex and @arjunkocher. No clear explanation emerged in these tweets, but the deletion drew more attention because the work suggested a concrete recipe for visual reasoning and GUI grounding.

Scaling chatter points to very large token counts for frontier pretraining: @teortaxesTex argued that >100T tokens is no longer unusual for frontier models and estimated a hypothetical 100T-token DeepSeek V4 as “V4 + 2 more epochs.” Separately, @nrehiew_ made a back-of-the-envelope estimate of ~150T tokens and ~9e25 pretraining FLOPs for a model with ~100B active parameters, suggesting a run feasible in roughly 14 days on an OpenAI-scale 100K GB200 cluster at conservative MFU. These are speculative takes, but useful as calibration for what “frontier-scale” now means in practice.
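
The figures in that estimate are consistent with the common rule of thumb of about 6 FLOPs per active parameter per training token. The sketch below reproduces the arithmetic. The per-GPU peak throughput (2.5e15 FLOP/s) and the 30% MFU are illustrative assumptions chosen here, not numbers from the tweets, and the function names are hypothetical.

```python
def pretraining_flops(active_params: float, tokens: float) -> float:
    """Rule-of-thumb training compute: ~6 FLOPs per active parameter per token."""
    return 6.0 * active_params * tokens

def training_days(total_flops: float, n_gpus: int,
                  peak_flops_per_gpu: float, mfu: float) -> float:
    """Wall-clock days given cluster size, per-GPU peak FLOP/s, and utilization."""
    sustained_flops_per_s = n_gpus * peak_flops_per_gpu * mfu
    return total_flops / sustained_flops_per_s / 86_400  # 86,400 seconds per day

# ~100B active parameters trained on ~150T tokens
flops = pretraining_flops(active_params=100e9, tokens=150e12)

# Assumed: 100K GPUs, 2.5e15 peak FLOP/s each, 30% MFU (all illustrative)
days = training_days(flops, n_gpus=100_000, peak_flops_per_gpu=2.5e15, mfu=0.30)

print(f"compute: {flops:.1e} FLOPs, duration: {days:.1f} days")
```

Under these assumptions the run lands close to the two-week figure quoted above. Halving the assumed MFU would double the duration, which is why such estimates are best read as order-of-magnitude calibration.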

Agent infrastructure, harness engineering, and collaborative agent systems

There is a clear shift from model-centric bragging to harness-centric engineering: Cursor published a strong note on how it tests and tunes its agent harness, focusing on runtime, evals, degradation repair, and model-specific customization rather than generic benchmark claims. @Vtrivedy10 explicitly connected Cursor’s writeup to design patterns converging across agent builders: bespoke prompts/tools per model, mixed offline+online evals, dogfooding, and treating the context window as the primary compute boundary.

LangChain continues to package deployment and multi-tenant agent infra: @hwchase17 introduced DeepAgents deploy, a config-driven cloud deployment flow via deepagents.toml, covering agent, sandbox, auth, and frontend sections. Related posts from LangChain staff detailed agent-server patterns for data isolation, delegated credentials, and RBAC in multi-user deployments. This is increasingly the boring-but-important layer turning demos into enterprise software.

Collaborative multi-agent workspaces are getting more concrete: @cmpatino_ introduced Agent Collabs, using Hugging Face buckets plus Spaces as a shared backend for swarms of heterogeneous agents to exchange messages, artifacts, and progress. The noteworthy idea is not just “agents collaborating,” but lightweight coordination primitives that let weaker agents contribute useful validation work while better-resourced agents handle expensive experiments.

Security, supply chain, and account hardening

Open-source package compromise remains an acute operational risk: Socket reported that the popular PyPI package lightning was compromised in versions 2.6.2 and 2.6.3, with malicious code executing on import, downloading Bun, and running an 11 MB obfuscated JavaScript payload aimed at credential theft. @theo connected that incident with additional package compromises (intercom-client on npm) and a Linux zero day, arguing the tempo of software supply-chain attacks is increasing.
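
Because the malicious code reportedly runs on import, a check for the affected versions should avoid importing the package at all. The sketch below reads the installed version from package metadata instead. The helper names are hypothetical, and the version list only reflects the two releases named above, so it should be checked against the current advisory before use.

```python
from importlib import metadata

# Versions named as compromised in the Socket report cited above.
COMPROMISED_VERSIONS = {"2.6.2", "2.6.3"}

def is_compromised(version: str) -> bool:
    """True if the version string matches one of the reported releases."""
    return version.strip() in COMPROMISED_VERSIONS

def installed_lightning_version():
    """Return the installed version from metadata, without importing the package."""
    try:
        return metadata.version("lightning")
    except metadata.PackageNotFoundError:
        return None

version = installed_lightning_version()
if version is None:
    print("lightning is not installed in this environment")
elif is_compromised(version):
    print(f"lightning {version} is one of the reported versions; do not import it")
else:
    print(f"lightning {version} is not one of the reported versions")
```

Reading metadata does not execute the package's own modules, which is the point of this approach. The same pattern applies to lockfile audits in CI.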

Security scanners are becoming first-class AI products: Anthropic rolled out Claude Security, described by @kimmonismus and later @_catwu as a repo vulnerability scanner that validates findings and suggests fixes, powered by Opus 4.7. Cursor shipped a parallel offering with Cursor Security Review, including always-on PR review and scheduled codebase scans. This is one of the clearest examples of model vendors moving directly into established devsecops categories.

Top tweets (by engagement)

OpenAI Codex broadens into general knowledge work: OpenAI’s Codex announcement and Sam Altman’s follow-up were the day’s biggest product posts, signaling a strategic push from “coding agent” to “computer-use agent”.

GPT-5.5’s cyber eval result mattered: UK AISI’s thread was one of the highest-engagement technical posts and reshaped comparisons with Anthropic’s Mythos.

Qwen shipped interpretability tooling, not just models: Qwen-Scope, an open suite of sparse autoencoders for Qwen models, stood out as a rare release focused on feature steering, debugging, data synthesis, and evaluation rather than raw model weights.

Anthropic published a large-scale guidance/sycophancy study: their analysis of 1M Claude conversations tied behavioral research directly to training changes for Opus 4.7 and Mythos Preview, an important sign that post-training loops are becoming more productized and data-informed.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. AMD Ryzen 395 Box and Halo Box Launch

AMD in-house Ryzen 395 box coming in June (Activity: 1061): An image from the AMD AI Dev Day presentation shows the upcoming AMD Ryzen 395 box, which is expected to be released in June. The device has 128GB of unified memory and is claimed to run 200-billion-parameter models natively, using what the presentation calls "Ryzen AI Max." A mention in the presentation suggests that Lenovo manufactures the product. However, an engineer confirmed that the unit is essentially a Ryzen 395 with 128GB and no other changes. Commenters are skeptical that running a 200-billion-parameter model on 128GB of unified RAM is practical once operating system overhead is accounted for.

obiwanfatnobi raises a technical point about the feasibility of running a '200B model' on a system with '128GB unified RAM'. They highlight that even with Linux, the usable VRAM would be around '116GB', which may not be sufficient for such large models, suggesting potential limitations in current hardware configurations for AI workloads.

promethe42 compares the new AMD Ryzen 395 box to a 'Framework Desktop', noting that it seems to be released '12 months later'. They suggest that AMD should prioritize improving their 'drivers/ROCm' before releasing new hardware, indicating that software support might be lagging behind hardware advancements.

DaniyarQQQ comments on the need for '512GB of unified memory', implying that current memory capacities may be insufficient for modern computing demands, particularly in high-performance or AI applications. This suggests a trend towards increasing memory requirements in cutting-edge technology.
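
The memory concern raised in these comments can be checked with simple arithmetic on weight storage alone. The sketch below is a rough estimate, not a benchmark: the bit widths are illustrative assumptions, the function name is hypothetical, and KV cache and activation memory are ignored, so real headroom would be smaller.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Memory for model weights only, in decimal GB (no KV cache or activations)."""
    return params * bits_per_weight / 8 / 1e9

USABLE_GB = 116   # usable memory figure quoted by obiwanfatnobi
PARAMS = 200e9    # the "200B model" from the presentation claim

for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    need = weight_memory_gb(PARAMS, bits)
    verdict = "fits" if need <= USABLE_GB else "does not fit"
    print(f"{label}: {need:.0f} GB of weights, {verdict} in {USABLE_GB} GB")
```

Under these assumptions only an aggressive quantization of around 4 bits per weight leaves the weights under the quoted usable figure, and with little room left for context.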

AMD Halo Box (Ryzen 395 128GB) photos (Activity: 467): The AMD Halo Box, featuring a Ryzen 395 processor and 128GB of RAM, was showcased running Ubuntu. The unit includes a programmable light strip, enhancing its customization capabilities. However, it lacks a CD-ROM drive and does not feature a fast port for clustering, which may limit its use in certain high-performance computing scenarios. Commenters noted the absence of a CD-ROM and a fast port for clustering as potential drawbacks, indicating that while the device is compact, these omissions could affect its utility in specific technical applications.

OnkelBB points out the lack of a fast port for clustering in the AMD Halo Box, which could limit its use in high-performance computing environments where fast interconnects are crucial for scaling across multiple nodes.

FoxiPanda highlights a common request for increased memory bandwidth in AMD products, suggesting that current offerings may not meet the demands of memory-intensive applications. This is a critical factor for workloads that require rapid data access and processing.

Stepfunction notes that the AMD Halo Box is a small form factor computer, which implies potential constraints on expandability and cooling, but also benefits in terms of space efficiency and portability.

2. Qwen Model Innovations and Applications

Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models (Activity: 393): Qwen-Scope is a newly released collection of Sparse Autoencoders (SAEs) for the Qwen 3.5 models, ranging from 2B to 35B MoE, designed to map internal features across all layers. This tool acts as a dictionary of the model's internal concepts, allowing for precise interventions such as Surgical Abliteration to suppress specific features like refusal, Feature Steering to activate desired concepts, and Model Debugging to identify token-triggered internal directions. The release is under the Apache 2.0 license, but the Qwen team advises against using it to remove safety filters. The tool is demonstrated in a Space demo and detailed in a technical paper. Commenters highlight the significance of this release as potentially the largest open-source interpretability tool for a dense 27B model, contrasting it with Google's smaller GemmaScope variants. There is anticipation for similar tools for future model iterations like Qwen 3.6.
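
To make the intervention types concrete, here is a toy sparse autoencoder in plain Python. It is a generic sketch of the technique, not the Qwen-Scope API or its weights: the two-dimensional hidden state, the three features, and all matrices are made-up values for illustration, whereas real SAEs learn thousands of features from model activations.

```python
def relu(x: float) -> float:
    return max(0.0, x)

# Hypothetical toy dictionary: 2-dim hidden state, 3 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one encoder row per feature
B_ENC = [0.0, 0.0, -1.5]                      # per-feature bias (sets the firing threshold)
W_DEC = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one decoder direction per feature

def encode(h):
    """Feature activations: ReLU(W_enc · h + b_enc)."""
    return [relu(sum(w * x for w, x in zip(row, h)) + b)
            for row, b in zip(W_ENC, B_ENC)]

def steer(h, feature, alpha):
    """Feature steering: push the hidden state along one decoder direction."""
    return [x + alpha * d for x, d in zip(h, W_DEC[feature])]

def ablate(h, feature):
    """Feature ablation: subtract that feature's current contribution."""
    a = encode(h)[feature]
    return [x - a * d for x, d in zip(h, W_DEC[feature])]

h = [1.0, 1.0]
print("activations:", encode(h))
print("steered along feature 0:", steer(h, 0, 2.0))
print("feature 2 ablated:", ablate(h, 2))
print("feature 2 after ablation:", encode(ablate(h, 2))[2])
```

In a real model the same operations would be applied to a layer's residual-stream activations during the forward pass. Debugging amounts to inspecting which features `encode` reports as active for a given token.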

NandaVegg highlights the significance of the release of Sparse Autoencoders (SAEs) for the dense 27B Qwen model, noting it as potentially the largest open-source interpretability tool available. This contrasts with previous tools like GemmaScope, which only supported smaller models such as 9B and 2B, indicating a substantial advancement in model interpretability capabilities.

robert896r1 expresses anticipation for the release of similar tools for Qwen 3.6, suggesting that the community might adapt existing tools for newer iterations. This reflects a common t

この記事をシェア

Smol AI News重要度42026年5月1日 14:44

本日は特に目立った出来事なし

Smol AI News重要度42026年4月29日 14:44

本日は特に目立った出来事なし

Simon Willison Blog重要度52026年6月27日 02:10

OpenAI、GPT-5.6 シリーズの限定プレビューを開始

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Smol AI News·2026年4月30日 14:44·約19分

本日は特に目立った出来事なし

#LLM #Cybersecurity #Autonomous Agents #Computer Use #OpenAI

TL;DR

AI深層分析2026年5月1日 13:02

重要/ 5段階

深度40%

キーポイント

GPT-5.5 のサイバー防御・攻撃能力の頂点到達

Codex の汎用コンピュータ操作への進化

セキュリティ機能の強化と製品発表

GPT-5.5 の性能向上に合わせ、ChatGPT にフィッシング耐性のあるサインインや強化された復旧機能を備えた「Advanced Account Security」がリリースされた。

影響分析・編集コメントを表示

影響分析

編集コメント

静かな一日。

AI Twitter リキャップ

OpenAI の GPT-5.5、Codex 拡張、およびサイバー能力評価**

GPT-5.5 は現在、長期にわたるサイバータスクにおいて信頼できるトップティアに位置しています。英国 AI セキュリティ研究所（UK AI Security Institute）は、GPT-5.5 が同機関が実施する多段階のサイバー攻撃シミュレーションをエンドツーエンドで完了した 2 つ目のモデルであると報告しました。また、複数のフォローアップ投稿では、この評価において Claude Mythos Preview とほぼ同等のパフォーマンスを示したことが強調されました。@scaling01 は、GPT-5.5 の平均パス率が 71.4% であるのに対し、Mythos は 68.6% であると引用しました。一方、@cryps1s は、GPT-5.5 が TLO チェーンを 10 回の試行のうち 2 回で解決したのに対し、Mythos は 3 回であったと指摘しました。@polynoamial は、パフォーマンスが推論予算 1 億トークンを超えてもまだ向上しており、明らかな飽和点は見られないことを強調しました。これは、Anthropic が攻撃的なサイバー自動化において独自のリードを持っていたという以前のナラティブを materially（実質的に）変えるものです。OpenAI はまた、このタイミングに合わせて製品側のセキュリティリリースとして ChatGPT の「高度なアカウントセキュリティ」を発表し、フィッシング耐性のあるサインインと強化されたリカバリー機能を追加しました。

Codex はコーディングから一般的なコンピュータ作業へと領域を拡大しています：OpenAI は「すべての人向け、コンピュータで行うあらゆるタスク向け」と明確に位置づけた大規模な Codex アップデートをリリースしました。主な発表では、役割ベースのオンボーディング、アプリとの接続、ドキュメント、スライド、表計算、リサーチ、計画にまたがるワークフローが強調されました。@ajambrosino はこのアップデートを「動的なタスク固有の UI」、「コンピュータ/ブラウザの使用速度が 20% 向上」、「スライドや表の処理能力の強化」、「ぎこちない引き継ぎの減少」と要約しました。一方、@AriX はアップデート後に Computer Use（コンピュータ使用）の処理速度が 42% 向上したと指摘しています。Sam Altman も「今日 Codex に大規模アップグレード！コーディング以外のコンピュータ作業にも試してください」と投稿し、その発表を後押ししました。より広い文脈として、OpenAI は単なるモデル能力ではなく、「Computer Use（コンピュータ使用）エージェント」の UX を製品化しているという傾向が見られます。

ベンチマークにおける差分は限定的でしたが、経済的な意味合いは大きかったです：Artificial Analysis によると、GPT-5.5 Pro は CritPt において GPT-5.4 Pro よりわずかに新しい SOTA（State of the Art）を達成しましたが、興味深い点はスコアそのものではなく、この最先端科学評価において約 60% のコストとトークン使用量を削減しながら向上を実現したことです。これは、GPT-5.5 ファミリーが劇的な知能の飛躍というよりは、高価値なワークフローにおける信頼性の強化と効率性の向上に重点を置いているという広範な議論と一致しています。

オープンウェイトモデルの動向：Qwen3.6、Tencent Hy3-preview、Grok 4.3、および Ling 2.6 1T

Qwen3.6 27B は、本日発表されたオープンウェイトモデルの中で最も重要なリリースのようです。Artificial Analysis により、Qwen3.6 27B は 150B パラメータ未満の領域で新たなオープンウェイトリーダーにランクされ、インテリジェンス指数スコアは 46 を記録しました。これは Gemma 4 31B や以前の Qwen バリアントを上回る結果です。主な特徴は以下の通りです：Apache 2.0 ライセンス、262K のコンテキスト長、ネイティブのマルチモーダル入力対応、そして単一の H100 GPU に収まるほど軽量な BF16（半精度浮動小数点）重みです。 companion モデルである 35B A3B MoE はスコア 43 を記録し、約 3B のアクティブパラメータを持つオープンモデルの中で最強の位置を占めています。ただし、出力トークンあたりの推論コストが高いというトレードオフがあります。Artificial Analysis の試算によると、Qwen3.6 27B は評価スイート全体で約 144M の出力トークンを消費しており、Gemma 4 31B を同環境で実行する際の費用の約 21 倍に相当します。それでも、サイズあたりの能力という観点では、これは注目すべき進歩であると言えます。

Tencent の Hy3-preview は競争力がありますが、クラスをリードするレベルではありません。Artificial Analysis は Hy3-preview を、総パラメータ数 295B、アクティブパラメータ数 21B の MoE（Mixture of Experts）アーキテクチャとし、コンテキスト長は 256K、コミュニティライセンスでは商用利用が制限されていると説明しています。Artificial Analysis のインテリジェンス指数でのスコアは 42 で、直近のオープンな競合モデルである Qwen3.6 27B、DeepSeek V4 Flash、GLM-5.1 に後れをとっています。最も興味深い明るい点は CritPt（科学推論評価）で、そこでは GLM-5.1 と同率の 4.6% を記録しており、全体の位置づけと比較して平均以上の科学的推論能力を示唆しています。

xAI の Grok 4.3 はエージェントベンチマークにおいて劇的に改善し、かつコストも低下しました。Artificial Analysis による測定では、Grok 4.3 のインテリジェンス指数は 53 で、Grok 4.20 v2 より 4 ポイント上昇しており、GDPval-AA では 1500 Elo と大幅な飛躍を遂げました。また、AA は前バージョンと比較して入力価格が約 40%、出力価格が約 60% 低下したと報告しています。リリース版は GDPval-AA において GPT-5.5 にまだ大きく遅れをとっていますが、これは単なるマイナーな改訂ではなく、システム全体およびポストトレーニングにおける本格的な改善であるように見えます。

Ant Group の Ling 2.6 1T は最先端性能よりもコスト効率性を重視したモデルです。Artificial Analysis は、Ling 2.6 1T をパラメータ数 1T の推論非対応モデルとして位置づけ、スコアは 34 と評価しています。GPQA や HLE の数値も妥当であり、ベンチマーク実行コストが約 95 ドルと非常に低いことが特徴です。ただし信頼性には注意が必要です。AA は AA-Omniscience におけるハルシネーション（幻覚）発生率が 92% に達すると報告しています。

DeepSeek のマルチモーダル/ビジョン研究、GUI エージェント、およびトレーニング規模に関する推測

DeepSeek のマルチモーダル方向性は、コンピューター使用エージェントと密接に連動しているように見えます：@nrehiew_ は、DeepSeek が推論中にモデルが直接バウンディングボックスやポイント座標を出力することで V4-Flash にビジョン能力を訓練していると指摘し、これは汎用的な VLM（Vision Language Model）の取り組みではなく、コンピューター使用指向の設計であると解釈しています。もう一つの投稿では、論文における「ビジュアルプリミティブ」タスクが広範なマルチモーダル理解ではなく、ブラウザやコンピューターの直接使用に直接対応すると主張されています（リンク）。この枠組みは、@teortaxesTex による並行する観察とも一致しており、DeepSeek が別個の「V4-Flash-Vision」をリリースするのではなく、ビジョン重みをメインの V4 ラインに統合している可能性があります。

リポジトリの消失自体が一つの物語となりました：リリース後、@teortaxesTex や @arjunkocher を含む複数の観察者が、DeepSeek の「Thinking with Visual Primitives」リポジトリが消去されたことに気づきました。これらのツイートで明確な説明は出てきませんでしたが、この作業が視覚的推論や GUI グラウンディングのための具体的なレシピを示唆していたため、削除によりさらに注目を集めました。

スケーリングに関する議論は、先端的な事前トレーニングにおける非常に大きなトークン数を示唆しています：@teortaxesTex は、100T トークンを超える数が先端モデルではもはや珍しくないと主張し、仮想的な 100T トークンの DeepSeek V4 を「V4 にさらに 2 エポック分追加したようなもの」と見積もりました。一方、@nrehiew_ は、約 100B のアクティブパラメータを持つモデルに対して、約 150T トークンと約 9e25 の事前トレーニング FLOPs（浮動小数点演算）を概算し、保守的な MFU（モデルフロップス利用率）で OpenAI スケールの 100K GB200 クラスター上で約 14 日程度の実行が可能であることを示唆しました。これらは推測に基づく見解ですが、「先端規模」が実務において何を意味するのかを調整するための指標として有用です。

エージェントインフラ、ハーンエンジニアリング、および協調型エージェントシステム

モデル中心の自慢から、ハーン中心のエンジニアリングへの明確な転換が見られます：Cursor は、ランタイム、評価（evals）、劣化修復、モデル固有のカスタマイズに焦点を当てた汎用的なベンチマーク主張ではなく、どのようにエージェントハーンをテスト・調整しているかについて強力なノートを発表しました。@Vtrivedy10 は、Cursor の記述がエージェント構築者間で収束する設計パターンと明確に結びついていることを指摘しました：モデルごとの専用プロンプト/ツール、オフラインとオンラインを組み合わせた評価（evals）、社内での実利用（dogfooding）、そしてコンテキストウィンドウを主要な計算リソースの境界として扱うという点です。

LangChain は引き続き、デプロイとマルチテナント型エージェントインフラのパッケージ化を進めています：@hwchase17 が DeepAgents Deploy を紹介しました。これは deepagents.toml による設定駆動型のクラウドデプロイフローで、エージェント、サンドボックス、認証、フロントエンドの各セクションをカバーします。LangChain スタッフからの関連投稿では、データ分離、委任された資格情報、マルチユーザー環境における RBAC（ロールベースアクセス制御）を実現するエージェントサーバーパターンが詳しく解説されました（例）。これは、デモを企業向けソフトウェアへと転換させる、地味だが重要なレイヤーとしてますます重要視されています。

協調型マルチエージェントワークスペースはより具体化されつつあります：@cmpatino_ が Agent Collabs を紹介しました。これは Hugging Face のバケットと Spaces を組み合わせた共有バックエンドを用いて、多様なエージェントの群れがメッセージ、成果物、進捗状況を交換できるようにするものです。注目すべき点は単に「エージェント同士の協力」にあるのではなく、リソース不足のエージェントが有用な検証作業に参加でき、リソース豊富なエージェントが高コストな実験を処理できるような、軽量な調整プリミティブ（協調の基礎要素）を実現している点にあります。

セキュリティ、サプライチェーン、アカウント強化

オープンソースパッケージの乗っ取りは依然として切実な運用リスクです：Socket によると、人気の高い PyPI パッケージ「lightning」のバージョン 2.6.2 と 2.6.3 で乗っ取りが発生し、インポート時に悪意のあるコードが実行され、Bun をダウンロードして、認証情報の窃取を目的とした 11 MB の難読化された JavaScript ペイロードを実行しました。@theo はこの事案と追加のパッケージ乗っ取り（npm 上の intercom-client）および Linux のゼロデイ脆弱性を関連付け、ソフトウェアサプライチェーン攻撃のテンポが加速しているとの見解を示しています。

セキュリティスキャナが第一級の AI プロダクトへと進化しています：Anthropic は Claude Security をリリースしました。@kimmonismus 氏や後に @_catwu 氏が指摘したように、これは Opus 4.7 を基盤としたリポジトリ脆弱性スキャナで、発見された問題を検証し修正を提案する機能を持っています。Cursor も並行して Cursor Security Review を提供しており、常時オン状態の PR（Pull Request）レビューやスケジュールされたコードベーススキャンが含まれています。これはモデルベンダーが確立された DevSecOps カテゴリに直接参入した最も明確な事例の一つです。

エンゲージメント上位ツイート

OpenAI Codex が一般知識作業へと領域を拡大：OpenAI の Codex 発表と Sam Altman 氏の続報が、当日最大の製品関連投稿となり、「コーディングエージェント」から「コンピューター使用エージェント」への戦略的転換を示唆しました。

GPT-5.5 のサイバー評価結果の重要性：UK AISI（英国 AI セキュリティ研究所）のスレッドは、最もエンゲージメントの高い技術系投稿の一つであり、Anthropic の Mythos との比較を再構築するものとなりました。

Qwen はモデルだけでなく解釈性ツールも提供：Qwen モデル向けのスパースオートエンコーダからなるオープンなスイート「Qwen-Scope」は、生モデル重みではなく、機能制御（feature steering）、デバッグ、データ合成、評価に焦点を当てた稀なリリースとして際立っていました。

Anthropic が大規模なガイダンス/迎合性に関する研究を発表：100 万件の Claude 会話分析を通じて、行動研究が Opus 4.7 や Mythos Preview のトレーニング変更と直接結びつけられたことは、ポストトレーニングループがよりプロダクト化され、データ駆動型になりつつある重要な兆候です。

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. AMD Ryzen 395 Box and Halo Box Launch

AMD in-house Ryzen 395 box coming in June (Activity: 1061): An image from the AMD AI Dev Day presentation shows the upcoming AMD Ryzen 395 box, which is expected to be released in June. The device has 128GB of unified memory and is claimed to run 200-billion-parameter models natively, using what the presentation calls "Ryzen AI Max." A mention in the presentation suggests the product may be manufactured by Lenovo. However, an engineer confirmed that the unit is essentially a Ryzen 395 with 128GB of memory and no other changes. Commenters are skeptical about the practicality of running a 200-billion-parameter model in 128GB of unified RAM, questioning its feasibility given the memory constraints even when accounting for operating system overhead.
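The commenters' skepticism can be made concrete with a rough weights-only estimate. The sketch below assumes a dense model, counts only the bytes needed to store the weights, and treats 128GB as 128 × 10^9 bytes. The post does not say what precision or architecture AMD's claim assumes, so the bit widths here are illustrative.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def fits(params_billion: float, bits_per_weight: float, memory_gb: float) -> bool:
    """True if the weights alone fit; ignores KV cache, activations, and the OS."""
    return weight_footprint_gb(params_billion, bits_per_weight) <= memory_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_footprint_gb(200, bits)
        print(f"200B @ {bits}-bit: {size:.0f} GB, fits in 128 GB: {fits(200, bits, 128)}")
```

Under these assumptions only the 4-bit case fits, at about 100 GB of weights, which leaves roughly 28 GB for the KV cache, activations, and the operating system.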

promethe42 compares the new AMD Ryzen 395 box to a "Framework Desktop," noting that it appears to be arriving about "12 months later." They suggest that AMD should prioritize improving its "drivers/ROCm" (Radeon Open Compute Platform) before releasing new hardware, indicating that software support may be lagging behind hardware advances.

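DaniyarQQQ comments on the need for "512GB of unified memory," implying that current memory capacities may be insufficient for modern computing demands, particularly in high-performance or AI applications. This suggests a trend toward increasing memory requirements in cutting-edge technology.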

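AMD Halo Box (Ryzen 395 128GB) photos (Activity: 467): The AMD Halo Box, with a Ryzen 395 processor and 128GB of RAM, was shown running Ubuntu. The unit includes a programmable light strip for customization. However, it has no CD-ROM drive and no fast port for clustering, which may limit its use in certain high-performance computing scenarios. Commenters noted the absence of a CD-ROM drive and a fast clustering port as potential drawbacks, indicating that while the device is compact, these omissions could affect its utility in specific technical applications.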

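FoxiPanda highlights a common request for more memory bandwidth in AMD products, suggesting that current offerings may not meet the demands of memory-intensive applications. This is a critical factor for workloads that require rapid data access and processing.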

Stepfunction notes that the AMD Halo Box is a small form factor computer, which implies potential constraints on expandability and cooling but also benefits in space efficiency and portability.

2. Qwen Model Innovations and Applications

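Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models (Activity: 393): Qwen-Scope is a newly released collection of sparse autoencoders (SAEs) for the Qwen 3.5 models, from 2B up to the 35B MoE, designed to map internal features across all layers. It acts as a dictionary of the model's internal concepts and enables precise interventions: "Surgical Abliteration" to suppress specific features such as refusal, "Feature Steering" to activate desired concepts, and "Model Debugging" to identify token-triggered internal directions. The release is under the Apache 2.0 license, but the Qwen team advises against using it to remove safety filters. The tool is demonstrated in a Space demo and described in a technical paper. Commenters highlight the release as potentially the largest open-source interpretability tool for a dense 27B model, contrasting it with Google's smaller GemmaScope variants. There is anticipation for similar tools for future versions such as Qwen 3.6.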
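For readers unfamiliar with the mechanics, the following is a toy, pure-Python sketch of how a generic sparse autoencoder supports steering and feature suppression on an activation vector. It is not Qwen-Scope's API or architecture, which the post does not describe. The 2-dimensional weights, the function names, and the orthonormal feature directions are all assumptions chosen so the arithmetic is easy to follow.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]

# Toy SAE with two features whose directions are the coordinate axes (illustrative only).
W_ENC: Matrix = [[1.0, 0.0], [0.0, 1.0]]       # one encoder row per feature
B_ENC: Vector = [0.0, 0.0]
DIRECTIONS: Matrix = [[1.0, 0.0], [0.0, 1.0]]  # one decoder direction per feature
B_DEC: Vector = [0.0, 0.0]


def encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Feature activations: ReLU(W_enc @ x + b_enc)."""
    pre = [sum(w * x_i for w, x_i in zip(row, x)) for row in w_enc]
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]


def decode(f: Vector, directions: Matrix, b_dec: Vector) -> Vector:
    """Reconstruction: b_dec + sum over features of f[j] * directions[j]."""
    out = list(b_dec)
    for act, direction in zip(f, directions):
        for i, d_i in enumerate(direction):
            out[i] += act * d_i
    return out


def steer(x: Vector, directions: Matrix, j: int, alpha: float) -> Vector:
    """Feature steering: move the activation along feature j's decoder direction."""
    return [x_i + alpha * d_i for x_i, d_i in zip(x, directions[j])]


def suppress(x: Vector, w_enc: Matrix, b_enc: Vector, directions: Matrix, j: int) -> Vector:
    """Remove feature j's current contribution from the activation."""
    f_j = encode(x, w_enc, b_enc)[j]
    return steer(x, directions, j, -f_j)


if __name__ == "__main__":
    x = [2.0, 0.5]
    print("features:", encode(x, W_ENC, B_ENC))
    print("after suppressing feature 0:", suppress(x, W_ENC, B_ENC, DIRECTIONS, 0))
    print("after steering feature 1 by +1.5:", steer(x, DIRECTIONS, 1, 1.5))
```

In a real model the edited vector would be written back into the residual stream at the layer the SAE was trained on. With non-orthogonal feature directions, suppressing one feature also shifts others, so the clean result here is a property of the toy weights.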

Original text

a quiet day.

AI News for 4/29/2026-4/30/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI’s GPT-5.5, Codex expansion, and cyber capability evaluations

GPT-5.5 is now credibly in the top tier for long-horizon cyber tasks: the UK AI Security Institute reported that GPT-5.5 became the second model to complete one of its multi-step cyber-attack simulations end-to-end, and multiple follow-on posts highlighted rough parity with Claude Mythos Preview on this eval: @scaling01 cited 71.4% average pass rate for GPT-5.5 vs 68.6% for Mythos, while @cryps1s noted GPT-5.5 solved the TLO chain in 2/10 attempts vs Mythos’ 3/10. @polynoamial emphasized that performance was still improving past 100M tokens of inference budget, suggesting no obvious saturation yet. This materially changes the earlier narrative that Anthropic had a unique lead in offensive cyber automation. OpenAI also paired this moment with a product-side security release: Advanced Account Security for ChatGPT, adding phishing-resistant sign-in and hardened recovery.
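To see why 2/10 versus 3/10 on the TLO chain reads as rough parity rather than a gap, one can put an interval around each rate. The sketch below uses the standard Wilson score interval. Applying it here is an illustration rather than part of the AISI evaluation, and it assumes the ten attempts are independent trials.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    for label, wins in (("GPT-5.5", 2), ("Mythos", 3)):
        low, high = wilson_interval(wins, 10)
        print(f"{label}: {wins}/10, 95% interval {low:.2f} to {high:.2f}")
```

With ten attempts each, the intervals come out at roughly 6% to 51% and 11% to 60%, so they overlap almost entirely and the one-solve difference carries little signal on its own.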

Codex is moving beyond coding into general computer work: OpenAI shipped a substantial Codex update framed explicitly as “for everyone, for any task done with a computer,” with the main announcement highlighting role-based onboarding, app connections, and workflows spanning docs, slides, spreadsheets, research, and planning. @ajambrosino summarized the update as dynamic task-specific UI, 20% faster computer/browser use, better slide/sheet handling, and less clunky handoffs, while @AriX called out that Computer Use runs 42% faster after the update. Sam Altman amplified the launch with “big upgrade for codex today! try it for non-coding computer work.” The broader pattern: OpenAI is productizing “computer-use agent” UX, not just model capability.

Benchmark deltas were incremental but economically meaningful: Artificial Analysis reported GPT-5.5 Pro as a slight new SOTA on CritPt over GPT-5.4 Pro, but the interesting point was not raw score—it achieved the bump with ~60% lower cost and token use on that frontier-science eval. That lines up with broader chatter that the GPT-5.5 family is less about a dramatic intelligence discontinuity than about stronger reliability and better efficiency in high-value workflows.

Open-weight model movement: Qwen3.6, Tencent Hy3-preview, Grok 4.3, and Ling 2.6 1T

Qwen3.6 27B looks like the most important open-weight release of the day: Artificial Analysis ranked Qwen3.6 27B as the new open-weights leader under 150B parameters with an Intelligence Index score of 46, ahead of Gemma 4 31B and prior Qwen variants. Key details: Apache 2.0, 262K context, native multimodal input, and BF16 weights small enough to fit on a single H100. The companion 35B A3B MoE scored 43, making it the strongest open model around 3B active parameters. The tradeoff is expensive inference-by-output-token: AA estimates Qwen3.6 27B used ~144M output tokens on the suite and is roughly 21× the cost of Gemma 4 31B to run there. Still, on capability-per-size it appears to be a notable step.

Tencent’s Hy3-preview is competitive but not class-leading: Artificial Analysis described Hy3-preview as a 295B total / 21B active MoE with 256K context and a restricted-commercial-use community license. It scored 42 on AA’s Intelligence Index, trailing recent open peers like Qwen3.6 27B, DeepSeek V4 Flash, and GLM-5.1. The most interesting bright spot was CritPt, where it matched GLM-5.1 at 4.6%, suggesting better-than-average scientific reasoning relative to its overall position.

xAI’s Grok 4.3 improved sharply on agentic benchmarks while getting cheaper: Artificial Analysis measured Grok 4.3 at 53 on the Intelligence Index, up four points from Grok 4.20 v2, with a major jump on GDPval-AA to 1500 Elo. AA also reported approximately 40% lower input price and 60% lower output price than the prior version. The release still trails GPT-5.5 on GDPval-AA by a wide margin, but it looks like a real systems-and-post-training improvement rather than a minor rev.

Ant Group’s Ling 2.6 1T targets cost-efficiency rather than frontier status: Artificial Analysis positioned Ling 2.6 1T as a 1T-parameter non-reasoning model scoring 34, with decent GPQA/HLE numbers and notably low benchmark-run cost at roughly $95. The caveat is reliability: AA reported a 92% hallucination rate on AA-Omniscience.

DeepSeek multimodal/vision work, GUI agents, and training scale speculation

DeepSeek’s multimodal direction appears tightly coupled to computer-use agents: @nrehiew_ highlighted that DeepSeek trains vision into V4-Flash by having the model directly output bounding boxes and point coordinates during reasoning, interpreting this as a computer-use-oriented design rather than generic VLM work. A second post argues the paper’s “visual primitives” tasks map directly to browser/computer use rather than broad multimodal understanding (link). That framing matches parallel observations from @teortaxesTex that DeepSeek may be integrating vision weights back into the main V4 line rather than releasing a separate “V4-Flash-Vision”.

The repo disappearance became a story of its own: after release, several observers noted that DeepSeek’s “Thinking with Visual Primitives” repo vanished, including @teortaxesTex and @arjunkocher. No clear explanation emerged in these tweets, but the deletion drew more attention because the work suggested a concrete recipe for visual reasoning and GUI grounding.

Scaling chatter points to very large token counts for frontier pretraining: @teortaxesTex argued that >100T tokens is no longer unusual for frontier models and estimated a hypothetical 100T-token DeepSeek V4 as “V4 + 2 more epochs,” while @nrehiew_ back-of-the-enveloped ~150T tokens and ~9e25 pretraining FLOPs for a ~100B active model, suggesting a run feasible in roughly 14 days on an OpenAI-scale 100K GB200 cluster at conservative MFU. These are speculative takes, but useful as calibration for what “frontier-scale” now means in practice.
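The quoted figures are consistent with the common rule of thumb of about 6 FLOPs per active parameter per training token. The tweets do not say which formula was used, so that is an assumption here. The sketch below reproduces the ~9e25 figure and then works out the sustained per-GPU throughput that a 14-day run on 100K GPUs would imply. Function names are illustrative.

```python
def training_flops(active_params: float, tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per active parameter per token."""
    return 6.0 * active_params * tokens


def implied_per_gpu_flops(total_flops: float, gpus: int, days: float) -> float:
    """Sustained FLOP/s each GPU must deliver to finish in the given time."""
    seconds = days * 24 * 3600
    return total_flops / (gpus * seconds)


if __name__ == "__main__":
    total = training_flops(active_params=100e9, tokens=150e12)
    per_gpu = implied_per_gpu_flops(total, gpus=100_000, days=14)
    print(f"total: {total:.2e} FLOPs")
    print(f"sustained per GPU: {per_gpu / 1e12:.0f} TFLOP/s")
```

Dividing the sustained per-GPU figure by the accelerator's peak throughput at the training precision gives the MFU the estimate implies.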

Agent infrastructure, harness engineering, and collaborative agent systems

There is a clear shift from model-centric bragging to harness-centric engineering: Cursor published a strong note on how it tests and tunes its agent harness, focusing on runtime, evals, degradation repair, and model-specific customization rather than generic benchmark claims. @Vtrivedy10 explicitly connected Cursor’s writeup to design patterns converging across agent builders: bespoke prompts/tools per model, mixed offline+online evals, dogfooding, and treating the context window as the primary compute boundary.

LangChain continues to package deployment and multi-tenant agent infra: @hwchase17 introduced DeepAgents deploy, a config-driven cloud deployment flow via deepagents.toml, covering agent, sandbox, auth, and frontend sections. Related posts from LangChain staff detailed agent-server patterns for data isolation, delegated credentials, and RBAC in multi-user deployments (example). This is increasingly the boring-but-important layer turning demos into enterprise software.

Collaborative multi-agent workspaces are getting more concrete: @cmpatino_ introduced Agent Collabs, using Hugging Face buckets plus Spaces as a shared backend for swarms of heterogeneous agents to exchange messages, artifacts, and progress. The noteworthy idea is not just “agents collaborating,” but lightweight coordination primitives that let weaker agents contribute useful validation work while better-resourced agents handle expensive experiments.

Security, supply chain, and account hardening

Open-source package compromise remains an acute operational risk: Socket reported that the popular PyPI package lightning was compromised in versions 2.6.2 and 2.6.3, with malicious code executing on import, downloading Bun, and running an 11 MB obfuscated JavaScript payload aimed at credential theft. @theo connected that incident with additional package compromises (intercom-client on npm) and a Linux zero day, arguing the tempo of software supply-chain attacks is increasing.

Security scanners are becoming first-class AI products: Anthropic rolled out Claude Security, described by @kimmonismus and later @_catwu as a repo vulnerability scanner that validates findings and suggests fixes, powered by Opus 4.7. Cursor shipped a parallel offering with Cursor Security Review, including always-on PR review and scheduled codebase scans. This is one of the clearest examples of model vendors moving directly into established devsecops categories.

Top tweets (by engagement)

OpenAI Codex broadens into general knowledge work: OpenAI’s Codex announcement and Sam Altman’s follow-up were the day’s biggest product posts, signaling a strategic push from “coding agent” to “computer-use agent”.

GPT-5.5’s cyber eval result mattered: UK AISI’s thread was one of the highest-engagement technical posts and reshaped comparisons with Anthropic’s Mythos.

Qwen shipped interpretability tooling, not just models: Qwen-Scope, an open suite of sparse autoencoders for Qwen models, stood out as a rare release focused on feature steering, debugging, data synthesis, and evaluation rather than raw model weights.

Anthropic published a large-scale guidance/sycophancy study: their analysis of 1M Claude conversations tied behavioral research directly to training changes for Opus 4.7 and Mythos Preview, an important sign that post-training loops are becoming more productized and data-informed.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. AMD Ryzen 395 Box and Halo Box Launch

AMD in-house ryzen 395 box coming in June (Activity: 1061): The image from the AMD AI Dev Day presentation showcases the upcoming AMD Ryzen 395 box, which is expected to be released in June. The device features 128GB of unified memory and claims to support 200-billion-parameter models natively, leveraging what is referred to as "Ryzen AI Max." The product appears to be manufactured by Lenovo, as suggested by a mention in the presentation. However, an engineer confirmed that the unit is essentially a Ryzen 395 with 128GB and no additional changes. Commenters are skeptical about the practicality of running a 200-billion-parameter model on 128GB of unified RAM, questioning the feasibility given the memory constraints even when accounting for operating system overhead.

promethe42 compares the new AMD Ryzen 395 box to a 'Framework Desktop', noting that it seems to be released '12 months later'. They suggest that AMD should prioritize improving their 'drivers/ROCm' before releasing new hardware, indicating that software support might be lagging behind hardware advancements.

DaniyarQQQ comments on the need for '512GB of unified memory', implying that current memory capacities may be insufficient for modern computing demands, particularly in high-performance or AI applications. This suggests a trend towards increasing memory requirements in cutting-edge technology.

AMD Halo Box (Ryzen 395 128GB) photos (Activity: 467): The AMD Halo Box, featuring a Ryzen 395 processor and 128GB of RAM, was showcased running Ubuntu. The unit includes a programmable light strip, enhancing its customization capabilities. However, it lacks a CD-ROM drive and does not feature a fast port for clustering, which may limit its use in certain high-performance computing scenarios. Commenters noted the absence of a CD-ROM and a fast port for clustering as potential drawbacks, indicating that while the device is compact, these omissions could affect its utility in specific technical applications.

FoxiPanda highlights a common request for increased memory bandwidth in AMD products, suggesting that current offerings may not meet the demands of memory-intensive applications. This is a critical factor for workloads that require rapid data access and processing.

Stepfunction notes that the AMD Halo Box is a small form factor computer, which implies potential constraints on expandability and cooling, but also benefits in terms of space efficiency and portability.

2. Qwen Model Innovations and Applications

Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models (Activity: 393): Qwen-Scope is a newly released collection of Sparse Autoencoders (SAEs) for the Qwen 3.5 models, ranging from 2B to 35B MoE, designed to map internal features across all layers. This tool acts as a dictionary of the model's internal concepts, allowing for precise interventions such as Surgical Abliteration to suppress specific features like refusal, Feature Steering to activate desired concepts, and Model Debugging to identify token-triggered internal directions. The release is under the Apache 2.0 license, but the Qwen team advises against using it to remove safety filters. The tool is demonstrated in a Space demo and detailed in a technical paper. Commenters highlight the significance of this release as potentially the largest open-source interpretability tool for a dense 27B model, contrasting it with Google's smaller GemmaScope variants. There is anticipation for similar tools for future model iterations like Qwen 3.6.

robert896r1 expresses anticipation for the release of similar tools for Qwen 3.6, suggesting that the community might adapt existing tools for newer iterations. This reflects a common t
