Smol AI News · April 27, 2026, 14:44 · approx. 12 min read

Not much happened today

#OpenAI #GPT-5.5 #MultiCloud #Microsoft Azure #AWS Bedrock #Benchmarks

TL;DR

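OpenAI signaled that it will loosen its exclusive arrangement with Microsoft and make its main models available on other clouds such as AWS. Meanwhile, GPT-5.5 trailed Opus 4.7 on a community benchmark. Together the two stories give a snapshot of the competitive landscape and of where model performance currently stands.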

AI Deep Analysis · April 28, 2026, 10:44

Importance (5-point scale)

Depth: 40%

Key points

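OpenAI ends cloud exclusivity and updates its partnership

OpenAI updated its partnership with Microsoft: Microsoft remains the primary cloud, but OpenAI can now offer its products on other clouds as well. Commentators pointed to distribution via Google TPU, AWS Trainium, and AWS Bedrock. Microsoft's license to OpenAI IP becomes non-exclusive, and @simonw noted that the new language likely means the old AGI clause is effectively gone.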

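GPT-5.5 performance evaluation and comparison with competitors

GPT-5.5 is a clear step up from GPT-5.4 but is not uniformly dominant. On WeirdML (no-thinking mode) it rose from 57.4% to 67.1% yet still trails Opus 4.7 at 76.4%. On LMSYS Arena it ranks #3 in Math and #2 in Search but only #9 in Code Arena, so its overall lead is limited.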

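Practical feedback from developers

Practitioners gave GPT-5.5 positive feedback on hard coding tasks such as GPU kernel generation. The current Arena evaluation covers only the medium and high reasoning settings; results for xHigh are still pending.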

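Impact analysis

Ending cloud exclusivity gives enterprise users more choice and may intensify price competition, while reducing Microsoft's relative influence. That GPT-5.5 has not caught up with its competitor on every benchmark suggests that reasoning quality and architectural innovation, rather than parameter count alone, will decide the competition.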

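Editorial comment

OpenAI's strategy shift unsettles an industry norm, and competition among cloud providers will matter even more in how AI models are supplied. GPT-5.5's failure to match competitors across the board is a reminder to users of how much benchmark selection and hands-on validation matter.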

a quiet day.

AI News for 4/26/2026-4/27/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI Distribution Shift, GPT-5.5 Benchmarks, and Codex/Copilot Pricing Signals

OpenAI loosens Azure exclusivity: @sama said OpenAI updated its Microsoft partnership so Microsoft remains the primary cloud, but OpenAI can now make products available across all clouds, with product/model commitments extending to 2032 and revenue share through 2030. The implication was quickly drawn by @scaling01 and @kimmonismus: OpenAI can now distribute via Google TPU / AWS Trainium / Bedrock, and Microsoft’s license to OpenAI IP becomes non-exclusive. @ajassy confirmed OpenAI models are coming to AWS Bedrock in the coming weeks. @simonw noted the new language likely means the old AGI clause is effectively gone.

GPT-5.5 is a broad upgrade, but not uniformly dominant: Community evals from @htihle put GPT-5.5 no-thinking at 67.1% on WeirdML, up from 57.4% for GPT-5.4, but still behind Opus 4.7 no-thinking at 76.4% while using fewer tokens. LMSYS Arena results from @arena placed GPT-5.5 at #9 in Code Arena, #6 Document, #7 Text, #3 Math, #2 Search, #5 Vision, with Expert Arena #5. Arena also clarified current evaluation covers medium/high reasoning, with xHigh still pending (1, 2). Practitioner feedback was positive for hard coding tasks such as GPU kernels from @gdb, but there were also reports of “compressed CoT leakage” / malformed outputs in no-thinking mode from @htihle.
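
To put the WeirdML numbers side by side, here is a minimal arithmetic sketch. It uses only the three no-thinking scores quoted above; the helper name `gap` is ours, not part of any benchmark tooling.

```python
# WeirdML no-thinking scores quoted above (percent).
scores = {
    "GPT-5.4": 57.4,
    "GPT-5.5": 67.1,
    "Opus 4.7": 76.4,
}

def gap(a: str, b: str) -> float:
    """Percentage-point difference scores[a] - scores[b], to one decimal."""
    return round(scores[a] - scores[b], 1)

gain_over_predecessor = gap("GPT-5.5", "GPT-5.4")  # generational gain
deficit_to_opus = gap("Opus 4.7", "GPT-5.5")       # remaining gap

print(f"GPT-5.5 vs GPT-5.4: +{gain_over_predecessor} points")
print(f"Opus 4.7 vs GPT-5.5: +{deficit_to_opus} points")
```

On these figures, the gain over GPT-5.4 and the remaining gap to Opus 4.7 are of similar size.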

Developer economics are becoming more explicit: GitHub announced Copilot moves to usage-based billing on June 1, a notable shift as agentic workflows consume much more runtime. Parallel to that, @Hangsiin documented Codex usage multipliers: GPT-5.4 fast = 2x, GPT-5.5 fast = 2.5x, with 5.4-mini and GPT-5.3-Codex materially cheaper. @sama argued Codex at $20 remains a strong value. OpenAI also open-sourced Symphony, an orchestration layer connecting issue trackers to Codex agents for “open issue → agent → PR → human review,” via @OpenAIDevs.
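
A rough sketch of what the quoted multipliers could mean for a fixed usage budget. The two multipliers come from the paragraph above; the idea of a per-task "usage unit", the budget, and the function name are assumptions made purely for illustration, not how Codex actually meters usage.

```python
# Multipliers as quoted above. What they multiply (a per-task "usage unit")
# is an assumption for this sketch.
FAST_MULTIPLIER = {"GPT-5.4 fast": 2.0, "GPT-5.5 fast": 2.5}

def tasks_within_budget(budget_units: float, base_units_per_task: float, tier: str) -> int:
    """Number of whole tasks that fit in a usage budget at the given tier."""
    cost_per_task = base_units_per_task * FAST_MULTIPLIER[tier]
    return int(budget_units // cost_per_task)

# Hypothetical budget of 100 units and 4 base units per task.
for tier in FAST_MULTIPLIER:
    print(tier, tasks_within_budget(100.0, 4.0, tier))
```

Under these made-up numbers, moving from the 2x tier to the 2.5x tier cuts the number of tasks that fit in the same budget by roughly a fifth.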

Xiaomi MiMo-V2.5, Kimi K2.6, and China’s Agent-Oriented Open-Weights Push

MiMo-V2.5 is one of the day’s biggest open releases: @XiaomiMiMo open-sourced MiMo‑V2.5-Pro and MiMo‑V2.5 under MIT, both with 1M-token context. The Pro model is framed as a complex agent/coding model and the smaller model as a native omni-modal agent. Community summaries from @eliebakouch add useful technical details: MiMo‑V2.5-Pro is roughly 1T total / 42B active, trained on 27T tokens in FP8, while MiMo‑V2.5 is about 310B total / 15B active, trained on 48T tokens, with aggressive interleaved SWA/global attention and no shared expert. Xiaomi also announced a 100T token grant for builders via @_LuoFuli. Day-0 inference support landed quickly in vLLM and SGLang/vLLM.
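
The reported sizes imply how sparse each mixture-of-experts model is. The sketch below only divides the approximate figures from the community summary above; the dictionary layout and function names are ours.

```python
# Approximate figures from the community summary quoted above.
models = {
    "MiMo-V2.5-Pro": {"total": 1_000e9, "active": 42e9, "train_tokens": 27e12},
    "MiMo-V2.5": {"total": 310e9, "active": 15e9, "train_tokens": 48e12},
}

def active_fraction(name: str) -> float:
    """Share of parameters used per token (active / total)."""
    m = models[name]
    return m["active"] / m["total"]

def tokens_per_total_param(name: str) -> float:
    """Training tokens seen per parameter in the full model."""
    m = models[name]
    return m["train_tokens"] / m["total"]

for name in models:
    print(f"{name}: {active_fraction(name):.1%} active, "
          f"{tokens_per_total_param(name):.0f} tokens per parameter")
```

Both models activate under 5% of their weights per token, while the smaller model was trained on far more tokens per parameter.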

Kimi K2.6 continues to lead in mindshare and deployment: @Kimi_Moonshot said Kimi K2.6 is now #1 on OpenRouter’s weekly leaderboard. Secondary reporting described it as a model for coding and long-horizon agents, including scaling to 300 concurrent sub-agents across 4,000 coordinated steps (dl_weekly). Practitioners remain split on speed/quality tradeoffs: @teortaxesTex found Kimi in Hermes much slower than DeepSeek V4 but sometimes capable of fixing bugs V4 could not.

Broader China-model trend: Multiple posts framed Chinese labs as pushing aggressively on open-ish, agent-oriented, long-context systems: Qwen 3.6 Flash, DeepSeek V4/Flash, GLM-5.1 promotions (triple usage extension), and Xiaomi’s MIT release. A recurring theme was that smaller / cheaper variants are often outperforming their larger siblings on practical agent benchmarks.

Agent Runtimes, Orchestration, and Local-First Tooling

Sakana’s Conductor is a notable multi-agent result: @SakanaAILabs introduced a 7B Conductor trained with RL to orchestrate a pool of frontier models in natural language rather than solving tasks directly. It dynamically decides which agent to call, what subtask to assign, and which context to expose, and reportedly reached 83.9% on LiveCodeBench and 87.5% on GPQA-Diamond, beating any single worker in its pool. @hardmaru highlighted “AI managing AI” and recursive self-selection as a new axis of test-time scaling.
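
The three decisions described above (which agent, which subtask, which context) can be pictured with a toy loop. This is not Sakana's implementation: the workers are plain functions standing in for frontier models, and `conductor_policy` is a fixed, hand-written plan standing in for the RL-trained 7B model. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    worker: str               # which agent to call
    subtask: str              # what to ask it
    context_keys: list[str]   # which earlier results it is allowed to see

# Hypothetical workers: simple functions in place of real model calls.
WORKERS: dict[str, Callable[[str, dict], str]] = {
    "planner": lambda task, ctx: f"plan({task})",
    "coder": lambda task, ctx: f"code({task}; given {sorted(ctx)})",
}

def conductor_policy(task: str) -> list[Step]:
    """Stand-in for the learned policy: always emits the same two-step plan."""
    return [
        Step("planner", f"outline: {task}", []),
        Step("coder", f"implement: {task}", ["planner"]),
    ]

def run(task: str) -> dict:
    results: dict[str, str] = {}
    for step in conductor_policy(task):
        # Expose only the context the conductor selected for this step.
        visible = {k: results[k] for k in step.context_keys}
        results[step.worker] = WORKERS[step.worker](step.subtask, visible)
    return results

out = run("sort a list")
print(out["coder"])
```

In the reported system the plan itself is what gets learned; here it is hard-coded so the control flow is easy to follow.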

Local and hybrid agents keep getting better: Several posts showed coding/assistant stacks running locally. @patloeber and @_philschmid documented running Pi agent + Gemma 4 26B A4B locally via LM Studio/Ollama/llama.cpp. @googlegemma demoed a fully local browser agent using Gemma 4 + WebGPU, with native tool calling for browsing history, tab management, and page summarization. @cognition shipped Devin for Terminal, a local shell agent that can later hand off to the cloud.

Agent ergonomics and framework evolution: Hermes had a strong day: @Teknium noted Hermes Agent’s repo surpassed Claude Code, while native vision became the default when supported. The broader ecosystem kept filling in missing pieces: Cline Kanban now supports different agents/models per task card; Future AGI open-sourced an eval/optimization stack for self-improving agents; and @_philschmid argued MCP works best either through explicit @mention loading or subagent-scoped tool assignment, not indiscriminate server attachment.

Inference Infrastructure, Attention/KV Engineering, and Systems Work

Google’s TPU split is a meaningful architecture signal: Several posts dissected Google’s Cloud Next announcement that TPU v8 is split into 8t for training and 8i for inference, with claims of roughly 2.8x faster training and 80% better inference performance/$ than prior generation. @kimmonismus emphasized this is the first time Google split custom silicon by workload and that OpenAI, Anthropic, and Meta are reportedly buying TPU capacity.

DeepSeek V4 support is maturing quickly in infra stacks: @vllm_project said support for DeepSeek V4 base models is coming, requiring an expert_dtype config field to distinguish FP4 instruct vs FP8 base. In the vLLM 0.20.0 release, highlights included DeepSeek V4 support, FA4 as default MLA prefill, TurboQuant 2-bit KV, and a DeepSeek-specific MegaMoE path on Blackwell.

KV cache optimization remains a hot battleground: There was dense discussion around long-context bottlenecks and KV strategies. @cHHillee summarized three main levers for long contexts: local/sliding attention, interleaved local-global attention, and smaller KV per global layer via GQA/MLA/KV tying/quantization. On the implementation side, @vllm_project and Red Hat/AWS published an FP8 KV-cache deep dive where a fix to FA3 two-level accumulation improved 128k needle-in-a-haystack from 13% to 89% while retaining FP8 decode speedups. Community critics also questioned DeepSeek V4’s specific KV tradeoffs relative to offloading-heavy approaches such as HiSparse (discussion).
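
A back-of-the-envelope sketch of how those levers shrink the cache. The formula is the standard one for a key/value cache (two vectors per KV head per cached token). The 48-layer model, head sizes, and window length are hypothetical, and MLA is not modelled because it compresses KV differently.

```python
GIB = 2**30

def kv_cache_bytes(seq_len, n_global_layers, n_local_layers, window,
                   n_kv_heads, head_dim, bytes_per_elem):
    """KV-cache size for one sequence with a mix of global and local layers."""
    # Each cached token stores one key and one value vector per KV head.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    # Global layers keep every token; local layers keep at most `window` tokens.
    cached_tokens = (n_global_layers * seq_len
                     + n_local_layers * min(seq_len, window))
    return per_token_per_layer * cached_tokens

# Hypothetical 48-layer model, head_dim 128, 128k-token context.
common = dict(seq_len=131_072, head_dim=128)
baseline = kv_cache_bytes(n_global_layers=48, n_local_layers=0, window=0,
                          n_kv_heads=32, bytes_per_elem=2, **common)
gqa = kv_cache_bytes(n_global_layers=48, n_local_layers=0, window=0,
                     n_kv_heads=8, bytes_per_elem=2, **common)
interleaved = kv_cache_bytes(n_global_layers=8, n_local_layers=40, window=4096,
                             n_kv_heads=8, bytes_per_elem=2, **common)
fp8 = kv_cache_bytes(n_global_layers=8, n_local_layers=40, window=4096,
                     n_kv_heads=8, bytes_per_elem=1, **common)

for name, size in [("baseline", baseline), ("GQA", gqa),
                   ("interleaved", interleaved), ("FP8", fp8)]:
    print(f"{name:>12}: {size / GIB:.2f} GiB")
```

The levers compound: in this made-up configuration, fewer KV heads, mostly local layers, and 8-bit storage together take the cache from tens of GiB to a few GiB per sequence.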

Benchmarks, Evals, and Open Research Directions

Open-world evaluation is gaining momentum: @sarahookr argued that most agentic benchmarks are overfit to automatically verifiable tasks, while the important frontier is open-world, uncertain, non-fully-verifiable work. Related threads connected this to continual learning, memory stores, and adaptive data systems (1, 2).

Cost-aware agent evaluation is becoming first-class: @dair_ai highlighted a new study on coding-agent spend over SWE-bench Verified: agentic coding can consume ~1000x more tokens than chat/code reasoning, usage can vary 30x across runs on identical tasks, and more spending does not monotonically improve accuracy. This lines up with pricing-model changes from Copilot and growing concern over uncontrolled agent runtime economics.
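
One way to report cost next to accuracy is sketched below. The run data is invented for illustration (chosen so the spread matches the 30x figure above), and the two metric names are our own rather than the study's.

```python
# Hypothetical runs of one agent on one task: (tokens_used, solved).
runs = [(40_000, True), (120_000, False), (1_200_000, True), (300_000, False)]

def spread_ratio(runs) -> float:
    """Largest token usage divided by smallest, across runs of the same task."""
    tokens = [t for t, _ in runs]
    return max(tokens) / min(tokens)

def tokens_per_success(runs) -> float:
    """Total tokens spent (including failed runs) per solved run."""
    total = sum(t for t, _ in runs)
    solved = sum(1 for _, ok in runs if ok)
    return total / solved if solved else float("inf")

print(f"spread: {spread_ratio(runs):.0f}x")
print(f"tokens per solved run: {tokens_per_success(runs):,.0f}")
```

Note that in this toy data the cheapest run succeeds while a more expensive one fails, which is the non-monotonic pattern the study describes.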

New benchmarks and domain-specific evals: ParseBench from LlamaIndex adds 2k verified enterprise document pages for parsing agents. AgentIR reframes retrieval for research agents by embedding the reasoning trace alongside the query, with AgentIR-4B hitting 68% on BrowseComp-Plus vs 52% for larger conventional embedding models. There were also several benchmark snapshots for frontier models—e.g. Opus 4.7 leading GSO at 42.2% and WeirdML / ALE-Bench / PencilPuzzleBench chatter—but the stronger signal was methodological: more people are measuring runtime cost, retrieval quality, and open-world behavior, not just final answer accuracy.
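
The idea of embedding the reasoning trace alongside the query can be illustrated with a toy retriever. This is not AgentIR's model: a bag-of-words vector with cosine similarity stands in for a learned embedding, and the query, trace, and documents are invented.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a learned embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_doc(query_text: str, docs: list[str]) -> int:
    """Index of the document most similar to the query text."""
    q = embed(query_text)
    return max(range(len(docs)), key=lambda i: cosine(q, embed(docs[i])))

docs = [
    "python snake speed and hunting performance",
    "python list sort uses timsort with n log n complexity",
]
query = "python performance"
trace = "the user is debugging a slow sort so look for timsort complexity"

query_only = best_doc(query, docs)
with_trace = best_doc(query + " " + trace, docs)
print(query_only, with_trace)
```

The bare query is ambiguous and lands on the wrong document; adding the agent's reasoning trace supplies the missing intent and flips the ranking.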

Top tweets (by engagement)

OpenAI–Microsoft partnership reset: @sama on cross-cloud availability and continued Microsoft partnership.

OpenAI on AWS: @ajassy confirming OpenAI models are coming to Bedrock.

GitHub Copilot pricing change: @github announcing usage-based billing starting June 1.

Xiaomi MiMo-V2.5 open-source release: @XiaomiMiMo with MIT license and 1M context.

Open-source orchestration for Codex: @OpenAIDevs launching Symphony.

Gemma local browser agent: @googlegemma showing a 100% local browser-resident agent with WebGPU.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.
