Smol AI News · May 12, 2026, 14:44 · approx. 17 min read

Nothing particularly notable today

#Reasoning #Benchmark #Agentic AI #RAG #Google DeepMind

TL;DR

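Today's AI news shows evaluation benchmarks in mathematics and medicine becoming more demanding, and agentic systems making increasingly substantive contributions to scientific and mathematical research.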

AI Deep Analysis · May 13, 2026, 12:02

Importance: / 5

Depth: 40%

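Key points

Stronger research-level evaluation benchmarks

Soohak provides 439 research-level math problems written by 64 mathematicians, and SophontAI expanded its medical benchmark suite. Both go beyond conventional olympiad-style problems and demand that models handle harder tasks.

Agentic systems advancing science and mathematics

Google DeepMind's AI Co-Mathematician and specialized physics agents posted large score gains on hard benchmarks such as FrontierMath and CritPt, moving research assistance closer to practical use.

Small models improving retrieval and reasoning performance

LightOn's Agent-ModernColBERT, a lightweight retriever with 149M parameters, reportedly reaches retrieval performance comparable to systems built on much larger models, which points to more cost-efficient system designs.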

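Impact analysis

This news suggests that AI evaluation is shifting from measuring accuracy on tasks models already handle to measuring the ability to solve hard problems, which pushes developers toward more difficult benchmarks. Agentic AI is also being built into the research process itself and is starting to establish itself as new infrastructure that speeds up scientific and technical discovery.

Editorial comment

Despite the "quiet day" title, the issue points to two significant turning points: a fundamental change in evaluation standards, and AI agents that support real research moving into practical use.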


a quiet day.

AI News for 5/11/2026-5/12/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space.

AI Twitter Recap

Research Benchmarks, Hard Evals, and Agentic Science Systems

Research-level reasoning benchmarks keep getting harder: Soohak introduces 439 research-level math problems authored from scratch by 64 mathematicians (including 38 faculty), explicitly targeting capabilities above standard olympiad-style math. In medical evaluation, @SophontAI released Medmarks v1.0, expanding its open medical benchmark suite from 20→30 benchmarks and 46→61 models. There’s also growing sentiment that old evals are saturating: @polynoamial argues benchmarks with uniformly high scores should be retired in favor of lower-scoring, frontier-challenging tests.

Agentic systems are starting to move benchmark frontiers in science and math: Google DeepMind’s AI Co-Mathematician is described as an asynchronous, stateful research workbench for mathematicians, reportedly reaching 48% on FrontierMath Tier 4 while supporting ideation, literature discovery, computational analysis, theorem verification, and formal outputs. In theoretical physics, physics-intern boosts Gemini 3.1 Pro from 17.7% to 31.4% on CritPt via decomposition into specialized agents. On coding/program synthesis, ProgramBench’s first task was reportedly solved by GPT-5.5 high/xhigh, with xhigh outperforming Opus 4.7 xhigh across metrics.

Retrieval and search benchmarks are rewarding small, specialized models: LightOn’s Agent-ModernColBERT stacks another ~10% over Reason-ModernColBERT on BrowseComp-Plus while keeping the retriever at 149M parameters, with claims of matching or exceeding much larger model-based systems when paired with a generator. Related discussion from @xuzihuan4 asks whether lexical retrieval may suffice in agentic search loops when agents can iteratively refine their own queries.

Training, Optimization, and Scaling-Law Techniques

Optimizer work continues to compress training cost and improve small-scale experimentation: Several tweets centered on fast variants of SOAP/Muon-style updates. @torchcompiled applied tangent-step + Stiefel manifold retraction to SOAP basis updates, with follow-up discussion on drift checks and QR fallback for stability. In the Modded-NanoGPT community, SOAP-Muon set a new record at 3150 steps (-60), while an earlier MuLoCo-style outer Nesterov SGD wrap on NorMuonH also improved results, both backed by p-value reporting.

Formal methods and superoptimization are beginning to merge with ML systems work: @leloykun described a Lean4-to-TileLang tensor program superoptimizer that can automatically discover kernels such as FlashAttention2, FlashNorm, and split-k matmul, reporting roughly 1.8× geomean speedup on A100s. The same framework is positioned to jointly search over kernels, optimizers, hyperparameter transfer rules, and scaling laws.

Scaling laws and training metrics are being re-examined: @che_shr_cat argues the classic “20 tokens per parameter” framing is tokenizer-dependent and that scaling should be measured in bytes, not tokens. Separately, @JJitsev emphasized that prescriptive scaling laws are valuable not just for prediction, but as a systematic basis for comparing learning procedures across scales.
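To make the tokenizer dependence concrete, here is a minimal sketch. The bytes-per-token figures and tokenizer labels are illustrative assumptions rather than measurements, and `bytes_per_parameter` is a hypothetical helper, not something from the cited thread.

```python
def bytes_per_parameter(tokens_per_param: float, bytes_per_token: float) -> float:
    """Convert a tokens-per-parameter budget into bytes of training text per parameter."""
    return tokens_per_param * bytes_per_token


# Hypothetical average compression rates (bytes of UTF-8 text per token).
ASSUMED_BYTES_PER_TOKEN = {
    "small-vocab tokenizer": 3.0,
    "large-vocab tokenizer": 4.5,
}

# The same "20 tokens per parameter" rule, expressed in bytes for each tokenizer.
budgets = {
    name: bytes_per_parameter(20.0, bpt)
    for name, bpt in ASSUMED_BYTES_PER_TOKEN.items()
}

for name, budget in budgets.items():
    print(f"{name}: 20 tokens/param = {budget:.0f} bytes/param")
```

Under these assumed rates, the same nominal rule corresponds to noticeably different amounts of raw text, which is the sense in which a token-based budget depends on the tokenizer.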

Training-time-only efficiency tricks are getting more interesting: Lighthouse Attention from Nous is highlighted as a subquadratic training wrapper around vanilla attention that can be removed near the end of training after a recovery phase, preserving standard deployment-time inference while reducing long-context pretraining cost. In a similar spirit, Renderers from Prime Intellect addresses the token/message impedance mismatch between RL trainers and agent environments, claiming >3× throughput on popular open models.

Inference Systems, Serving Stacks, and Runtime Infrastructure

Blackwell racks are emerging as the reference platform for large-MoE serving: Perplexity published details on serving post-trained Qwen3 235B on NVIDIA GB200 NVL72 systems, arguing GB200 is a major inference step up over Hopper for large MoEs. Their benchmarks cite NVLS all-reduce latency dropping from 586.1µs on H200 to 313.3µs on GB200, and MoE prefill combine at EP=4 dropping from 730.1µs to 438.5µs, with better decode throughput at high token rates. @AravSrinivas framed this as materially changing prefill/decode disaggregation for serving large MoEs.
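The latency figures quoted above can be turned into relative improvements with a few lines of arithmetic. This is only a sketch over the four reported numbers; the helper names are made up for illustration.

```python
def speedup(before_us: float, after_us: float) -> float:
    """How many times faster the 'after' measurement is."""
    return before_us / after_us


def reduction_pct(before_us: float, after_us: float) -> float:
    """Latency reduction as a percentage of the 'before' measurement."""
    return 100.0 * (before_us - after_us) / before_us


# (H200 latency, GB200 latency) in microseconds, as quoted in the text.
REPORTED = {
    "NVLS all-reduce": (586.1, 313.3),
    "MoE prefill combine, EP=4": (730.1, 438.5),
}

for name, (h200_us, gb200_us) in REPORTED.items():
    print(
        f"{name}: {speedup(h200_us, gb200_us):.2f}x faster, "
        f"{reduction_pct(h200_us, gb200_us):.1f}% lower latency"
    )
```

On the quoted numbers this works out to roughly a 47% latency reduction for the all-reduce and roughly 40% for the prefill combine.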

Inference orchestration is increasingly specialized, not “just Kubernetes”: Modal argues inference needs a dedicated stack, citing work on compute management, cloud-native caching, CRIU, and GPU checkpointing. That positioning got an immediate real-world endorsement from Perceptron, which said all Mk1 inference runs on Modal because native video, structured outputs, and hybrid reasoning create unusual cold-start and scaling requirements.

OSS inference economics continue to improve fast: SemiAnalysis reported that clustering multiple B200 8-GPU machines over RoCEv2 CX-7 with PD disaggregation can lift per-GPU token throughput by up to 7×, implying comparable cost-per-token reductions. On the vector DB side, Qdrant 1.18 added TurboQuant, claiming recall near scalar quantization with 2× less memory, alongside memory monitoring and named-vector lifecycle operations.
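The step from "up to 7× per-GPU throughput" to "comparable cost-per-token reductions" can be sketched as follows. The GPU-hour price and baseline throughput are placeholder assumptions, and the calculation ignores any extra networking or orchestration cost from clustering.

```python
def cost_per_million_tokens(gpu_hour_price: float, tokens_per_gpu_second: float) -> float:
    """Cost of one million tokens on a single GPU at a given sustained throughput."""
    tokens_per_hour = tokens_per_gpu_second * 3600.0
    return gpu_hour_price / tokens_per_hour * 1_000_000.0


ASSUMED_GPU_HOUR_PRICE = 4.0   # hypothetical price per GPU-hour
ASSUMED_BASELINE_TPS = 1000.0  # hypothetical tokens/second per GPU before clustering
THROUGHPUT_GAIN = 7.0          # upper bound quoted in the text

baseline = cost_per_million_tokens(ASSUMED_GPU_HOUR_PRICE, ASSUMED_BASELINE_TPS)
improved = cost_per_million_tokens(
    ASSUMED_GPU_HOUR_PRICE, ASSUMED_BASELINE_TPS * THROUGHPUT_GAIN
)
ratio = baseline / improved

print(f"baseline: {baseline:.3f} per 1M tokens")
print(f"improved: {improved:.3f} per 1M tokens ({ratio:.1f}x cheaper)")
```

With the price per GPU-hour held fixed, the cost ratio equals the throughput gain regardless of the placeholder values chosen.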

Agent runtimes are becoming version-control-like substrates: A standout systems idea was Stanford’s Shepherd, summarized by @ai_satoru_chan, which treats agent execution more like Git: first-class tasks, effects, scopes, and traces; exact replay; branching; rollback; and formal guarantees in Lean. Claimed results include live-supervision gains on CooperBench from 28.8%→54.7%, plus faster counterfactual optimization and tree-RL rollouts.
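As a rough illustration of the Git analogy, the sketch below keeps an append-only log of effects per branch and supports replay, branching, and rollback. `TraceStore` and the counter-style effects are hypothetical and are not Shepherd's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Effect = Tuple[str, int]  # (counter name, increment), a stand-in for real agent effects


@dataclass
class TraceStore:
    """Append-only effect logs with named branches, loosely modelled on Git refs."""

    branches: Dict[str, List[Effect]] = field(default_factory=lambda: {"main": []})

    def record(self, branch: str, effect: Effect) -> None:
        """Append one effect to a branch's trace."""
        self.branches[branch].append(effect)

    def branch(self, source: str, new_name: str) -> None:
        """Fork a new branch that starts with a copy of the source trace."""
        self.branches[new_name] = list(self.branches[source])

    def rollback(self, branch: str, steps: int) -> None:
        """Drop the last `steps` effects from a branch."""
        if steps > 0:
            del self.branches[branch][-steps:]

    def replay(self, branch: str) -> Dict[str, int]:
        """Rebuild state deterministically by re-applying the trace from scratch."""
        state: Dict[str, int] = {}
        for key, delta in self.branches[branch]:
            state[key] = state.get(key, 0) + delta
        return state


store = TraceStore()
store.record("main", ("files_written", 1))
store.record("main", ("files_written", 1))
store.branch("main", "experiment")        # explore an alternative from this point
store.record("experiment", ("tests_run", 3))
store.rollback("main", 1)                 # undo the last effect on main only

main_state = store.replay("main")
experiment_state = store.replay("experiment")
print(main_state, experiment_state)
```

Because state is always derived from the trace, rollback and branching never mutate history on other branches, which is the property the Git comparison is pointing at.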

Product and Model Releases: Multimodal, Video, Retrieval, and Embeddings

Perceptron Mk1 was the most substantive new model release in the set: @perceptroninc launched Perceptron Mk1 as a model for frontier video and embodied reasoning, with native video support at up to 2 FPS, temporal grounding, multimodal in-context learning, and structured spatial outputs. OpenRouter’s summary notes a 32k multimodal context and first-class outputs like points, boxes, polygons, and clips. The release is framed less as a generic VLM and more as a physical-world reasoning stack.

Google and Meta both pushed multimodal interaction layers rather than standalone model specs: Google DeepMind’s AI-enabled mouse pointer demos reimagine the cursor as a contextual pointing interface tied to Gemini, allowing users to point at on-screen content and speak shorthand instructions. In parallel, Meta announced Meta AI voice conversations powered by Muse Spark, adding interruption, language switching, image generation, and live camera-grounded interaction.

Embedding and retrieval model updates were notable: Jina released jina-embeddings-v5-omni, a universal embedding model for text, images, audio, and video, in 1.57B and 0.95B variants, both with Matryoshka truncation and backward compatibility with existing v5-text indexes. Meta quietly released Sapiens2, a family of human-centric high-resolution ViTs spanning 0.1B→5B params for pose estimation, segmentation, normals, and pointmaps.
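Matryoshka truncation generally means keeping only the leading dimensions of an embedding and re-normalizing. The sketch below shows that generic operation on plain Python lists; which dimensions jina-embeddings-v5-omni supports, and whether it expects re-normalization, is not stated in the source, so treat those details as assumptions.

```python
import math
from typing import List


def truncate_embedding(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalize
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0, 0.0]          # toy 4-dimensional embedding
short = truncate_embedding(full, 2)   # keep only the leading 2 dimensions
print(short)
```

Shorter vectors reduce index size and search cost at some loss in retrieval quality, so the truncation length is a tuning choice.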

Diffusion and image tooling kept moving: Hugging Face’s Diffusers 0.38.0 added new pipelines including Ace-Step 1.5, LongCat-AudioDiT, and Ernie-Image, plus support for Flash Attention 4, FlashPack loading, and Ring Anything for context parallelism. Other research releases included ELF: Embedded Language Flows, a continuous-space text diffusion model, and Tencent’s Pixal3D for pixel-aligned 3D generation.

Agents, Tooling, and Developer Workflow

Agent products are shifting from demos to operational platforms: OpenAI teased Symphony as a system where every open task gets a running Codex agent, and separately highlighted computer use for Codex to work across apps without full takeover. LangChain re-open-sourced its revamped Chat LangChain app, describing it as a production Q&A agent handling nearly 2T tokens/week.

Long-running-agent state management is becoming a first-class systems problem: LangGraph’s new DeltaChannel snapshots aim to replace full-state checkpointing for scalable durable execution; LangChain says the same mechanism now powers message histories and file storage in deepagents v0.6. The broader pattern also shows up in Google’s Gemini Interactions API guide, where encrypted thought signatures preserve reasoning context across turns in both stateful and stateless modes without forcing developers to manage signature injection manually.
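The general idea of storing deltas instead of full-state checkpoints can be sketched as below. `DeltaLog` and its helpers are hypothetical names for illustration and do not reflect LangGraph's DeltaChannel implementation or API.

```python
from typing import Any, Dict, List

State = Dict[str, Any]
_DELETED = object()  # sentinel marking a key removed between checkpoints


def diff(old: State, new: State) -> State:
    """Record only the keys that changed, were added, or were removed."""
    delta = {k: v for k, v in new.items() if k not in old or old[k] != v}
    for k in old:
        if k not in new:
            delta[k] = _DELETED
    return delta


def apply_delta(state: State, delta: State) -> State:
    """Return a new state with one delta applied."""
    result = dict(state)
    for k, v in delta.items():
        if v is _DELETED:
            result.pop(k, None)
        else:
            result[k] = v
    return result


class DeltaLog:
    """Stores one base state plus per-step deltas (shallow copies only)."""

    def __init__(self, base: State) -> None:
        self.base = dict(base)
        self.deltas: List[State] = []
        self._latest = dict(base)

    def checkpoint(self, state: State) -> None:
        self.deltas.append(diff(self._latest, state))
        self._latest = dict(state)

    def restore(self, step: int) -> State:
        """Rebuild the state as of checkpoint number `step` (0 is the base)."""
        state = dict(self.base)
        for delta in self.deltas[:step]:
            state = apply_delta(state, delta)
        return state


log = DeltaLog({"messages": 0, "plan": "draft"})
log.checkpoint({"messages": 1, "plan": "draft"})
log.checkpoint({"messages": 2})
print(log.deltas[0], log.restore(1), log.restore(2))
```

The trade-off in this sketch is that checkpoints become small when little changes per step, while restoring a late step requires replaying every earlier delta.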

Synthetic data and RL environment generation are being operationalized: @Vtrivedy10 offered a useful practitioner perspective: targeted synthetic data extraction from model weights is hard at scale, especially for underrepresented distributions like long sequences, and effective pipelines need programmatic tests, verifiers, judges, and agentic long-horizon framing. On the infrastructure side, Tau2-Infinity formalizes autonomous mining of hard tool-use tasks for RL post-training via DAG walks or world-generation from failure hypotheses.

Top tweets (by engagement, filtered for technical relevance):

Gemini as an OS-level intelligence layer: Google’s Gemini Intelligence, Googlebook, and AI pointer demos collectively point to agentic UX moving from chat windows into the operating system.

Isomorphic Labs funding: @demishassabis announced $2.1B in new funding for AI-driven drug discovery, one of the largest capital commitments in this dataset tied directly to an applied AI platform.

Speech-to-speech benchmarking: Artificial Analysis’ τ-Voice benchmark found even the best S2S models solve only about half of realistic customer service scenarios, with Grok Voice Think Fast 1.0 leading at 52.1%.

Claude Opus 4.7 fast mode: Anthropic’s fast mode release reached APIs and Claude Code, with Cursor noting 2.5× speed at 6× cost, a concrete new point on the latency/price frontier.

Security, Supply Chain, and Safer Coding

The most urgent operational story was the Mini Shai-Hulud supply-chain attack: @IntCyberDigest reported the campaign had expanded beyond TanStack to hit OpenSearch, Mistral AI, Guardrails AI, UiPath, and others across npm and PyPI, specifically targeting AI developer tooling. The noteworthy technical detail is persistence: it allegedly hooks into Claude Code (.claude/settings.json) and VS Code (.vscode/tasks.json) so the compromise can re-execute on future tool events even after package removal. Guardrails AI later confirmed its 0.10.1 package was compromised and quarantined within about 2 hours.
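One way to start triage is to list every copy of the two configuration files named above so they can be reviewed by hand. The sketch below does only that. The marker substrings are an assumption about where auto-run entries tend to appear, and a hit is a prompt for manual review rather than evidence of compromise.

```python
import os
from typing import List, Tuple

# Paths named in the report, relative to a project root.
SUSPECT_FILES = (
    os.path.join(".claude", "settings.json"),
    os.path.join(".vscode", "tasks.json"),
)

# Assumed substrings worth a closer look (hook definitions, run-on-open tasks).
MARKERS = ("hooks", "folderOpen")


def find_tool_configs(root: str) -> List[Tuple[str, List[str]]]:
    """Return (path, markers found) for each suspect config file under `root`."""
    findings: List[Tuple[str, List[str]]] = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        for rel in SUSPECT_FILES:
            path = os.path.join(dirpath, rel)
            if os.path.isfile(path):
                with open(path, encoding="utf-8", errors="replace") as fh:
                    text = fh.read()
                findings.append((path, [m for m in MARKERS if m in text]))
    return findings
```

Calling `find_tool_configs` on a workspace directory returns the files to inspect. Because the reported persistence survives package removal, these files need checking even after the compromised dependency is gone.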

Actionable mitigations surfaced quickly: @ramimacisabird noted that beyond minimumReleaseAge, teams should enable blockExoticSubdeps to prevent remote GitHub references from slipping into dependency graphs. @elithrar reiterated that GitHub’s pull_request_target remains one of the sharpest CI/CD footguns for fork-based PR automation. And at the workstation level, @andersonbcdefg recommended moving secrets out of ubiquitous local .env files into a proper secrets manager.
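For the pull_request_target point, a quick audit can flag which workflow files mention the trigger at all. This is a naive text scan written for illustration, with hypothetical function names; it does not parse YAML and cannot tell a safe use from an unsafe one.

```python
import os
from typing import List


def uses_pull_request_target(workflow_text: str) -> bool:
    """True if any non-comment part of the workflow mentions pull_request_target."""
    for line in workflow_text.splitlines():
        code = line.split("#", 1)[0]  # naive comment strip; ignores '#' inside strings
        if "pull_request_target" in code:
            return True
    return False


def flag_workflows(repo_root: str) -> List[str]:
    """List workflow file names under .github/workflows that use the trigger."""
    wf_dir = os.path.join(repo_root, ".github", "workflows")
    flagged: List[str] = []
    if not os.path.isdir(wf_dir):
        return flagged
    for name in sorted(os.listdir(wf_dir)):
        if name.endswith((".yml", ".yaml")):
            with open(os.path.join(wf_dir, name), encoding="utf-8") as fh:
                if uses_pull_request_target(fh.read()):
                    flagged.append(name)
    return flagged
```

Flagged workflows are the ones to review for steps that check out or execute code from the fork while repository secrets are available.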

Safer codegen is becoming its own research track: Stanford-aligned work on SecureForge targets vulnerability discovery/prevention in LLM-generated code via prompt optimization, while the corresponding paper listing frames it as a bridge between codegen and security evaluation. The broader point: coding agents are now strong enough that supply-chain hardening and secure-generation evaluation need to be treated as core infra, not side concerns.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 MTP and Long-Context Local Evals

MTP on Unsloth (Activity: 727): The image is a Hugging Face activity screenshot showing Unsloth AI publishing/updating MTP-preserved GGUF builds: unsloth/Qwen3.6-27B-GGUF-MTP and unsloth/Qwen3.6-35B-A3B-GGUF-MTP. The technical significance is that these GGUFs retain the MTP (multi-token prediction) auxiliary layer, but users reportedly still need to check out and build a specific llama.cpp MTP PR rather than relying on default llama.cpp support. One commenter hit a runtime/model-load assertion, GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0"), suggesting tooling or metadata support is still fragile for these MTP GGUFs. Commenters are mainly waiting on upstream inference support, with one joking about constantly refreshing the llama.cpp and vLLM GitHub repos. There is also uncertainty over whether MTP is supported "out of the box" in llama.cpp; the post indicates it is not yet.

A user compiling/running the new 27B GGUF model reports a hard assertion failure in qwen35_mtp.cpp: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0") failed. This suggests the GGUF/model metadata being loaded is missing or not exposing nextn_predict_layers, which is required for Qwen3.5 MTP execution in the current implementation.
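The assertion amounts to requiring a positive nextn_predict_layers value at load time. The sketch below mirrors that condition on a plain dictionary of metadata. How the field is actually keyed inside a GGUF file is an assumption here (the `example.` prefix is made up), and reading real files would need a GGUF reader that is not shown.

```python
from typing import Mapping, Optional


def find_nextn_layers(metadata: Mapping[str, object]) -> Optional[int]:
    """Return the value of the first key ending in 'nextn_predict_layers', if any."""
    for key, value in metadata.items():
        if key.endswith("nextn_predict_layers"):
            return int(value)  # type: ignore[arg-type]
    return None


def mtp_ready(metadata: Mapping[str, object]) -> bool:
    """Mirror of the asserted condition: the field must exist and be > 0."""
    layers = find_nextn_layers(metadata)
    return layers is not None and layers > 0


# Hypothetical metadata dictionaries for illustration.
with_mtp = {"example.block_count": 64, "example.nextn_predict_layers": 1}
without_mtp = {"example.block_count": 64}

print(mtp_ready(with_mtp), mtp_ready(without_mtp))
```

A file that fails this kind of check would be consistent with the reported failure, whether the field is missing from the export or simply not read by the loader.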

Several commenters are tracking whether llama.cpp and vLLM have landed native MTP support, with one asking explicitly about llama.cpp.



Scaling laws and training metrics are being re-examined: @che_shr_cat argues the classic “20 tokens per parameter” framing is tokenizer-dependent and that scaling should be measured in bytes, not tokens. Separately, @JJitsev emphasized that prescriptive scaling laws are valuable not just for prediction, but as a systematic basis for comparing learning procedures across scales.
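The tokenizer dependence is easy to see with a little arithmetic. The sketch below assumes a hypothetical 1B-parameter model and two made-up tokenizers with different average bytes per token; none of these numbers come from the cited thread. Under the same "20 tokens per parameter" budget, the two tokenizers imply different amounts of raw text.

```python
def training_bytes(params: float, tokens_per_param: float, bytes_per_token: float) -> float:
    """Raw bytes of text consumed under a tokens-per-parameter budget."""
    return params * tokens_per_param * bytes_per_token


PARAMS = 1e9  # hypothetical 1B-parameter model
RULE = 20.0   # the "20 tokens per parameter" heuristic mentioned above

# Hypothetical compression rates; real tokenizers differ by vocabulary and language.
coarse = training_bytes(PARAMS, RULE, bytes_per_token=4.5)
fine = training_bytes(PARAMS, RULE, bytes_per_token=3.0)

print(f"coarse tokenizer: {coarse / 1e9:.0f} GB of text")
print(f"fine tokenizer:   {fine / 1e9:.0f} GB of text")
print(f"ratio: {coarse / fine:.2f}x")
```

With these assumed rates, the same token budget covers 1.5 times more raw text under the coarser tokenizer, which is the kind of gap a byte-based measure would remove.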

Training-time-only efficiency tricks are getting more interesting: Lighthouse Attention from Nous is highlighted as a subquadratic training wrapper around vanilla attention that can be removed near the end of training after a recovery phase, preserving standard deployment-time inference while reducing long-context pretraining cost. In a similar spirit, Renderers from Prime Intellect addresses the token/message impedance mismatch between RL trainers and agent environments, claiming >3× throughput on popular open models.

Inference Systems, Serving Stacks, and Runtime Infrastructure

Blackwell racks are emerging as the reference platform for large-MoE serving: Perplexity published details on serving post-trained Qwen3 235B on NVIDIA GB200 NVL72 systems, arguing GB200 is a major inference step up over Hopper for large MoEs. Their benchmarks cite NVLS all-reduce latency dropping from 586.1µs on H200 to 313.3µs on GB200, and MoE prefill combine at EP=4 dropping from 730.1µs to 438.5µs, with better decode throughput at high token rates. @AravSrinivas framed this as materially changing prefill/decode disaggregation for serving large MoEs.

Inference orchestration is increasingly specialized, not “just Kubernetes”: Modal argues inference needs a dedicated stack, citing work on compute management, cloud-native caching, CRIU, and GPU checkpointing. That positioning got an immediate real-world endorsement from Perceptron, which said all Mk1 inference runs on Modal because native video, structured outputs, and hybrid reasoning create unusual cold-start and scaling requirements.

OSS inference economics continue to improve fast: SemiAnalysis reported that clustering multiple B200 8-GPU machines over RoCEv2 CX-7 with PD disaggregation can lift per-GPU token throughput by up to 7×, implying comparable cost-per-token reductions. On the vector DB side, Qdrant 1.18 added TurboQuant, claiming recall near scalar quantization with 2× less memory, alongside memory monitoring and named-vector lifecycle operations.

Agent runtimes are becoming version-control-like substrates: A standout systems idea was Stanford’s Shepherd, summarized by @ai_satoru_chan, which treats agent execution more like Git: first-class tasks, effects, scopes, and traces; exact replay; branching; rollback; and formal guarantees in Lean. Claimed results include live-supervision gains on CooperBench from 28.8%→54.7%, plus faster counterfactual optimization and tree-RL rollouts.

Product and Model Releases: Multimodal, Video, Retrieval, and Embeddings

Perceptron Mk1 was the most substantive new model release in the set: @perceptroninc launched Perceptron Mk1 as a model for frontier video and embodied reasoning, with native video support at up to 2 FPS, temporal grounding, multimodal in-context learning, and structured spatial outputs. OpenRouter’s summary notes a 32k multimodal context and first-class outputs like points, boxes, polygons, and clips. The release is framed less as a generic VLM and more as a physical-world reasoning stack.

Google and Meta both pushed multimodal interaction layers rather than standalone model specs: Google DeepMind’s AI-enabled mouse pointer demos reimagine the cursor as a contextual pointing interface tied to Gemini, allowing users to point at on-screen content and speak shorthand instructions. In parallel, Meta announced Meta AI voice conversations powered by Muse Spark, adding interruption, language switching, image generation, and live camera-grounded interaction.

Embedding and retrieval model updates were notable: Jina released jina-embeddings-v5-omni, a universal embedding model for text, images, audio, and video, in 1.57B and 0.95B variants, both with Matryoshka truncation and backward compatibility with existing v5-text indexes. Meta quietly released Sapiens2, a family of human-centric high-resolution ViTs spanning 0.1B→5B params for pose estimation, segmentation, normals, and pointmaps.
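For readers unfamiliar with Matryoshka truncation, the mechanical step is simple: keep a prefix of the embedding and renormalize it. The sketch below uses toy 4-dimensional vectors and a hypothetical helper name; it does not call the Jina model or reflect its actual dimensions, and truncation only preserves quality for models trained with a Matryoshka objective.

```python
import math


def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_unit(a: list[float], b: list[float]) -> float:
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for full-size embeddings.
doc = [3.0, 4.0, 1.0, 2.0]
query = [4.0, 3.0, 0.0, 1.0]

doc_small = truncate_embedding(doc, 2)
query_small = truncate_embedding(query, 2)
print(doc_small, cosine_unit(doc_small, query_small))
```

The practical appeal is that one stored index can serve several size and accuracy trade-offs, since shorter vectors are just prefixes of the longer ones.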

Diffusion and image tooling kept moving: Hugging Face’s Diffusers 0.38.0 added new pipelines including Ace-Step 1.5, LongCat-AudioDiT, and Ernie-Image, plus support for Flash Attention 4, FlashPack loading, and Ring Anything for context parallelism. Other research releases included ELF: Embedded Language Flows, a continuous-space text diffusion model, and Tencent’s Pixal3D for pixel-aligned 3D generation.

Agents, Tooling, and Developer Workflow

Agent products are shifting from demos to operational platforms: OpenAI teased Symphony as a system where every open task gets a running Codex agent, and separately highlighted computer use for Codex to work across apps without full takeover. LangChain re-open-sourced its revamped Chat LangChain app, describing it as a production Q&A agent handling nearly 2T tokens/week.

Long-running-agent state management is becoming a first-class systems problem: LangGraph’s new DeltaChannel snapshots aim to replace full-state checkpointing for scalable durable execution; LangChain says the same mechanism now powers message histories and file storage in deepagents v0.6. The broader pattern also shows up in Google’s Gemini Interactions API guide, where encrypted thought signatures preserve reasoning context across turns in both stateful and stateless modes without forcing developers to manage signature injection manually.
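The general idea of replacing full-state checkpoints with deltas can be illustrated with a toy append-only message history. This is a minimal sketch under stated assumptions, not LangGraph's DeltaChannel API: the `DeltaLog` class and its method names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DeltaLog:
    """Append-only history stored as a base snapshot plus per-step deltas."""

    base: list[str] = field(default_factory=list)
    deltas: list[list[str]] = field(default_factory=list)

    def checkpoint(self, new_messages: list[str]) -> int:
        """Record only what was appended in this step; return the step id."""
        self.deltas.append(list(new_messages))
        return len(self.deltas)

    def restore(self, step: int) -> list[str]:
        """Rebuild the full history as of `step` by replaying deltas in order."""
        if not 0 <= step <= len(self.deltas):
            raise IndexError("unknown step")
        state = list(self.base)
        for delta in self.deltas[:step]:
            state.extend(delta)
        return state

    def stored_items(self) -> int:
        """Number of messages physically stored across base and deltas."""
        return len(self.base) + sum(len(d) for d in self.deltas)


log = DeltaLog()
log.checkpoint(["user: hi"])
log.checkpoint(["assistant: hello", "user: summarize the repo"])
log.checkpoint(["assistant: done"])

# Cost if every step had stored a complete copy of the history instead.
full_copy_cost = sum(len(log.restore(step)) for step in range(1, 4))
print(log.restore(2), log.stored_items(), full_copy_cost)
```

In this toy run the delta log stores 4 messages where per-step full copies would store 8; the gap grows with history length, at the price of replay work on restore.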

Synthetic data and RL environment generation are being operationalized: @Vtrivedy10 offered a useful practitioner perspective: targeted synthetic data extraction from model weights is hard at scale, especially for underrepresented distributions like long sequences, and effective pipelines need programmatic tests, verifiers, judges, and agentic long-horizon framing. On the infrastructure side, Tau2-Infinity formalizes autonomous mining of hard tool-use tasks for RL post-training via DAG walks or world-generation from failure hypotheses.

Top tweets (by engagement, filtered for technical relevance):

Gemini as an OS-level intelligence layer: Google’s Gemini Intelligence, Googlebook, and AI pointer demos collectively point to agentic UX moving from chat windows into the operating system.

Isomorphic Labs funding: @demishassabis announced $2.1B in new funding for AI-driven drug discovery, one of the largest capital commitments in this dataset tied directly to an applied AI platform.

Speech-to-speech benchmarking: Artificial Analysis’ τ-Voice benchmark found even the best S2S models solve only about half of realistic customer service scenarios, with Grok Voice Think Fast 1.0 leading at 52.1%.

Claude Opus 4.7 fast mode: Anthropic’s fast mode release reached APIs and Claude Code, with Cursor noting 2.5× speed at 6× cost, a concrete new point on the latency/price frontier.

Security, Supply Chain, and Safer Coding

The most urgent operational story was the Mini Shai-Hulud supply-chain attack: @IntCyberDigest reported the campaign had expanded beyond TanStack to hit OpenSearch, Mistral AI, Guardrails AI, UiPath, and others across npm and PyPI, specifically targeting AI developer tooling. The noteworthy technical detail is persistence: it allegedly hooks into Claude Code (.claude/settings.json) and VS Code (.vscode/tasks.json) so the compromise can re-execute on future tool events even after package removal. Guardrails AI later confirmed its 0.10.1 package was compromised and quarantined within about 2 hours.
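Since the reported persistence lives in two well-known file locations, a quick local audit is feasible. The sketch below is a generic helper, not a detector for this specific campaign: it walks a directory tree, finds `.claude/settings.json` and `.vscode/tasks.json`, and lists every string stored under a `"command"` key so a human can review it. The function names and the demo hook content are hypothetical, and files containing comments will fail strict JSON parsing and are flagged for manual review.

```python
import json
import tempfile
from pathlib import Path

# (parent directory, file name) pairs for the two locations named in the report.
SUSPECT_FILES = {(".claude", "settings.json"), (".vscode", "tasks.json")}


def collect_commands(node) -> list[str]:
    """Recursively gather string values stored under any "command" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "command" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_commands(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_commands(item))
    return found


def audit(root: str) -> dict[str, list[str]]:
    """Map each suspect file under `root` to the commands it declares."""
    report: dict[str, list[str]] = {}
    root_path = Path(root)
    for path in sorted(root_path.rglob("*.json")):
        if (path.parent.name, path.name) not in SUSPECT_FILES:
            continue
        rel = path.relative_to(root_path).as_posix()
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            report[rel] = ["<unparseable, review manually>"]
            continue
        report[rel] = collect_commands(data)
    return report


# Demo on a throwaway directory with made-up hook content.
with tempfile.TemporaryDirectory() as tmp:
    hook_file = Path(tmp) / "project" / ".claude" / "settings.json"
    hook_file.parent.mkdir(parents=True)
    demo = {"hooks": {"PostToolUse": [{"hooks": [{"type": "command", "command": "node payload.js"}]}]}}
    hook_file.write_text(json.dumps(demo), encoding="utf-8")
    demo_report = audit(tmp)

print(demo_report)
```

Any command the audit surfaces that the team did not add deliberately is worth investigating, since removing the compromised package alone would not remove such an entry.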

Actionable mitigations surfaced quickly: @ramimacisabird noted that beyond minimumReleaseAge, teams should enable blockExoticSubdeps to prevent remote GitHub references from slipping into dependency graphs. @elithrar reiterated that GitHub’s pull_request_target remains one of the sharpest CI/CD footguns for fork-based PR automation. And at the workstation level, @andersonbcdefg recommended moving secrets out of ubiquitous local .env files into a proper secrets manager.
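The pull_request_target risk can be triaged with a simple text scan. The sketch below is a heuristic under stated assumptions, not a complete checker: it flags workflow files that both use the `pull_request_target` trigger and reference the pull request's head commit or branch, a combination that can run fork-controlled code with access to repository secrets. It matches plain text instead of parsing YAML, so it can produce false positives (for example, matches inside comments) and can miss indirect references; the function name and demo files are hypothetical.

```python
import re
import tempfile
from pathlib import Path

TRIGGER = re.compile(r"\bpull_request_target\b")
# Heuristic: expressions that point at the fork's head commit or branch.
PR_HEAD = re.compile(r"github\.event\.pull_request\.head\.(sha|ref)|github\.head_ref")


def risky_workflows(repo_root: str) -> list[str]:
    """Return names of workflow files that combine the trigger with PR-head references."""
    flagged = []
    workflows = Path(repo_root) / ".github" / "workflows"
    if not workflows.is_dir():
        return flagged
    for path in sorted(workflows.iterdir()):
        if path.suffix not in {".yml", ".yaml"}:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if TRIGGER.search(text) and PR_HEAD.search(text):
            flagged.append(path.name)
    return flagged


UNSAFE = """on: pull_request_target
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      - run: npm test
"""
SAFE = """on: pull_request
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm test
"""

# Demo on a throwaway repository layout.
with tempfile.TemporaryDirectory() as tmp:
    wf_dir = Path(tmp) / ".github" / "workflows"
    wf_dir.mkdir(parents=True)
    (wf_dir / "unsafe.yml").write_text(UNSAFE, encoding="utf-8")
    (wf_dir / "safe.yml").write_text(SAFE, encoding="utf-8")
    flagged = risky_workflows(tmp)

print(flagged)
```

A flagged file is a prompt for manual review, not proof of a vulnerability; some workflows use this trigger safely by never executing code from the fork.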

Safer codegen is becoming its own research track: Stanford-aligned work on SecureForge targets vulnerability discovery/prevention in LLM-generated code via prompt optimization, while the corresponding paper listing frames it as a bridge between codegen and security evaluation. The broader point: coding agents are now strong enough that supply-chain hardening and secure-generation evaluation need to be treated as core infra, not side concerns.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 MTP and Long-Context Local Evals

MTP on Unsloth (Activity: 727): The image is a Hugging Face activity screenshot showing Unsloth AI publishing and updating MTP-preserved GGUF builds: unsloth/Qwen3.6-27B-GGUF-MTP and unsloth/Qwen3.6-35B-A3B-GGUF-MTP. The technical significance is that these GGUFs retain the MTP (multi-token prediction) auxiliary layers, but users reportedly still need to check out and build a specific llama.cpp MTP PR rather than rely on default llama.cpp support. One commenter hit a runtime/model-load assertion, GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0"), suggesting that tooling or metadata support is still fragile for these MTP GGUFs. Commenters are mainly waiting on upstream inference support, with one joking about constantly refreshing the llama.cpp and vLLM GitHub repos. There is also uncertainty over whether MTP is supported "out of the box" in llama.cpp; the post indicates it is not yet.

Several commenters are tracking whether llama.cpp and vLLM have landed native MTP support, with one explicitly asking whether llama.cpp
