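Smol AI News · May 29, 2026 14:44 · about 16 min read

Nothing major happened today

#LLM #Claude #Anthropic #Agent #Benchmark

TL;DR

Claude Opus 4.8 drew mixed benchmark results, but it is a meaningful quality-of-life release: it is more cooperative in practical use and adds mid-conversation system instructions, though complaints about API pricing remain.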

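AI deep analysis · May 30, 2026 01:01

Importance (5-point scale)

Depth: 40%

Key points

Claude Opus 4.8: performance evaluation and trade-offs

Efficiency improved on frontend and code tests, but document parsing regressed on content faithfulness and charts. Overall, the release is rated an incremental improvement rather than a breakthrough.

Platform features and cost structure

Mid-conversation system instructions and system-role updates were added, which suits long-running agent sessions. High API pricing is still flagged as a problem, and on economics competitors sometimes come out ahead.

Quality improvements in practical use

In code generation the model is less over-agentic and behaves more cooperatively, and users report a better experience in real work.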

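Impact analysis

This release suggests that AI model development is shifting from raw benchmark gains toward stability in real environments and cost efficiency. Developers and companies need to stay aware of the risk of regressions on specific tasks while weighing the value of more cooperative behavior, and to rebuild their model-selection criteria accordingly.

Editorial comment

Benchmark numbers look flat, but once day-to-day usability and cost structure are taken into account, this is an important turning point for practical adoption. Developers are being pushed to shift from a score-first mindset to a practicality-first one.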

a quiet day.

AI News for 5/28/2026-5/29/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Claude Opus 4.8 Rollout, Benchmark Friction, and API Ergonomics

Opus 4.8 landed into a noisy, mixed eval landscape: multiple independent benches converged on “incremental but not dominant.” @arena pushed 200+ frontend/code tests comparing Opus 4.8 against prior Opus variants, Gemini, and GLM; @theo reported CursorBench shows it as more efficient but slightly worse than 4.7 within margin of error; @jerryjliu0 and @llama_index found small gains on tables/layout but regressions on content faithfulness/charts in document parsing; @scaling01 said no progress on ALE-Bench and separately flagged interesting failure modes on LisanBench. On the positive side, @jeremyphoward found 4.8 less over-agentic and more cooperative than 4.7/GPT-5.5 in coding, while @leo_linsky called it a tangible product improvement over prior Anthropic releases.

Anthropic also shipped useful platform-level changes: @ClaudeDevs announced mid-conversation system instructions without breaking prompt cache, plus authoritative mid-conversation system-role updates, which matters for long-running agent sessions and cost control. But pricing remains a major complaint: @jeremyphoward argued Anthropic has done little for API affordability, preferring GPT-5.5 partly because subscription/API economics are easier to justify. Overall takeaway: 4.8 looks like a meaningful quality-of-life release for real use, not a clean benchmark reset.
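Why appending a system-role message matters for caching can be illustrated with a toy model. The sketch below is not Anthropic's API: the `Message` tuple and `cached_prefix_len` helper are hypothetical, and the only assumption is that a prompt cache can reuse the longest unchanged prefix of a conversation.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content); a hypothetical shape for illustration


def cached_prefix_len(previous: List[Message], current: List[Message]) -> int:
    """Count the leading messages shared by two requests (toy stand-in for a prefix-cache hit)."""
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        shared += 1
    return shared


history: List[Message] = [
    ("system", "You are a coding agent."),
    ("user", "Refactor utils.py."),
    ("assistant", "Done. Anything else?"),
]

# Option 1: rewrite the original system prompt. The very first message changes,
# so nothing from the earlier request can be reused.
rewritten = [("system", "You are a coding agent. Ask before deleting files.")] + history[1:]

# Option 2: append a new system-role message mid-conversation. The earlier
# messages are untouched, so the whole previous prefix is still reusable.
appended = history + [("system", "Ask before deleting files.")]

rewrite_hit = cached_prefix_len(history, rewritten)
append_hit = cached_prefix_len(history, appended)
print(rewrite_hit, append_hit)
```

In this toy model the rewrite reuses none of the three earlier messages while the append reuses all of them, which is the cost-control argument for long agent sessions.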

Agent Harnesses, Multi-Turn RL Bugs, and the Infrastructure Around Autonomy

A subtle but important RL failure mode got called out: @ClementDelangue highlighted a Hugging Face deep-dive on why many tool-using, multi-turn RL training loops are silently broken. The core bug: decoding model output, parsing tool calls, then re-tokenizing the updated conversation can change tokenization, so gradients are applied to sequences the model never actually sampled. The proposed fix is a strict “Token-In, Token-Out” rule: never re-encode sampled tokens; keep a single token buffer across turns. @johnschulman2 reinforced the broader point that renderers are foundational infrastructure between messages and tokens, with failure modes spanning train/test mismatch, caching inefficiency, and prompt injection risk.
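The re-tokenization bug can be reproduced with a three-entry toy vocabulary. This is a minimal sketch, not the Hugging Face write-up's code: the vocabulary, the greedy longest-match `encode`, and the variable names are all assumptions chosen to make the mismatch visible.

```python
from typing import Dict, List

# Toy vocabulary (hypothetical): "ab" exists as a single merged token.
VOCAB: Dict[str, int] = {"a": 0, "b": 1, "ab": 2}
ID_TO_TEXT: Dict[int, str] = {i: t for t, i in VOCAB.items()}


def encode(text: str) -> List[int]:
    """Greedy longest-match tokenizer over the toy vocabulary."""
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        for length in (2, 1):  # try the longest piece first
            piece = text[pos:pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


def decode(ids: List[int]) -> str:
    return "".join(ID_TO_TEXT[i] for i in ids)


# The model actually sampled two tokens: "a" then "b".
sampled = [0, 1]

# Broken loop: decode to text, then re-encode. The tokenizer merges "ab",
# so the training sequence no longer matches what was sampled.
reencoded = encode(decode(sampled))

# Same problem once tool output is appended and the whole conversation is re-encoded.
retokenized_conversation = encode(decode(sampled) + "b")

# Token-In, Token-Out: keep the sampled ids untouched and encode only the NEW text.
tool_output_ids = encode("b")
buffer = sampled + tool_output_ids

print(sampled, reencoded, retokenized_conversation, buffer)
```

The buffer keeps the sampled ids as its exact prefix, so any log-probabilities recorded at sampling time still line up with the tokens used for the gradient update.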

Harness design is becoming its own optimization discipline: @omarsar0 surfaced work on Effective Feedback Compute (EFC), claiming raw token/tool counts explain agent success poorly while EFC reaches R² up to 0.99, implying harness quality matters more than gross activity. This lines up with productized tuning efforts like @LangChain, where Deep Agents v0.6 makes harness profiles first-class to get strong performance from Qwen/Kimi/DeepSeek at 20x+ lower cost than frontier APIs, and @hwchase17 explicitly framing “different models need different prompts/tools.” @vllm_project shipped native weight syncing APIs and improved pause/resume for async RL, and later added fastokens, a Rust BPE tokenizer to reduce CPU tokenization bottlenecks in long-context/agentic workloads.

Debate is shifting from “single vs multi-agent” to where the abstraction pays: @OfirPress argued current multi-agent systems are mostly speedups, not capability unlocks; @scaling01 took the opposite view, expecting swarm-style training to yield better planning and superintelligence-like behavior. Either way, the practical trend is clear: more teams are building around agent observability, traces, and continual improvement loops, e.g. @Vtrivedy10 on mining production traces for SFT/distillation and long-horizon continual learning.

Open Models, Local AI, and the OSS Toolchain Tightening Up

Local-first and open-weight momentum continues to rise: @LangChain said 1 in 3 AI teams ran an open-weights model in April 2026, up from 1 in 5 nine months earlier; @EpochAIResearch estimated open-weight models now lag frontier proprietary models by about four months. On the toolchain side, @ggerganov launched llama.app, giving llama.cpp an official website, a unified installer, and a single llama entrypoint aimed at easier local deployment and third-party agent integration. @ollama announced OpenJarvis as a local-first personal AI via Ollama, explicitly tied to Stanford/Hazy’s “Intelligence Per Watt” framing.

Open infrastructure is getting more enterprise-shaped: @ClementDelangue noted that ~50% of models and datasets on Hugging Face are now private, rising with HF’s storage/buckets offering; this is an important correction to the idea that HF is only public OSS infrastructure. @abidlabs showed Hugging Face Jobs replacing GitHub runners for CPU/serverless GPU CI. @DSPyOSS, @dbreunig, and others shipped a redesigned DSPy docs/front page ahead of a coming 4.0, focused on onboarding into programmable AI systems rather than pure prompting.

Licensing and permissiveness are becoming strategic levers: @kimmonismus highlighted NVIDIA moving its four open model families to Linux Foundation OpenMDW-1.1, reducing legal fragmentation across weights/code/docs/data. New permissive data releases also matter: @keshigeyan introduced GPIC, a 100M-pair permissive image corpus plus 1M-pair benchmark for visual generation, with explicit research + commercial usability.

Google/OpenAI Product Surface Expands: Managed Agents, Gemini Spark/Omni, and Codex on Windows

Google is widening the “managed agent” stack from API to consumer product: @_philschmid showed Managed Agents in the Gemini API: a single API call provisioning a sandboxed Linux environment with code execution, web access, and file I/O. On the consumer side, @GeminiApp rolled out Gemini Spark to U.S. AI Ultra subscribers as a 24/7 personal agent that can operate across a user’s digital ecosystem under direction. Google also kept pushing Gemini Omni multimodal generation/editing demos (example, product thread) and announced Google Flow Agent for creative workflows in video/film production (thread).

OpenAI’s Codex is moving closer to a persistent remote dev operator: @OpenAI and @OpenAIDevs added computer use on Windows, including remote steering from the ChatGPT mobile app. Follow-on UX improvements included stable identicons for background agents and search across prior chat content (@OpenAIDevs); @reach_vb summarized broader Codex updates around Windows control, mobile remote access, and profile/task stats. Separately, OpenAI updated gpt-5.5 instant to reduce sycophancy and improve factuality and multilingual performance, per @michpokrass.

This all points to more vertically integrated agent stacks: model + harness + sandbox + UI + remote control + pricing/quotas. Google is smoothing quotas on Gemini (@joshwoodward); OpenAI is expanding Codex’s operating surface; Cursor added auto-review mode with subagent-based approval routing (tweet). The common pattern is less “chatbot,” more managed execution environment with policy and memory.

Research and Systems Papers Worth Attention

Search, retrieval, and memory: @TheTuringPost highlighted Bidirectional Evolutionary Search (BES) from Harvard/MIT, combining forward search with backward decomposition and evolutionary operators; reported gains include Llama-3.2-3B-Instruct on MuSiQue from 4.0% to 7.0%. In retrieval, @_reachsumit pointed to Latent Terms, showing sparse BM25-ready features can be extracted from frozen dense retrievers via SAEs. @topk_io open-sourced Iso-ModernColBERT for more efficient late-interaction inference.

Continual learning and belief/state management: @HuggingPapers summarized BeliefTrack, claiming optimized belief-state management cuts long-horizon reasoning failures by 70%+. @AndrewLampinen argued the continual learning field over-focused on interference instead of positive transfer; @victor207755822 presented a second DeliAutoResearch SKILL paper focused on self-iteration and CL.

Multimodal/world models/robotics: NVIDIA-affiliated work included γ-World, a generative multi-agent world model streaming at 24 FPS (tweet), and minWM, a real-time interactive video world model framework (tweet). In robotics, @_akhaliq shared Qwen-VLA, and @inventorOli demoed Robostral’s language-following and manipulation improvements. For always-on proactive agents, @dair_ai surfaced work replacing LLM wake-up decisions with a 220MiB temporal-graph encoder, gaining +16.7 mean F1 while running 4–83x faster.

Top tweets (by engagement)

OpenAI / biology: @OpenAI announced Rosalind Biodefense, trusted-access biology tooling for public health and biodefense.

Google / consumer agents: @GeminiApp rolled out Spark, its always-on personal agent, to AI Ultra users in the U.S.

OpenAI / dev tools: @OpenAI posted on Codex Windows support, and @OpenAIDevs expanded computer use to Windows plus mobile remote steering.

llama.cpp UX milestone: @ggerganov launched llama.app with a unified installer and CLI entrypoint for local AI.

HF / RL correctness: @ClementDelangue amplified the Token-In, Token-Out warning for multi-turn RL with tools.

Open vs closed timing gap: @EpochAIResearch estimated open-weight models are now about 4 months behind the frontier.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Local LLM Performance: MoE Releases, Quants, VRAM Savings

StepFun 3.7 Flash (Activity: 637): StepFun released Step 3.7 Flash, a multimodal MoE with 196B total parameters, 11B active, and a built-in 1.8B ViT, advertised for high-throughput agent workflows up to 400 TPS and reportedly runnable locally with ~128GB RAM. Reported benchmarks position it unusually strongly for a flash-class/local model: SWE-Bench Pro 56.26%, DeepSearchQA F1 92.82%, HLE w/tools 47.2, plus large gains over Step 3.5 Flash on Terminal-Bench, Toolathlon, ClawEval, and other agentic/tool-use tasks. Direct model artifacts are available on Hugging Face in BF16, FP8, NVFP4, and GGUF, with day-0 llama.cpp support PR and related MTP work in llama.cpp#23274. Commenters characterize the model as technically odd: its hidden/thinking traces are described as nearly incoherent, but final answers can be “perfect” and competitive with much larger >1TB models; one user says the prior Step 3.5 “infinite thinking” issue appears fixed. There is cautious enthusiasm around local deployment, especially for users with 4x3090-class hardware, and appreciation that StepFun upstreamed llama.cpp support instead of only maintaining a fork.

StepFun released multiple Step-3.7-Flash checkpoints on Hugging Face: BF16 (Step-3.7-Flash), FP8 (Step-3.7-Flash-FP8), NVFP4 (Step-3.7-Flash-NVFP4), and GGUF (Step-3.7-Flash-GGUF). One user reports the prior Step 3.5 Flash “infinite thinking” issue appears fixed, making 3.7 more usable despite still having an odd intermediate reasoning style.
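A rough weight-size estimate helps relate the checkpoint formats to the ~128GB RAM claim. This is a back-of-the-envelope sketch under stated assumptions: bytes per parameter are the nominal values for each format (quantization scale factors and other overhead are ignored), the 196B figure is taken as the full weight count, and KV cache and activations are not included. GGUF is left out because its size depends on the quant level chosen.

```python
TOTAL_PARAMS = 196e9   # total parameters, from the release summary
ACTIVE_PARAMS = 11e9   # active parameters per token, from the release summary

# Nominal storage per parameter (assumption: ignores scale factors and metadata).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in decimal gigabytes."""
    return params * bytes_per_param / 1e9


estimates = {fmt: weight_gb(TOTAL_PARAMS, b) for fmt, b in BYTES_PER_PARAM.items()}
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

for fmt, gb in estimates.items():
    print(f"{fmt}: ~{gb:.0f} GB of weights")
print(f"active per token: {active_fraction:.1%} of parameters")
```

Under these assumptions only a roughly 4-bit format lands below 128GB, which is consistent with the local-run claim applying to quantized checkpoints rather than BF16.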

There is day-0 llama.cpp enablement via StepFun’s upstream PR: ggml-org/llama.cpp#23845, contrasting with Step 3.5’s fork-based support. A separate community PR for MTP support exists at ggml-org/llama.cpp#23274, though commenters note it needs updating for Step 3.7 and current master.

A vLLM nightly test of the NVFP4 checkpoint on 2x Pro 6k with 64 concurrent shallow-context requests reached about 2200 tok/s. The reported config used tensor-parallel-size 2, --enable-expert-parallel, --quantization modelopt, --kv-cache-dtype fp8, --reasoning-parser step3p5, and StepFun tool-call parsing; vLLM reported GPU KV cache size 1,667,645 tokens and max concurrency 6.36x for 262,144 tokens/request.
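The reported concurrency figure appears to be the KV cache capacity divided by the per-request token limit. The sketch below checks that arithmetic using only the two numbers quoted above; the even split across 64 requests is an illustrative assumption, not something the test report states.

```python
kv_cache_tokens = 1_667_645        # GPU KV cache size reported by vLLM
max_tokens_per_request = 262_144   # per-request token limit from the report

# How many full-length requests fit in the KV cache at once.
max_concurrency = kv_cache_tokens / max_tokens_per_request
print(f"max concurrency at full context: {max_concurrency:.2f}x")

# Illustrative assumption: if 64 requests shared the cache evenly,
# each could hold this many tokens before the cache fills up.
concurrent_requests = 64
tokens_per_request_at_64 = kv_cache_tokens // concurrent_requests
print(f"even share across {concurrent_requests} requests: {tokens_per_request_at_64} tokens")
```

This is why 64 shallow-context requests can run together even though only about six maximum-length requests would fit.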

Qwen 35B running on 12gb of VRAM in LM Studio at 120+ tokens/second. Works with Cline for 100% agentic coding. (Activity: 387): The post claims Qwen3.6-35B-A3B can run in LM Studio on an RTX 3080 Ti (12GB VRAM) at 120+ tok/s using the split GGUF quant DanyDA/unsloth_Qwen3.6-35B-A3B-UD-IQ1_M-GGUF-SPLIT, with all layers offloaded to GPU and both K/V cache quantization set to Q4_0 to fit a claimed 128k cont

ニュース一覧に戻る元記事を読む

Smol AI News·2026年5月29日 14:44·約16分で読める

今日は何も大きな出来事はありませんでした

#LLM #Claude #Anthropic #Agent #Benchmark

TL;DR

AI深層分析2026年5月30日 01:01

重要/ 5段階

深度40%

キーポイント

Claude Opus 4.8 の性能評価とトレードオフ

プラットフォーム機能とコスト構造

実用性における品質向上

影響分析・編集コメントを表示

影響分析

編集コメント

静かな一日。

AI Twitter リキャップ

Claude Opus 4.8 の展開、ベンチマークの摩擦、API の使いやすさ

Opus 4.8 はノイズが多く評価が分かれる状況に投入されました。複数の独立したベンチマークは「改善はあるが支配的ではない」という結論に収束しました。@arena は Opus 4.8 を以前の Opus バリアント、Gemini、GLM と比較するフロントエンド/コードテストを200件以上実施し、@theo は CursorBench で 4.7 より効率的だが誤差の範囲内でわずかに劣ると報告しました。@jerryjliu0 と @llama_index は表やレイアウトで小さな改善が見られる一方、文書解析におけるコンテンツの忠実度やチャートで後退が確認されたと指摘しました。@scaling01 は ALE-Bench での進展はないとし、別途 LisanBench で興味深い失敗モードを指摘しました。肯定的な側面として、@jeremyphoward はコーディングにおいて 4.8 が 4.7 や GPT-5.5 よりも過剰に自律的ではなく、より協調的であると発見しました。また @leo_linsky は、これは以前の Anthropic のリリースに対する実質的な製品改善であると評価しています。

Anthropic もまた、有用なプラットフォームレベルの変更をリリースしました：@ClaudeDevs は会話中にシステム指示を発表し、プロンプトキャッシュを破綻させることなく、権威ある会話中のシステムロール更新を実現しました。これは長時間実行されるエージェントセッションやコスト管理において重要です。しかし、価格設定は依然として大きな不満の種となっています：@jeremyphoward は、Anthropic が API の手頃さに対してほとんど手を打っていないと主張し、GPT-5.5 を部分的に好むのは、サブスクリプション/API の経済性がより正当化しやすいからです。全体的な教訓：4.8 は、実用的な利用においてはベンチマークの完全なリセットではなく、意味のある生活の質を向上させるリリースのように見えます。

エージェント・ハーネス、多ターン RL バグ、そして自律性を取り巻くインフラストラクチャ

微妙だが重要な強化学習（RL: Reinforcement Learning）の失敗モードが指摘されました：@ClementDelangue は、多くのツール使用型・多ターン RL 学習ループが静かに破綻している理由を解説する Hugging Face の詳細分析を紹介しました。核心的なバグは、モデル出力のデコード、ツール呼び出しのパース、そして更新された会話の再トークン化を行うことで、トークン化が変化し、勾配がモデルが実際にサンプリングしたシーケンスに対して適用されてしまう点です。提案される解決策は、厳格な「Token-In, Token-Out」ルールです：サンプリングされたトークンを再エンコードしてはいけません。ターン全体で単一のトークンバッファを維持します。@johnschulman2 は、レンダラーがメッセージとトークンの間の基盤となるインフラストラクチャであり、失敗モードには訓練/テストの不整合、キャッシングの非効率性、プロンプト注入リスクが含まれるというより広範な点を強調しました。

ハーネス設計は独自の最適化分野へと進化しています：@omarsar0 が Effective Feedback Compute (EFC) に関する研究を紹介し、生のトークン数やツール使用回数ではエージェントの成功を説明する能力が低いと主張しました。一方、EFC は R²値が最大 0.99 に達し、これはハーネスの品質が総活動量よりも重要であることを示唆しています。これは @LangChain のような製品化されたチューニング取り組みとも一致しており、Deep Agents v0.6 では Qwen/Kimi/DeepSeek といったモデルから、最先端 API と比較して 20 倍以上のコスト削減で強力なパフォーマンスを得るために、ハーネスプロファイルをファーストクラスとして扱っています。また @hwchase17 は「異なるモデルには異なるプロンプトやツールが必要である」と明確に位置付けています。@vllm_project はネイティブのウェイト同期 API を実装し、非同期 RL における一時停止/再開機能を改善しました。その後、fastokens（Rust ベースの BPE トークナイザー）を追加し、長文コンテキストやエージェントワークロードにおける CPU のトークン化ボトルネックを削減しました。

議論は「シングル vs マルチエージェント」から、「どこで抽象化が効果を発揮するか」へとシフトしています：@OfirPress は、現在のマルチエージェントシステムは主に速度向上のものであり、能力の解放ではないと主張しました。一方、@scaling01 はこれとは対照的な見解を持ち、群れのようなトレーニングスタイルがより優れた計画や超知能のような振る舞いを生み出すと期待しています。いずれにせよ、実用的なトレンドは明確です：より多くのチームがエージェントの観測可能性、トレース、継続的改善ループを中心に構築しており、例えば @Vtrivedy10 は生産環境からのトレースを SFT（Supervised Fine-Tuning）や蒸留、長期ホライゾンの継続学習のために活用する取り組みを行っています。

オープンモデル、ローカル AI、そして OSS ツールチェーンの強化

ローカルファーストおよびオープンウェイトの勢いは引き続き高まっています：@LangChain は、2026 年 4 月には AI チームの 3 分の 1 がオープンウェイトモデルを実行していると発表し、9 ヶ月前の 5 分の 1 から増加しました。一方、@EpochAIResearch の推計では、オープンウェイトモデルは最先端のプロプライエタリモデルに約 4 ヶ月遅れているとのことです。ツールチェーン側では、@ggerganov が llama.app を立ち上げ、llama.cpp に公式ウェブサイト、統一されたインストーラー、およびローカルデプロイとサードパーティ製エージェントの統合を容易にするための単一の llama エントリーポイントを提供しました。また、@ollama は Ollama を通じて OpenJarvis を発表し、これは Stanford/Hazy の「ワットあたりの知能（Intelligence Per Watt）」という枠組みに明示的に結びついたローカルファーストのパーソナル AI です。

オープンインフラストラクチャはよりエンタープライズ志向へと変化しています：@ClementDelangue は、Hugging Face 上のモデルとデータセットの約 50% が現在プライベート化されており、HF のストレージ/バケット提供に伴って増加していると指摘しました。これは、Hugging Face が単なるパブリック OSS インフラストラクチャであるという考えに対する重要な是正です。@abidlabs は、CPU サーバーレス GPU CI において GitHub Runners に代わって Hugging Face Jobs を使用している事例を示しました。また、@DSPyOSS、@dbreunig、および他の関係者は、4.0 のリリースに先駆けて DSPy のドキュメントとトップページを再設計して公開し、純粋なプロンプトエンジニアリングではなく、プログラム可能な AI システムへのオンボーディングに焦点を当てました。

ライセンスと許容性は戦略的なレバーになりつつあります：@kimmonismus は、NVIDIA が 4 つのオープンモデルファミリーを Linux Foundation OpenMDW-1.1 に移行し、重み・コード・ドキュメント・データにわたる法的な断片化を削減した点を指摘しました。新しい許容性のあるデータリリースも重要です：@keshigeyan は、視覚生成用の明示的な研究および商用利用が可能である 100M ペアの許容画像コーパスと 1M ペアのベンチマークを含む GPIC を紹介しました。

Google/OpenAI の製品展開拡大：管理型エージェント、Gemini Spark/Omni、Windows 上の Codex

Google は「管理型エージェント」のスタックを API から消費者向け製品へと広げています：@_philschmid は Gemini API における Managed Agents を紹介しました。これは単一の API 呼び出しで、コード実行・ウェブアクセス・ファイル入出力を備えたサンドボックス化された Linux 環境をプロビジョニングするものです。消費者向け側では、@GeminiApp が Gemini Spark を米国の AI Ultra サブスクライバーに向けて展開し、ユーザーの指示のもとデジタルエコシステム全体で動作できる 24 時間年中無休のパーソナルエージェントとして機能しています。Google はまた、Gemini Omni のマルチモーダル生成・編集デモ（例：製品スレッド）を継続して推進し、動画・映画制作におけるクリエイティブワークフロー向けの Google Flow Agent を発表しました（スレッド）。

OpenAI の Codex は、永続的なリモート開発オペレーターに近づいています：@OpenAI と @OpenAIDevs が Windows でのコンピュータ操作機能を追加し、ChatGPT モバイルアプリからの遠隔操縦も可能になりました。続く UX 改善には、バックグラウンドエージェント用の安定したアイデンティコン（identicons）や、過去のチャットコンテンツ全体を検索する機能が含まれています（@OpenAIDevs）。また、@reach_vb が Windows コントロール、モバイルによる遠隔アクセス、プロフィールおよびタスク統計に関する広範な Codex の更新を要約しました。別に OpenAI は、@michpokrass によると、迎合性（sycophancy）、事実の正確さ、多言語性能を向上させるために gpt-5.5 instant をアップデートしました。

これらはすべて、より垂直統合されたエージェントスタックへの道を示しています：モデル + ハーネス + サンクボックス + UI + 遠隔制御 + 価格/クォータ。Google は Gemini のクォータ調整をスムーズ化しており（@joshwoodward）、OpenAI は Codex の運用範囲を拡大し、Cursor はサブエージェントベースの承認ルーティングを備えた自動レビューモードを追加しました（ツイート）。共通するパターンは、「チャットボット」から離れ、ポリシーとメモリを持つ管理された実行環境へと移行している点です。

注目に値する研究およびシステム論文

検索、情報取得、および記憶：@TheTuringPost は、ハーバード大学と MIT が発表した双方向進化探索（Bidirectional Evolutionary Search: BES）を紹介しました。これは前方探索と後方分解、そして進化的演算子を組み合わせた手法です。報告された成果には、MuSiQue における Llama-3.2-3B-Instruct の性能が 4.0% から 7.0% に向上したことが含まれます。情報取得の分野では、@_reachsumit が Latent Terms を指摘し、凍結された密型検索器から SAE（Sparse Autoencoder）を用いて BM25 対応のスパース特徴量を抽出できることを示しました。また、@topk_io はより効率的な後期相互作用推論を実現する Iso-ModernColBERT をオープンソース化しました。

継続学習と信念・状態管理：@HuggingPapers は BeliefTrack を要約し、最適化された信念状態管理により長期の推論失敗が 70% 以上削減されると主張しました。@AndrewLampinen は、継続学習分野が干渉に過度に焦点を当てており、正の転移（positive transfer）への注目が不足していると指摘しました。一方、@victor207755822 は、自己反復と継続学習（CL）に焦点を当てた DeliAutoResearch の SKILL 論文の第 2 報を発表しました。

マルチモーダル/世界モデル/ロボティクス：NVIDIA に所属する研究者らの成果として、γ-World が挙げられます。これは 1 秒あたり 24 フレームでストリーミングされる生成型マルチエージェント世界モデルです（ツイート）。また、minWM はリアルタイム対話型ビデオ世界モデルのフレームワークです（ツイート）。ロボティクス分野では、@_akhaliq が Qwen-VLA を共有し、@inventorOli は Robostral の言語追従能力と操作性能の向上をデモしました。常時オンで能動的に動作するエージェントについては、@dair_ai が LLM の起動判断を 220MiB の時系列グラフエンコーダに置き換えた研究を紹介し、平均 F1 スコアが +16.7 向上するとともに、実行速度が 4〜83 倍高速化されたことを示しました。

エンゲージメント上位のツイート

OpenAI / バイオロジー：@OpenAI は、公衆衛生と生物防衛のための信頼できるアクセス用バイオツールを備えた「Rosalind Biodefense」を発表しました。

Google / コンシューマーエージェント：@GeminiApp が、米国の AI Ultra ユーザー向けに常時稼働型のパーソナルエージェント「Spark」をリリースしました。

OpenAI / 開発者ツール：@OpenAI は Codex の Windows サポートを発表し、@OpenAIDevs はコンピューター操作機能を Windows およびモバイルの遠隔操縦へと拡張しました。

llama.cpp UX マイルストーン：@ggerganov が、ローカル AI 向けの統合インストーラーと CLI エントリーポイントを持つ「llama.app」をリリースしました。

HF / RL の正しさ：@ClementDelangue は、ツールを用いた多ターン強化学習（RL）における「Token-In, Token-Out」の警告を強調しました。

オープンとクローズドのタイミング格差：@EpochAIResearch によると、オープンウェイトモデルは現在、最先端モデルより約 4 ヶ月遅れていると推定されています。

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. ローカル LLM パフォーマンス：MoE リリース、量子化、VRAM 節約

StepFun 3.7 Flash (アクティビティ：637): StepFun は、総パラメータ数 196B、アクティブパラメータ数 11B、内蔵 ViT（Vision Transformer）1.8B を備えたマルチモーダル MoE（Mixture of Experts）である Step 3.7 Flash をリリースしました。これは最大 400 TPS の高スループットエージェントワークフロー向けに宣伝されており、約 128GB の RAM でローカル実行可能と報告されています。報告されたベンチマークでは、フラッシュクラス/ローカルモデルとしては異例の強さを示しており、SWE-Bench Pro は 56.26%、DeepSearchQA F1 は 92.82%、HLE w/tools（ツール付き）は 47.2 です。また、Terminal-Bench、Toolathlon、ClawEval、およびその他のエージェント/ツール使用タスクにおいて、Step 3.5 Flash から大幅な向上が見られます。直接モデルアーティファクトは Hugging Face で BF16、FP8、NVFP4、GGUF の形式で利用可能であり、llama.cpp への day-0 サポート PR および関連する MTP（Multi-Token Prediction）作業が llama.cpp#23274 にあります。コメント投稿者たちはこのモデルを技術的に奇妙であると特徴付けています：その隠れ状態や思考の痕跡はほぼ無意味である一方、最終回答は「完璧」であり、1TB を超えるはるかに大きなモデルと競合するほどです。あるユーザーは、以前の Step 3.5 の「無限思考」の問題が修正されたようだと述べています。ローカル展開については慎重な期待感が広がっており、特に 4x3090 クラスのハードウェアを備えたユーザーにとって歓迎すべきことです。また、StepFun がフォークのみを維持するのではなく、llama.cpp へのサポートをアップストリームに提供した点も評価されています。

StepFun のアップストリーム PR (ggml-org/llama.cpp#23845) により、day-0 の llama.cpp 対応が実現されました。これは Step 3.5 のフォークベースのサポートとは対照的です。MTP サポートのための別のコミュニティ PR は ggml-org/llama.cpp#23274 に存在しますが、コメント投稿者らはこれが Step 3.7 と現在の master ブランチに合わせて更新が必要であると指摘しています。

NVFP4 チェックポイントの vLLM ナイトリーテストでは、2x Pro 6k 環境で 64 の並列な浅文脈リクエストを実行した結果、約 2200 tok/s を達成しました。報告された設定は、tensor-parallel-size 2、--enable-expert-parallel、--quantization modelopt、--kv-cache-dtype fp8、--reasoning-parser step3p5、および StepFun のツール呼び出しパーサーを使用するものでした。vLLM は GPU KV キャッシュサイズが 1,667,645 トークン、262,144 トークン/リクエストあたりの最大並列度が 6.36 倍であると報告しました。

原文を表示

a quiet day.

AI News for 5/28/2026-5/29/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Claude Opus 4.8 Rollout, Benchmark Friction, and API Ergonomics

Opus 4.8 landed into a noisy, mixed eval landscape: multiple independent benches converged on “incremental but not dominant.” @arena pushed 200+ frontend/code tests comparing Opus 4.8 against prior Opus variants, Gemini, and GLM; @theo reported CursorBench shows it as more efficient but slightly worse than 4.7 within margin of error; @jerryjliu0 and @llama_index found small gains on tables/layout but regressions on content faithfulness/charts in document parsing; @scaling01 said no progress on ALE-Bench and separately flagged interesting failure modes on LisanBench. On the positive side, @jeremyphoward found 4.8 less over-agentic and more cooperative than 4.7/GPT-5.5 in coding, while @leo_linsky called it a tangible product improvement over prior Anthropic releases.

Anthropic also shipped useful platform-level changes: @ClaudeDevs announced mid-conversation system instructions without breaking prompt cache, plus authoritative mid-conversation system-role updates, which matters for long-running agent sessions and cost control. But pricing remains a major complaint: @jeremyphoward argued Anthropic has done little for API affordability, preferring GPT-5.5 partly because subscription/API economics are easier to justify. Overall takeaway: 4.8 looks like a meaningful quality-of-life release for real use, not a clean benchmark reset.

Agent Harnesses, Multi-Turn RL Bugs, and the Infrastructure Around Autonomy

A subtle but important RL failure mode got called out: @ClementDelangue highlighted a Hugging Face deep-dive on why many tool-using, multi-turn RL training loops are silently broken. The core bug: decoding model output, parsing tool calls, then re-tokenizing the updated conversation can change tokenization, so gradients are applied to sequences the model never actually sampled. The proposed fix is a strict “Token-In, Token-Out” rule: never re-encode sampled tokens; keep a single token buffer across turns. @johnschulman2 reinforced the broader point that renderers are foundational infrastructure between messages and tokens, with failure modes spanning train/test mismatch, caching inefficiency, and prompt injection risk.
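The failure mode and the Token-In, Token-Out rule described above can be sketched with a toy example. Everything here is hypothetical and for illustration only: the three-entry vocabulary, the greedy longest-match encoder standing in for a real BPE tokenizer, and the TokenBuffer class with its trainable mask are assumptions, not code from the Hugging Face write-up.

```python
# Toy illustration only: a hypothetical three-entry vocabulary and a greedy
# longest-match encoder stand in for a real BPE tokenizer.
VOCAB = {"a": 0, "b": 1, "ab": 2}
ID_TO_TEXT = {i: t for t, i in VOCAB.items()}


def decode(ids):
    """Concatenate the text of each token id."""
    return "".join(ID_TO_TEXT[i] for i in ids)


def encode(text):
    """Greedy longest-match encoding over VOCAB."""
    ids, pos = [], 0
    while pos < len(text):
        for length in (2, 1):  # try the longest entry first
            piece = text[pos:pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot encode {text[pos:]!r}")
    return ids


class TokenBuffer:
    """Single token buffer kept across turns; sampled ids are never re-encoded."""

    def __init__(self, prompt_ids):
        self.ids = list(prompt_ids)
        # Assumed convention: only sampled tokens receive gradients.
        self.trainable = [False] * len(self.ids)

    def append_sampled(self, sampled_ids):
        # Store exactly what the model sampled.
        self.ids.extend(sampled_ids)
        self.trainable.extend([True] * len(sampled_ids))

    def append_tool_result(self, text):
        # New text is encoded once and appended; earlier ids are untouched.
        new_ids = encode(text)
        self.ids.extend(new_ids)
        self.trainable.extend([False] * len(new_ids))


sampled = [0, 1]                     # the model sampled "a" then "b"
roundtrip = encode(decode(sampled))  # decoding then re-encoding merges them

buffer = TokenBuffer(prompt_ids=encode("b"))
buffer.append_sampled(sampled)
buffer.append_tool_result("ab")

print("sampled:", sampled, "re-encoded:", roundtrip)
print("buffer ids:", buffer.ids)
print("trainable mask:", buffer.trainable)
```

In this toy setup the round trip turns two sampled tokens into one different token, which is the kind of sequence the model never actually produced. The buffer keeps the sampled ids as they were and only encodes text that is new to the conversation.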

Harness design is becoming its own optimization discipline: @omarsar0 surfaced work on Effective Feedback Compute (EFC), claiming raw token/tool counts explain agent success poorly while EFC reaches R² up to 0.99, implying harness quality matters more than gross activity. This lines up with productized tuning efforts like @LangChain, where Deep Agents v0.6 makes harness profiles first-class to get strong performance from Qwen/Kimi/DeepSeek at 20x+ lower cost than frontier APIs, and @hwchase17 explicitly framing “different models need different prompts/tools.” @vllm_project shipped native weight syncing APIs and improved pause/resume for async RL, and later added fastokens, a Rust BPE tokenizer to reduce CPU tokenization bottlenecks in long-context/agentic workloads.

Debate is shifting from “single vs multi-agent” to where the abstraction pays: @OfirPress argued current multi-agent systems are mostly speedups, not capability unlocks; @scaling01 took the opposite view, expecting swarm-style training to yield better planning and superintelligence-like behavior. Either way, the practical trend is clear: more teams are building around agent observability, traces, and continual improvement loops, e.g. @Vtrivedy10 on mining production traces for SFT/distillation and long-horizon continual learning.

Open Models, Local AI, and the OSS Toolchain Tightening Up

Local-first and open-weight momentum continues to rise: @LangChain said 1 in 3 AI teams ran an open-weights model in April 2026, up from 1 in 5 nine months earlier; @EpochAIResearch estimated open-weight models now lag frontier proprietary models by about four months. On the toolchain side, @ggerganov launched llama.app, giving llama.cpp an official website, a unified installer, and a single llama entrypoint aimed at easier local deployment and third-party agent integration. @ollama announced OpenJarvis as a local-first personal AI via Ollama, explicitly tied to Stanford/Hazy’s “Intelligence Per Watt” framing.

Open infrastructure is getting more enterprise-shaped: @ClementDelangue noted that ~50% of models and datasets on Hugging Face are now private, rising with HF’s storage/buckets offering; this is an important correction to the idea that HF is only public OSS infrastructure. @abidlabs showed Hugging Face Jobs replacing GitHub runners for CPU/serverless GPU CI. @DSPyOSS, @dbreunig, and others shipped a redesigned DSPy docs/front page ahead of a coming 4.0, focused on onboarding into programmable AI systems rather than pure prompting.

Licensing and permissiveness are becoming strategic levers: @kimmonismus highlighted NVIDIA moving its four open model families to Linux Foundation OpenMDW-1.1, reducing legal fragmentation across weights/code/docs/data. New permissive data releases also matter: @keshigeyan introduced GPIC, a 100M-pair permissive image corpus plus 1M-pair benchmark for visual generation, with explicit research + commercial usability.

Google/OpenAI Product Surface Expands: Managed Agents, Gemini Spark/Omni, and Codex on Windows

Google is widening the “managed agent” stack from API to consumer product: @_philschmid showed Managed Agents in the Gemini API: a single API call provisioning a sandboxed Linux environment with code execution, web access, and file I/O. On the consumer side, @GeminiApp rolled out Gemini Spark to U.S. AI Ultra subscribers as a 24/7 personal agent that can operate across a user’s digital ecosystem under direction. Google also kept pushing Gemini Omni multimodal generation/editing demos and announced Google Flow Agent for creative workflows in video/film production.

OpenAI’s Codex is moving closer to a persistent remote dev operator: @OpenAI and @OpenAIDevs added computer use on Windows, including remote steering from the ChatGPT mobile app. Follow-on UX improvements included stable identicons for background agents and search across prior chat content (@OpenAIDevs); @reach_vb summarized broader Codex updates around Windows control, mobile remote access, and profile/task stats. Separately, OpenAI updated gpt-5.5 instant to reduce sycophancy and improve factuality and multilingual performance, per @michpokrass.

This all points to more vertically integrated agent stacks: model + harness + sandbox + UI + remote control + pricing/quotas. Google is smoothing quotas on Gemini (@joshwoodward); OpenAI is expanding Codex’s operating surface; Cursor added auto-review mode with subagent-based approval routing. The common pattern is less “chatbot,” more managed execution environment with policy and memory.

Research and Systems Papers Worth Attention

Search, retrieval, and memory: @TheTuringPost highlighted Bidirectional Evolutionary Search (BES) from Harvard/MIT, combining forward search with backward decomposition and evolutionary operators; reported gains include Llama-3.2-3B-Instruct on MuSiQue from 4.0% to 7.0%. In retrieval, @_reachsumit pointed to Latent Terms, showing sparse BM25-ready features can be extracted from frozen dense retrievers via SAEs. @topk_io open-sourced Iso-ModernColBERT for more efficient late-interaction inference.

Continual learning and belief/state management: @HuggingPapers summarized BeliefTrack, claiming optimized belief-state management cuts long-horizon reasoning failures by 70%+. @AndrewLampinen argued the continual learning field over-focused on interference instead of positive transfer; @victor207755822 presented a second DeliAutoResearch SKILL paper focused on self-iteration and CL.

Multimodal/world models/robotics: NVIDIA-affiliated work included γ-World, a generative multi-agent world model streaming at 24 FPS, and minWM, a real-time interactive video world model framework. In robotics, @_akhaliq shared Qwen-VLA, and @inventorOli demoed Robostral’s language-following and manipulation improvements. For always-on proactive agents, @dair_ai surfaced work replacing LLM wake-up decisions with a 220MiB temporal-graph encoder, gaining +16.7 mean F1 while running 4–83x faster.

Top tweets (by engagement)

OpenAI / biology: @OpenAI on Rosalind Biodefense announced trusted-access biology tooling for public health and biodefense.

Google / consumer agents: @GeminiApp on Spark rolled out its always-on personal agent to AI Ultra users in the U.S.

OpenAI / dev tools: @OpenAI on Codex Windows support and @OpenAIDevs expanded computer use to Windows plus mobile remote steering.

llama.cpp UX milestone: @ggerganov launched llama.app with a unified installer and CLI entrypoint for local AI.

HF / RL correctness: @ClementDelangue amplified the Token-In, Token-Out warning for multi-turn RL with tools.

Open vs closed timing gap: @EpochAIResearch estimated open-weight models are now about 4 months behind the frontier.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Local LLM Performance: MoE Releases, Quants, VRAM Savings

StepFun 3.7 Flash (Activity: 637): StepFun released Step 3.7 Flash, a multimodal MoE with 196B total parameters, 11B active, and a built-in 1.8B ViT, advertised for high-throughput agent workflows up to 400 TPS and reportedly runnable locally with ~128GB RAM. Reported benchmarks position it unusually strongly for a flash-class/local model: SWE-Bench Pro 56.26%, DeepSearchQA F1 92.82%, HLE w/tools 47.2, plus large gains over Step 3.5 Flash on Terminal-Bench, Toolathlon, ClawEval, and other agentic/tool-use tasks. Direct model artifacts are available on Hugging Face in BF16, FP8, NVFP4, and GGUF, with day-0 llama.cpp support PR and related MTP work in llama.cpp#23274. Commenters characterize the model as technically odd: its hidden/thinking traces are described as nearly incoherent, but final answers can be “perfect” and competitive with much larger >1TB models; one user says the prior Step 3.5 “infinite thinking” issue appears fixed. There is cautious enthusiasm around local deployment, especially for users with 4x3090-class hardware, and appreciation that StepFun upstreamed llama.cpp support instead of only maintaining a fork.
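As a rough sanity check on the local-deployment claim, the weight footprint can be estimated from the parameter counts quoted above. This is a back-of-the-envelope sketch: the nominal bit widths are assumptions, the "4-bit" row is a generic stand-in for formats such as NVFP4 or a 4-bit GGUF quant, and quantization scales, KV cache, activations, and runtime overhead are all ignored.

```python
TOTAL_PARAMS = 196e9   # total parameters, from the post
ACTIVE_PARAMS = 11e9   # active parameters per token, from the post

# Assumed nominal bit widths; real checkpoints also carry scales and
# may keep some layers at higher precision.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "4-bit": 4}


def weight_gb(params, bits):
    """Weight-only footprint in decimal gigabytes."""
    return params * bits / 8 / 1e9


footprints = {name: weight_gb(TOTAL_PARAMS, bits)
              for name, bits in BITS_PER_WEIGHT.items()}
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

for name, gb in footprints.items():
    print(f"{name}: about {gb:.0f} GB of weights")
print(f"active fraction per token: {active_fraction:.1%}")
```

Under these assumptions only the roughly 4-bit case fits below 128 GB, which is at least consistent with the ~128GB RAM figure. The post does not say which format that figure refers to.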

There is day-0 llama.cpp enablement via StepFun’s upstream PR: ggml-org/llama.cpp#23845, contrasting with Step 3.5’s fork-based support. A separate community PR for MTP support exists at ggml-org/llama.cpp#23274, though commenters note it needs updating for Step 3.7 and current master.

A vLLM nightly test of the NVFP4 checkpoint on 2x Pro 6k with 64 concurrent shallow-context requests reached about 2200 tok/s. The reported config used --tensor-parallel-size 2, --enable-expert-parallel, --quantization modelopt, --kv-cache-dtype fp8, --reasoning-parser step3p5, and StepFun tool-call parsing; vLLM reported a GPU KV cache size of 1,667,645 tokens and a max concurrency of 6.36x at 262,144 tokens per request.
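The reported numbers can be cross-checked with simple arithmetic. This sketch assumes the 6.36x figure is the KV cache size divided by the per-request context length, and that 2200 tok/s is the aggregate across all 64 requests. Neither assumption is confirmed by the source, and the variable names are illustrative.

```python
KV_CACHE_TOKENS = 1_667_645        # GPU KV cache size reported by vLLM
MAX_TOKENS_PER_REQUEST = 262_144   # context length per request
CONCURRENT_REQUESTS = 64           # concurrency used in the test
AGGREGATE_TOK_PER_S = 2200         # approximate; assumed to be the total

# How many full-length requests fit in the KV cache at once.
max_concurrency = KV_CACHE_TOKENS / MAX_TOKENS_PER_REQUEST

# Average decode speed seen by each request if the total is shared evenly.
per_request_tok_s = AGGREGATE_TOK_PER_S / CONCURRENT_REQUESTS

# KV cache tokens available to each request at 64-way concurrency.
kv_budget_per_request = KV_CACHE_TOKENS // CONCURRENT_REQUESTS

print(f"max concurrency at full context: {max_concurrency:.2f}x")
print(f"per-request throughput: {per_request_tok_s:.1f} tok/s")
print(f"KV budget per request at 64-way concurrency: {kv_budget_per_request} tokens")
```

The division reproduces the 6.36x figure. It also suggests why the test used shallow contexts: 64 requests can only stay resident together if each uses far less than the full 262,144-token window.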

