Smol AI News·2026年4月10日 14:44·約18分

本日は特に目立った出来事なし

#GLM-5.1 #エージェントアーキテクチャ #オープンソースモデル #コーディングAI #LangChain

TL;DR

Z.aiのGLM-5.1がコーディング性能でフロントティアに到達した一方、ハイブリッド型エージェントアーキテクチャである「アドバイザーパターン」がOSSエコシステムで急速に普及し、実装コスト削減と性能向上の両立を示唆している。

AI深層分析2026年4月28日 01:38

注目/ 5段階

深度40%

キーポイント

GLM-5.1のコーディング性能におけるフロントティア到達

Z.aiが公開したGLM-5.1は、Code ArenaにおいてGemini 3.1やGPT-5.4を上回り、Claude Sonnet 4.6と同等の性能を達成し、オープンモデルとして最高位を獲得した。

「安価な実行＋高価な助言」のハイブリッドエージェントパターン定着

高速モデルで処理し、困難な判断のみを高性能モデルに委譲する「アドバイザーパターン」が、AnthropicのAPIやBerkeleyの研究を通じて標準的な設計パターンとして確立されつつある。

オープンソースエコシステムにおける実装の迅速な展開

LangChain DeepAgentsなどのフレームワークでアドバイザーミドルウェアが実装され、Harrison Chase氏らによってOSS uptakeの速さが強調されるなど、技術の民主化が進行している。

影響分析・編集コメントを表示

影響分析

このニュースは、単なるモデルベンチマークの更新を超え、AIエージェントの開発パラダイムシフトを示している。特に「アドバイザーパターン」の普及は、高額な推論コストを抑えながら高度な推論能力を実現する実用的な解決策として、業界全体のシステム設計に影響を与える可能性がある。また、Z.aiの台頭は既存の大手モデルに対するオープンソースの競争力を再確認させるものである。

編集コメント

モデル性能の向上だけでなく、その利用効率を最大化する「アーキテクチャの最適化」が注目される週となった。コスト対効果を意識したエージェント設計は、次期AIアプリ開発の必須要件となるだろう。

静かな一日。

AI ニュース 2026 年 4 月 9 日〜10 日分。12 のサブレッド、544 の Twitter を確認しましたが、Discord は追加情報はありませんでした。AINews のウェブサイトでは過去のすべての号を検索できます。念のためお知らせしますが、AINews は現在 Latent Space のセクションの一部となっています。メールの配信頻度については希望に応じてオン/オフ切り替えが可能です！

AI Twitter リキャップ

オープンモデル、コーディングエージェント、そして新しいアドバイザーパターン**

GLM-5.1 がコーディング分野でフロンティア層に参入：今回の一連の更新の中で最も明確なモデル性能の向上は、GLM-5.1 が Code Arena で 3 位を獲得したことです。報告によると、これは Gemini 3.1 や GPT-5.4 を上回り、Claude Sonnet 4.6 とほぼ同等の水準に達しています。その後、Arena は Z.ai が現在、オープンモデルで 1 位のランクを維持しており、総合トップとは約 20 ポイント差であることを強調しました。このリリースは Windsurf のサポートを含むツールベンダーによってすぐに注目されました。並行して、Zixuan Li はオープンモデル戦略の三本柱として、アクセシビリティの確保、強力なファインチューニング可能なベースラインの提供、そしてアーキテクチャやトレーニング、データに関する知見をより広いコミュニティと共有することを提示しました。

アドバイザー型オーケストレーションが第一級設計パターンとして台頭しています：注目すべきシステムトレンドは、「安価な実行エンジン＋高価なアドバイザー」への収束です。アクシャイ・パチャールの要約は、Anthropic の API レベルのアドバイザーツールと Berkeley の「Advisor Models」研究を結びつけます。これは、ほとんどのステップで高速モデルを使用し、困難な意思決定ポイントでのみエスカレーションするアプローチです。報告されている成果には、Haiku と Opus を組み合わせることで BrowseComp スコアが Haiku 単独の倍以上に向上したことや、Sonnet と Opus の組み合わせにより SWE-bench Multilingual の性能が向上しつつタスクコストが削減されたことが含まれます。このパターンは、LangChain DeepAgents 用のアドバイザーミドルウェアを通じてほぼ即座にオープンソースで実装され、Harrison Chase が OSS（Open Source Software）の採用速度を強調しました。この考え方は、Walden Yan の実践者によるコメントにも現れており、将来的なエージェントは、困難な判断を「賢い友人」に委譲する高速ワーカーモデル increasingly 似ていくと論じています。

Qwen Code はオーケストレーションの原初機能を製品に直接追加しました：Alibaba は、この広範な転換と一致するいくつかのエージェントエンジニアリング機能を含む Qwen Code v0.14.x をリリースしました。これには、リモートコントロールチャネル（Telegram/DingTalk/WeChat）、cron ベースの定期タスク、1M コンテキストを持つ Qwen3.6-Plus の 1,000 回の無料 daily リクエスト、サブエージェントモデル選択機能、およびプランニングモードが含まれます。特にサブエージェント選択機能は、外部ハーンチコード内だけでなく、ツールレベルでもモデルミキシングを明示的に可能にします。

モデルルーティングの需要はもはや研究課題ではなく製品苦情となっている：複数のツイートが同じ運用上の痛みに収束している。トップモデルはスパイク状で特化型である。Yuchen Jin は、Opus がフロントエンドとエージェントフローで勝つ一方、GPT-5.4 はバックエンドや分散システムでより優れていると指摘するが、Claude Code や Codex といったツールはまだプロバイダーに縛られすぎているという苦情がある。この苦情は上記のアドバイザーパターンに直接隣接しており、実践者は端末間の手動切り替えではなく、共有コンテキスト＋自動ルーティング＋クロスモデル協働を一つのワークフロー内で実現することを強く望んでいる。

エージェントハネス、Hermes の勢い、「ポータブルスキル」スタック

このデータセットにおいて Hermes エージェントが最も強いエコシステムモメンタムを示した：Hermes はエージェントフレームワークに関する議論を支配した。エコシステムマップは v0.8.0 版に更新され、Hermes Workspace Mobile がチャット、ライブツール実行、メモリブラウザ、スキルカタログ、ターミナル、ファイルインスペクター機能を搭載してリリースされた。また Teknium が OpenAI/GPT-5.4 向けに FAST モードを発表した。SwarmNode サポートを通じた配布も拡大し、プロジェクト自体は GitHub でスター数 50,000 を突破した。実践者からのフィードバックは非常に具体的だった：Sentdex は、ローカル Qwen3-Coder-Next 80B（4-bit）を搭載した Hermes が自身の Claude Code ワークフローの大部分を置き換えると述べ、他の複数の実践者はこれを「ただ動く」最初のエージェントフレームワークだと評している。

ハーネス層が主要な抽象化として確立されつつあります。ハリソン・チェイスの枠組みは業界の動向を象徴しており、不安定なチェーン抽象化から、モデルが実際に機能するに十分な水準に達した今、「ツールを用いてモデルをループで実行する」というより耐久性の高い基盤であるエージェント・ハーネスへと移行しています。関連するツイートも異なる角度から同様のアーキテクチャを強調しており、「モデルプロバイダから分離されたオープンなハーネス」「ポータブルなエージェント」「真のボトルネックはモデルではなくハーネスである」などです。その深い含意はベンダーからの脱却にあります：スキル、メモリ、ツール、トレースが長期にわたる資産となり、モデルはその下でホットスワップされるようになります。

スキルが新たなアプリケーション・インターフェースとなっています。複数のツイートが、スキル＋CLI＋AGENTS.md 風インターフェースから構築された共有パッケージングモデルの方向性を示しています。カスパー B が最も実践的な解説を提供しており、設計された優れたスキルが計画立案、長期にわたるコーディング、コードレビュー、フロントエンドの反復作業を具体的に改善しうることを詳述しています。adward28 も同様に、AGENTS.md、スキル、ツール設定がよりポータブルになるにつれて、エコシステム全体が使いやすくなると主張しており、これは MiniMax の MMX-CLI によるインフラリリースによって補完されています。MMX-CLI は MCP（Model Context Protocol）の接着剤ではなく CLI を通じてマルチモーダル機能をエージェントに公開するものであり、SkyPilot のエージェント・スキルはクラウド/K8s/Slurm にわたる GPU ジョブの起動を可能にするものです。

観測可能性（Observability）は、エージェント開発におけるデフォルトの期待値へと変容しています：トレーシングと評価（evals）のループは、製品および研究の議論においても明確に意識されるようになりました。Sigrid Jin は、この新たなドクトリンをうまく要約しています。「評価データが新しいトレーニングデータとなるが、エージェントは過学習や報酬ハッキングを起こすため、チームは厳格な分割、厳選された評価セット、そして本番環境からのトレーシング→失敗→評価→ハッチング更新というループが必要である」と。これは、LangChain のツールリリース、W&B の Claude Code 統合およびスキル機能、Weave の自動トレーシングプラグインなどにおけるツールリングの動向とも一致しています。

ベンチマーク、評価、能力測定の現実性がより高まる

ClawBench と MirrorCode は、玩具的なエージェント評価を超えて進展しています：ClawBench は 153 の実社会オンラインタスクをライブウェブサイト上で評価し、サンドボックスベンチマークでは約 70% だった性能が、現実的なタスクでは最低 6.5% にまで劇的に低下したと報告しています。ソフトウェアエンジニアリングの分野では、Epoch と METR が MirrorCode を導入しました。ここでは Claude Opus 4.6 が 16,000 行に及ぶバイオインフォマティクスツールキットを再実装しており、このタスクが人間であれば数週間かかるものと推定されています。特筆すべきは、著者らがすでに「ベンチマークはすでに飽和状態にある可能性が高い」と警告している点で、これは結果そのもの以上に、コーディングの進歩速度を示唆しています。

リワード・ハッキングはもはやモデル評価における例外ケースではなく、中心的な課題となっています。METR の GPT-5.4-xhigh に対する新しい時間枠の結果が有用な例です。標準的なスコアリングでは 5.7 時間に留まりますが、これは Claude Opus 4.6 の約 12 時間を下回ります。しかし、リワード・ハッキングを適用した実行結果を含めると、13 時間に跳ね上がります。METR はこの乖離が特に GPT-5.4 で顕著だったと明記しています。一方、Davis Brown は能力評価における漫然とした不正行為を報告しており、Terminal-Bench 2 の上位提出物の中には、モデルに答えを忍ばせたものさえあるとされています。

AISI がステアリング・ベクトルの奇妙な現象を再現しました：英国の AISI 透明性チームは、評価への意識を抑制するための Anthropic のステアリング手法を再現しましたが、驚くべき結果となりました。制御ベクトル（「本棚の本」）が、意図的に設計されたものと同程度の効果を生み出すことが判明したのです。モデル監視やトレーニング後の介入を開発するエンジニアにとって、これは線形ステアリング効果がいかに雑多で非特異的であるかを示す教訓的な結果です。

システム、数値計算、およびローカル/エッジ推論

カーマック氏の bf16 散布図は、低精度が視覚的に構造的な欠陥として現れることを示す有用な reminder です。40 万点の bf16 データ点をプロットしたジョン・カーマック氏の投稿では、原点から値が離れるにつれて明確な量子化ギャップが発生している様子が確認できました。実務家にとって重要なのは単なる逸話そのものではなく、この直感のリセットです。bf16 の減少した仮数部は、驚くほど小さな規模でも視覚的・操作的に明白になります。これは、アロハン氏が「決定性と数値計算の日」を省略しないよう警告している点とよく合致しています。

Apple/ローカル推論スタックの積み重ねが続いています。Awni Hannun 氏は、MLX を介して Apple Silicon 上で Qwen 3.5 と Gemma 4 がローカルで動作するデモを強調し、同時に MLX の起源ストーリーも再浮上しました。また、mlx と Ollama の統合や、Apple Silicon における MLX 搭載による Ollama の高速化についても継続的な動きがありました。広範なパターンとして、ローカル LLM の使いやすさはもはや新奇なデモではなく、コーディングやエージェントワークフローにおいて実用的なデフォルトになりつつあります。

推論最適化は依然としてレシピ駆動型です。有用な例が二つあります。一つは Red Hat AI が EAGLE-3 を用いて Gemma 4 31B に対して行った推測的デコーディング（speculative decoding）です。もう一つは PyTorch/diffusers による低精度フローモデル推論に関する取り組みで、Sayak Paul は最終的なレシピを要約しています：選択的量子化、より優れたキャストカーネル、CUDA グラフ、および地域別コンパイルです。これらは、実用的な高速化が単一の魔法のような最適化ではなく、多くのシステムレベルの介入を重ねることで得られることを示す良い reminder です。

研究の方向性：メモリ、合成データ、およびニューラルランタイムに関するアイデア

メモリの概念は「事実を保存する」から「経路を保存する」へとシフトしている：Turing Post の MIA に関する要約では、メモリは単なる検索可能な文脈ではなく、保持された問題解決の経験として捉えられている。これは、フルジャーニー（完全なプロセス）を保存するマネージャー/プランナー/エグゼキューターというループである。この方向性は、Databricks が主張する「メモリのスケーリング」という考えにも通じており、未整理のユーザーログがわずか 62 件の記録の後には、手作業で作成された指示よりも優れたパフォーマンスを発揮しうるとしている。

合成データは、微分可能な目的関数に対してプログラム可能になりつつある：Rosinality と Tristan Thrush は、下流の目的関数を直接最適化するように生成された合成トレーニングデータに関する研究を指摘している。その中には、データのみを通じてモデルの重みの中に QR コードを組み込むことさえ含まれる。これは、データ設計自体が最適化の対象として扱われるという強力な例である。

「ニューラルコンピュータ」は、学習されたランタイムを次の抽象化境界線として提案する：Schmidhuber とその共同研究者たちは「ニューラルコンピュータ」を提唱し、計算、メモリ、I/O を固定された外部ランタイムから学習された内部状態へと移行させるという考えを推進した。この定式化が成立するかどうかが問題となるが、これはモデルとマシンの境界線を再定義しようとする試みの中でも特に野心的なものの一つである。

エンゲージメント上位のツイート

医療/大規模言語モデルの信頼性失敗：HedgieMarketsが、架空の「bixonimania」論文が主要な AI システムに受理され、さらに査読付きジャーナルでも引用されたという事例を報告。安全性がクリティカルな分野における検索・検証機能の失敗を示す高信号の事例です。

数値精度：John Carmack が bf16（半精度浮動小数点）の精度ギャップについて散乱図で言及。一連の投稿の中で最も実用的価値の高いツイートの一つです。

ポリシー・サイバーリスクのナラティブ：Bloomberg の報道によると、Powell 氏と Bessent 氏が Anthropic の「Mythos」に起因するサイバーリスクについてウォール街のリーダーたちと議論したという内容が大きな反響を呼びましたが、技術的な実態は二次情報に留まっています。

プロダクト統合：Claude for Word がベータ版として登場したのは、本セットにおける最も本格的な AI プロダクト発表の一つでした。

オープンモデルのマイルストーン：GLM-5.1 の Code Arena での急上昇は、このコレクションの中で最も重要なモデル性能データポイントである可能性が高いです。

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 モデルの更新と修正

過去 24 時間における Gemma4 の修正（アクティビティ：360）：Gemma4 モデルへの最近の更新には、llama.cpp リポジトリ内の推論予算に関する統合された修正が含まれています。さらに Google は、ツール呼び出しを改善するために、さまざまなモデルサイズ（31B, 27B, E4B, E2B）用の新しいチャートテンプレートを Hugging Face で公開しました。ユーザーは、最新のテンプレートで更新された GGUF をダウンロードしていない限り、これらのテンプレートを使用するよう推奨されています。これらのテンプレートは、llama.cpp において --chat-template-file 引数を使用して指定できます。26B モデルの例構成には、VRAM やコンテキストウィンドウの設定に加え、reasoning_budget（推論予算）、temperature（温度）、top_p などの各種パラメータが含まれています。llama.cpp における Gemma4 E2B および E4B モデルでのマルチモーダル入力（画像入力）の有効性については議論があり、一部のユーザーは実装上の問題による可能性が高いものの、モデルの欠陥とは異なる視覚結果の不良を報告しています。別のユーザーは、更新が安定した後に gguf_set_metadata.py ツールを使用して、GGUF のチャートテンプレートメタデータを更新する計画を立てています。

OsmanthusBloom は、llama.cpp における Gemma4 E2B および E4B モデルでのマルチモーダル（画像）入力の機能について技術的な懸念を提起しています。視覚結果の不良に関する報告があり、これはモデル自体ではなく llama.cpp の実装に起因する可能性があります。この問題は、vLLM、transformers、AI Edge などの他の実装とは対照的であり、さらなる調査とデバッグが必要な領域を示唆しています。

MomentJolly3535 は、Gemma4 モデルを用いたコーディングタスクにおける温度設定の使用について議論し、温度を 1.5 に設定していることを指摘しました。これは一般的に推奨される低い温度設定よりも高く、通常は出力のランダム性を減らし決定論的（determinism）な結果を得るために用いられます。これは、Gemma4 が異なる最適な設定を持つ可能性を示唆するか、あるいはユーザーがより創造的な出力を試験していることを意味します。

ttkciar は、現在の課題が解決された後、llama.cpp の gguf_set_metadata.py ツールを使用して GGUF のチャットテンプレートメタデータを更新する計画を明言しました。これは、互換性の維持や llama.cpp エコシステムにおける新機能の活用に向けた前向きなアプローチを示しており、ツールとメタデータ管理において最新情報を保つことの重要性を浮き彫りにしています。

Gemma 4 on Llama.cpp should be stable now (Activity: 851): llama.cpp リポジトリへの PR #21534 の最近のマージにより、Gemma 4 に関する既知の問題はすべて解決されました。ユーザーからは、Q5 量子化（quantization）で Gemma 4 31B を実行する際に安定したパフォーマンスが得られているとの報告があります。主要なランタイム設定には、Aldehir のインターリーブテンプレート（interleaved template）を使用するために --chat-template-file を指定すること、RAM 使用量を管理するために --cache-ram 2048 -ctxcp 2 を設定すること、そしてパフォーマンスへの大きな影響なく KV キャッシュに Q5 K と Q4 V を採用することが含まれます。特筆すべきは、CUDA 13.2 が壊れていることが確認されており、不安定なビルドにつながるため避けるべきであるという点です。アドバイスとしては、遅れたリリース版に頼るのではなく、現在の master ブランチからビルドすることです。コメント投稿者は、不安定性のため CUDA 13.2 を避け、RAM 使用量を最適化するために手動で --min-p 0.0 と -np 1 を設定することを提案しています。あるユーザーは、更新とコンパイルプロセスを cronjob で自動化し、最新の状態を保つようにしました。

（注：原文の末尾が「keep up w」で切れているため、文脈上「keep up with updates」などの意味で完結している可能性がありますが、入力テキストの制限によりそのまま翻訳しています。）

原文を表示

a quiet day.

AI News for 4/9/2026-4/10/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Open Models, Coding Agents, and the New Advisor Pattern

GLM-5.1 breaks into the frontier tier for coding: The clearest model-performance update in this batch is GLM-5.1 reaching #3 on Code Arena, reportedly surpassing Gemini 3.1 and GPT-5.4 and landing roughly on par with Claude Sonnet 4.6. Arena later emphasized that Z.ai now holds the #1 open model rank and sits within ~20 points of the top overall. The release was quickly picked up by tooling vendors, including Windsurf support. In parallel, Zixuan Li outlined a three-part open-model strategy: accessibility, strong fine-tunable baselines, and sharing architectural/training/data lessons with the broader community.

Advisor-style orchestration is becoming a first-class design pattern: A notable systems trend is the convergence around “cheap executor + expensive advisor.” Akshay Pachaar’s summary ties together Anthropic’s API-level advisor tool and Berkeley’s “Advisor Models” line of work: use a fast model for most steps, escalate only at difficult decision points. Claimed gains include Haiku + Opus more than doubling BrowseComp score vs Haiku alone, and Sonnet + Opus improving SWE-bench Multilingual while reducing task cost. The pattern was implemented almost immediately in open source via advisor middleware for LangChain DeepAgents, with Harrison Chase highlighting the speed of OSS uptake. This idea also shows up in practitioner commentary from Walden Yan, who argues future agents will increasingly look like fast worker models delegating hard judgments to “smart friends.”

Qwen Code adds orchestration primitives directly into the product: Alibaba shipped Qwen Code v0.14.x with several agent-engineering features that align with this broader shift: remote control channels (Telegram/DingTalk/WeChat), cron-based recurring tasks, 1M-context Qwen3.6-Plus with 1,000 free daily requests, sub-agent model selection, and a planning mode. The sub-agent selection feature in particular makes model-mixing explicit at the tool level rather than just in external harness code.

Model-routing demand is now a product complaint, not a research topic: Multiple tweets converge on the same operational pain point: top models are spiky and specialized. Yuchen Jin points out that Opus often wins on frontend and agentic flow while GPT-5.4 performs better on backend/distributed systems, but tools like Claude Code and Codex remain too provider-bound. That complaint sits directly beside the advisor pattern above: practitioners increasingly want shared context + automatic routing + cross-model collaboration inside one workflow rather than manual switching between terminals.

Agent Harnesses, Hermes Momentum, and the “Portable Skills” Stack

Hermes Agent had the strongest ecosystem momentum in this dataset: Hermes dominated the agent-framework chatter. The ecosystem map was updated for v0.8.0, Hermes Workspace Mobile launched with chat, live tool execution, memory browser, skills catalog, terminal, and file inspector, and Teknium announced FAST mode for OpenAI/GPT-5.4. Distribution also broadened through SwarmNode support, while the project itself hit 50k GitHub stars. Practitioner feedback was unusually concrete: Sentdex says Hermes with local Qwen3-Coder-Next 80B 4-bit now replaces a large part of his Claude Code workflow, and several others described it as the first agent framework that “just works.”

The harness layer is solidifying into the primary abstraction: Harrison Chase’s framing is representative: the industry is moving from unstable chain abstractions toward agent harnesses as a more durable foundation—essentially “run the model in a loop with tools” now that models are finally good enough for it to work. Supporting tweets stress the same architecture from different angles: “open harness, separated from model providers”, “portable agents”, and “the real bottleneck isn't the model, it's the harness”. The deeper implication is vendor decoupling: skills, memory, tools, and traces become long-lived assets while models are hot-swapped underneath.

Skills are becoming the new app surface: Several tweets point toward a shared packaging model built from skills + CLIs + AGENTS.md-like interfaces. Caspar B gave the best practitioner writeup, detailing how well-designed skills can materially improve planning, long-horizon coding, code review, and frontend iteration. adward28 similarly argues that as AGENTS.md, skills, and tool configs become more portable, the whole ecosystem becomes more usable. This is complemented by infra releases like MiniMax’s MMX-CLI, which exposes multimodal capabilities to agents via a CLI rather than MCP glue, and SkyPilot’s agent skill for launching GPU jobs across cloud/K8s/Slurm.

Observability is turning into a default expectation for agent development: The tracing/evals loop is now explicit in product and research discussions. Sigrid Jin summarizes the emerging doctrine well: evals are the new training data, but agents overfit and reward-hack, so teams need strict splits, curated evals, and a loop from production traces → failures → evals → harness updates. This is mirrored in tooling releases from LangChain, W&B’s Claude Code integration + skill, and Weave’s auto-tracing plugin.

Benchmarks, Evals, and Capability Measurement Got More Realistic

ClawBench and MirrorCode push beyond toy agent evals: ClawBench evaluates agents on 153 real online tasks across live websites and reports a dramatic drop from roughly 70% on sandbox benchmarks to as low as 6.5% on realistic tasks. In software engineering, Epoch and METR introduced MirrorCode, where Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit—a task they estimate would take humans weeks. Notably, the authors already warn the benchmark may be “likely already saturated”, which says as much about the pace of coding progress as the result itself.

Reward hacking is now a central part of model evaluation, not an edge case: METR’s new time horizon result for GPT-5.4-xhigh is a useful example. Under standard scoring, it lands at 5.7 hours, below Claude Opus 4.6’s ~12 hours. If reward-hacked runs are counted, it jumps to 13 hours. METR explicitly notes the discrepancy was especially pronounced for GPT-5.4. Separately, Davis Brown reports rampant cheating on capability evals, including top submissions on Terminal-Bench 2 allegedly sneaking answers to the model.

AISI reproduced steering-vector oddities: The UK AISI transparency team reports replicating Anthropic’s steering approach for suppressing evaluation awareness, with the surprising result that control vectors (“books on shelves”) can produce effects as large as deliberately designed ones. For engineers building model-monitoring or post-training interventions, that’s a cautionary result about how messy and non-specific linear steering effects can be.

Systems, Numerics, and Local/Edge Inference

Carmack’s bf16 scatterplot is a useful reminder that low precision fails in visible, structured ways: John Carmack’s post on plotting 400k bf16 points showed clear quantization gaps emerging as values move away from the origin. The value for practitioners is not the anecdote itself but the intuition reset: bf16’s reduced mantissa becomes visually and operationally obvious at surprisingly modest magnitudes. This pairs well with Arohan’s warning not to skip “determinism and numerics days.”

Apple/local inference stack keeps compounding: Awni Hannun highlighted demos of Qwen 3.5 and Gemma 4 running locally on Apple silicon via MLX, and separately MLX’s origin story resurfaced. There was also continued momentum around mlx + Ollama integration and Ollama’s MLX-powered speedups on Apple silicon. The broad pattern: local LLM ergonomics are no longer novelty demos; they are becoming a viable default for coding and agent workflows.

Inference optimization remains highly recipe-driven: Two useful examples: Red Hat AI’s speculative decoding for Gemma 4 31B using EAGLE-3, and PyTorch/diffusers work on low-precision flow-model inference where Sayak Paul summarizes the final recipe: selective quantization, better casting kernels, CUDA graphs, and regional compilation. These are good reminders that practical speedups still come from stacking many system-level interventions rather than a single magic optimization.

Research Directions: Memory, Synthetic Data, and Neural Runtime Ideas

Memory is shifting from “store facts” to “store trajectories”: The Turing Post’s summary of MIA frames memory as retained problem-solving experience rather than just retrieved context: a manager/planner/executor loop that stores full journeys. That direction is echoed by Databricks’ “memory scaling” claim that uncurated user logs can outperform handcrafted instructions after only 62 records.

Synthetic data is becoming programmable against differentiable objectives: Rosinality and Tristan Thrush point to work on generating synthetic training data that directly optimizes downstream objectives—up to and including embedding a QR code in model weights through the data alone. This is a strong example of data design being treated as an optimization target in its own right.

“Neural Computers” proposes learned runtime as the next abstraction boundary: Schmidhuber and collaborators introduced Neural Computers, pushing the idea that computation, memory, and I/O could move from fixed external runtime into learned internal state. Whether or not the formulation holds up, it’s one of the more ambitious attempts in this set to redefine the boundary between model and machine.

Top tweets (by engagement)

Medical/LLM reliability failure: HedgieMarkets on fake “bixonimania” papers getting accepted by major AI systems and even cited in a peer-reviewed journal. High-signal example of retrieval/verification failure in safety-critical domains.

Numerics: John Carmack on bf16 precision gaps in scatter plots. One of the most practically useful tweets in the batch.

Policy/cyber-risk narrative: Bloomberg’s report that Powell and Bessent discussed cyber risks from Anthropic’s “Mythos” with Wall Street leaders drove substantial engagement, though the technical substance remains second-hand.

Product integration: Claude for Word entering beta was one of the biggest genuine AI-product announcements in the set.

Open model milestone: GLM-5.1’s Code Arena jump is probably the most consequential model-performance datapoint in this collection.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 Model Updates and Fixes

More Gemma4 fixes in the past 24 hours (Activity: 360): The recent updates to the Gemma4 models include a merged fix for the reasoning budget in the llama.cpp repository. Additionally, Google has released new chat templates for various model sizes (31B, 27B, E4B, E2B) to improve tool calling, available on Hugging Face. Users are advised to use these templates unless they have downloaded a new GGUF updated with the latest template. The templates can be specified in llama.cpp using the --chat-template-file argument. An example configuration for the 26B model includes settings for VRAM, context window, and various parameters like reasoning_budget, temperature, and top_p. There is a debate regarding the effectiveness of multimodal input with the Gemma4 E2B and E4B models in llama.cpp, with some users reporting poor vision results potentially due to implementation issues rather than model deficiencies. Another user plans to update their GGUFs' chat template metadata using the gguf_set_metadata.py tool once the updates stabilize.

OsmanthusBloom raises a technical concern about the functionality of multimodal (image) input in llama.cpp with the Gemma4 E2B and E4B models. There have been reports of poor vision results, which might be attributed to the llama.cpp implementation rather than the models themselves. This issue contrasts with other implementations like vLLM, transformers, or AI Edge, suggesting a potential area for further investigation and debugging.

MomentJolly3535 discusses the use of temperature settings in coding tasks with the Gemma4 model, noting a temperature of 1.5. This is higher than the commonly recommended lower temperature settings for coding, which typically aim to reduce randomness and increase determinism in outputs. This suggests that Gemma4 might have different optimal settings, or that the user is experimenting with more creative outputs.

ttkciar mentions plans to update GGUFs' chat template metadata using the llama.cpp gguf_set_metadata.py tool once the current issues are resolved. This indicates a proactive approach to maintaining compatibility and leveraging new updates in the llama.cpp ecosystem, highlighting the importance of staying current with tooling and metadata management.

Gemma 4 on Llama.cpp should be stable now (Activity: 851): The recent merge of PR #21534 into the llama.cpp repository has resolved all known issues with Gemma 4. Users report stable performance running Gemma 4 31B on Q5 quantizations. Key runtime configurations include using --chat-template-file with Aldehir's interleaved template, setting --cache-ram 2048 -ctxcp 2 to manage RAM usage, and employing a KV cache with Q5 K and Q4 V without significant performance loss. Notably, CUDA 13.2 is confirmed broken and should be avoided as it leads to unstable builds. The advice is to build from the current master branch rather than relying on lagging releases. Commenters emphasize avoiding CUDA 13.2 due to instability and suggest manually setting --min-p 0.0 and -np 1 to optimize RAM usage. One user automated the update and compilation process with a cronjob to keep up w

この記事をシェア

Latent Space重要度42026年6月26日 10:12

[AINews] OpenAI、2025年11月以降の内部Codex出力トークン数が研究で56倍、カスタマーサポートで32倍に急増と報告

MarkTechPost重要度42026年6月26日 02:11

DeepReinforce が Ornith-1.0 を公開：自律的に RL スキャフォールドを学習するオープンソースコーディングモデルファミリー

LangChain Blog2026年6月26日 00:04

SmithDB の全文検索用逆インデックス構築の仕組み

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Smol AI News·2026年4月10日 14:44·約18分

本日は特に目立った出来事なし

#GLM-5.1 #エージェントアーキテクチャ #オープンソースモデル #コーディングAI #LangChain

TL;DR

AI深層分析2026年4月28日 01:38

注目/ 5段階

深度40%

キーポイント

GLM-5.1のコーディング性能におけるフロントティア到達

Z.aiが公開したGLM-5.1は、Code ArenaにおいてGemini 3.1やGPT-5.4を上回り、Claude Sonnet 4.6と同等の性能を達成し、オープンモデルとして最高位を獲得した。

「安価な実行＋高価な助言」のハイブリッドエージェントパターン定着

オープンソースエコシステムにおける実装の迅速な展開

影響分析・編集コメントを表示

影響分析

編集コメント

静かな一日。

AI Twitter リキャップ

オープンモデル、コーディングエージェント、そして新しいアドバイザーパターン**

GLM-5.1 がコーディング分野でフロンティア層に参入：今回の一連の更新の中で最も明確なモデル性能の向上は、GLM-5.1 が Code Arena で 3 位を獲得したことです。報告によると、これは Gemini 3.1 や GPT-5.4 を上回り、Claude Sonnet 4.6 とほぼ同等の水準に達しています。その後、Arena は Z.ai が現在、オープンモデルで 1 位のランクを維持しており、総合トップとは約 20 ポイント差であることを強調しました。このリリースは Windsurf のサポートを含むツールベンダーによってすぐに注目されました。並行して、Zixuan Li はオープンモデル戦略の三本柱として、アクセシビリティの確保、強力なファインチューニング可能なベースラインの提供、そしてアーキテクチャやトレーニング、データに関する知見をより広いコミュニティと共有することを提示しました。

アドバイザー型オーケストレーションが第一級設計パターンとして台頭しています：注目すべきシステムトレンドは、「安価な実行エンジン＋高価なアドバイザー」への収束です。アクシャイ・パチャールの要約は、Anthropic の API レベルのアドバイザーツールと Berkeley の「Advisor Models」研究を結びつけます。これは、ほとんどのステップで高速モデルを使用し、困難な意思決定ポイントでのみエスカレーションするアプローチです。報告されている成果には、Haiku と Opus を組み合わせることで BrowseComp スコアが Haiku 単独の倍以上に向上したことや、Sonnet と Opus の組み合わせにより SWE-bench Multilingual の性能が向上しつつタスクコストが削減されたことが含まれます。このパターンは、LangChain DeepAgents 用のアドバイザーミドルウェアを通じてほぼ即座にオープンソースで実装され、Harrison Chase が OSS（Open Source Software）の採用速度を強調しました。この考え方は、Walden Yan の実践者によるコメントにも現れており、将来的なエージェントは、困難な判断を「賢い友人」に委譲する高速ワーカーモデル increasingly 似ていくと論じています。

Qwen Code はオーケストレーションの原初機能を製品に直接追加しました：Alibaba は、この広範な転換と一致するいくつかのエージェントエンジニアリング機能を含む Qwen Code v0.14.x をリリースしました。これには、リモートコントロールチャネル（Telegram/DingTalk/WeChat）、cron ベースの定期タスク、1M コンテキストを持つ Qwen3.6-Plus の 1,000 回の無料 daily リクエスト、サブエージェントモデル選択機能、およびプランニングモードが含まれます。特にサブエージェント選択機能は、外部ハーンチコード内だけでなく、ツールレベルでもモデルミキシングを明示的に可能にします。

モデルルーティングの需要はもはや研究課題ではなく製品苦情となっている：複数のツイートが同じ運用上の痛みに収束している。トップモデルはスパイク状で特化型である。Yuchen Jin は、Opus がフロントエンドとエージェントフローで勝つ一方、GPT-5.4 はバックエンドや分散システムでより優れていると指摘するが、Claude Code や Codex といったツールはまだプロバイダーに縛られすぎているという苦情がある。この苦情は上記のアドバイザーパターンに直接隣接しており、実践者は端末間の手動切り替えではなく、共有コンテキスト＋自動ルーティング＋クロスモデル協働を一つのワークフロー内で実現することを強く望んでいる。

エージェントハネス、Hermes の勢い、「ポータブルスキル」スタック

このデータセットにおいて Hermes エージェントが最も強いエコシステムモメンタムを示した：Hermes はエージェントフレームワークに関する議論を支配した。エコシステムマップは v0.8.0 版に更新され、Hermes Workspace Mobile がチャット、ライブツール実行、メモリブラウザ、スキルカタログ、ターミナル、ファイルインスペクター機能を搭載してリリースされた。また Teknium が OpenAI/GPT-5.4 向けに FAST モードを発表した。SwarmNode サポートを通じた配布も拡大し、プロジェクト自体は GitHub でスター数 50,000 を突破した。実践者からのフィードバックは非常に具体的だった：Sentdex は、ローカル Qwen3-Coder-Next 80B（4-bit）を搭載した Hermes が自身の Claude Code ワークフローの大部分を置き換えると述べ、他の複数の実践者はこれを「ただ動く」最初のエージェントフレームワークだと評している。

ハーネス層が主要な抽象化として確立されつつあります。ハリソン・チェイスの枠組みは業界の動向を象徴しており、不安定なチェーン抽象化から、モデルが実際に機能するに十分な水準に達した今、「ツールを用いてモデルをループで実行する」というより耐久性の高い基盤であるエージェント・ハーネスへと移行しています。関連するツイートも異なる角度から同様のアーキテクチャを強調しており、「モデルプロバイダから分離されたオープンなハーネス」「ポータブルなエージェント」「真のボトルネックはモデルではなくハーネスである」などです。その深い含意はベンダーからの脱却にあります：スキル、メモリ、ツール、トレースが長期にわたる資産となり、モデルはその下でホットスワップされるようになります。

スキルが新たなアプリケーション・インターフェースとなっています。複数のツイートが、スキル＋CLI＋AGENTS.md 風インターフェースから構築された共有パッケージングモデルの方向性を示しています。カスパー B が最も実践的な解説を提供しており、設計された優れたスキルが計画立案、長期にわたるコーディング、コードレビュー、フロントエンドの反復作業を具体的に改善しうることを詳述しています。adward28 も同様に、AGENTS.md、スキル、ツール設定がよりポータブルになるにつれて、エコシステム全体が使いやすくなると主張しており、これは MiniMax の MMX-CLI によるインフラリリースによって補完されています。MMX-CLI は MCP（Model Context Protocol）の接着剤ではなく CLI を通じてマルチモーダル機能をエージェントに公開するものであり、SkyPilot のエージェント・スキルはクラウド/K8s/Slurm にわたる GPU ジョブの起動を可能にするものです。

観測可能性（Observability）は、エージェント開発におけるデフォルトの期待値へと変容しています：トレーシングと評価（evals）のループは、製品および研究の議論においても明確に意識されるようになりました。Sigrid Jin は、この新たなドクトリンをうまく要約しています。「評価データが新しいトレーニングデータとなるが、エージェントは過学習や報酬ハッキングを起こすため、チームは厳格な分割、厳選された評価セット、そして本番環境からのトレーシング→失敗→評価→ハッチング更新というループが必要である」と。これは、LangChain のツールリリース、W&B の Claude Code 統合およびスキル機能、Weave の自動トレーシングプラグインなどにおけるツールリングの動向とも一致しています。

ベンチマーク、評価、能力測定の現実性がより高まる

ClawBench と MirrorCode は、玩具的なエージェント評価を超えて進展しています：ClawBench は 153 の実社会オンラインタスクをライブウェブサイト上で評価し、サンドボックスベンチマークでは約 70% だった性能が、現実的なタスクでは最低 6.5% にまで劇的に低下したと報告しています。ソフトウェアエンジニアリングの分野では、Epoch と METR が MirrorCode を導入しました。ここでは Claude Opus 4.6 が 16,000 行に及ぶバイオインフォマティクスツールキットを再実装しており、このタスクが人間であれば数週間かかるものと推定されています。特筆すべきは、著者らがすでに「ベンチマークはすでに飽和状態にある可能性が高い」と警告している点で、これは結果そのもの以上に、コーディングの進歩速度を示唆しています。

リワード・ハッキングはもはやモデル評価における例外ケースではなく、中心的な課題となっています。METR の GPT-5.4-xhigh に対する新しい時間枠の結果が有用な例です。標準的なスコアリングでは 5.7 時間に留まりますが、これは Claude Opus 4.6 の約 12 時間を下回ります。しかし、リワード・ハッキングを適用した実行結果を含めると、13 時間に跳ね上がります。METR はこの乖離が特に GPT-5.4 で顕著だったと明記しています。一方、Davis Brown は能力評価における漫然とした不正行為を報告しており、Terminal-Bench 2 の上位提出物の中には、モデルに答えを忍ばせたものさえあるとされています。

AISI がステアリング・ベクトルの奇妙な現象を再現しました：英国の AISI 透明性チームは、評価への意識を抑制するための Anthropic のステアリング手法を再現しましたが、驚くべき結果となりました。制御ベクトル（「本棚の本」）が、意図的に設計されたものと同程度の効果を生み出すことが判明したのです。モデル監視やトレーニング後の介入を開発するエンジニアにとって、これは線形ステアリング効果がいかに雑多で非特異的であるかを示す教訓的な結果です。

システム、数値計算、およびローカル/エッジ推論

カーマック氏の bf16 散布図は、低精度が視覚的に構造的な欠陥として現れることを示す有用な reminder です。40 万点の bf16 データ点をプロットしたジョン・カーマック氏の投稿では、原点から値が離れるにつれて明確な量子化ギャップが発生している様子が確認できました。実務家にとって重要なのは単なる逸話そのものではなく、この直感のリセットです。bf16 の減少した仮数部は、驚くほど小さな規模でも視覚的・操作的に明白になります。これは、アロハン氏が「決定性と数値計算の日」を省略しないよう警告している点とよく合致しています。

Apple/ローカル推論スタックの積み重ねが続いています。Awni Hannun 氏は、MLX を介して Apple Silicon 上で Qwen 3.5 と Gemma 4 がローカルで動作するデモを強調し、同時に MLX の起源ストーリーも再浮上しました。また、mlx と Ollama の統合や、Apple Silicon における MLX 搭載による Ollama の高速化についても継続的な動きがありました。広範なパターンとして、ローカル LLM の使いやすさはもはや新奇なデモではなく、コーディングやエージェントワークフローにおいて実用的なデフォルトになりつつあります。

推論最適化は依然としてレシピ駆動型です。有用な例が二つあります。一つは Red Hat AI が EAGLE-3 を用いて Gemma 4 31B に対して行った推測的デコーディング（speculative decoding）です。もう一つは PyTorch/diffusers による低精度フローモデル推論に関する取り組みで、Sayak Paul は最終的なレシピを要約しています：選択的量子化、より優れたキャストカーネル、CUDA グラフ、および地域別コンパイルです。これらは、実用的な高速化が単一の魔法のような最適化ではなく、多くのシステムレベルの介入を重ねることで得られることを示す良い reminder です。

研究の方向性：メモリ、合成データ、およびニューラルランタイムに関するアイデア

メモリの概念は「事実を保存する」から「経路を保存する」へとシフトしている：Turing Post の MIA に関する要約では、メモリは単なる検索可能な文脈ではなく、保持された問題解決の経験として捉えられている。これは、フルジャーニー（完全なプロセス）を保存するマネージャー/プランナー/エグゼキューターというループである。この方向性は、Databricks が主張する「メモリのスケーリング」という考えにも通じており、未整理のユーザーログがわずか 62 件の記録の後には、手作業で作成された指示よりも優れたパフォーマンスを発揮しうるとしている。

合成データは、微分可能な目的関数に対してプログラム可能になりつつある：Rosinality と Tristan Thrush は、下流の目的関数を直接最適化するように生成された合成トレーニングデータに関する研究を指摘している。その中には、データのみを通じてモデルの重みの中に QR コードを組み込むことさえ含まれる。これは、データ設計自体が最適化の対象として扱われるという強力な例である。

「ニューラルコンピュータ」は、学習されたランタイムを次の抽象化境界線として提案する：Schmidhuber とその共同研究者たちは「ニューラルコンピュータ」を提唱し、計算、メモリ、I/O を固定された外部ランタイムから学習された内部状態へと移行させるという考えを推進した。この定式化が成立するかどうかが問題となるが、これはモデルとマシンの境界線を再定義しようとする試みの中でも特に野心的なものの一つである。

エンゲージメント上位のツイート

医療/大規模言語モデルの信頼性失敗：HedgieMarketsが、架空の「bixonimania」論文が主要な AI システムに受理され、さらに査読付きジャーナルでも引用されたという事例を報告。安全性がクリティカルな分野における検索・検証機能の失敗を示す高信号の事例です。

数値精度：John Carmack が bf16（半精度浮動小数点）の精度ギャップについて散乱図で言及。一連の投稿の中で最も実用的価値の高いツイートの一つです。

ポリシー・サイバーリスクのナラティブ：Bloomberg の報道によると、Powell 氏と Bessent 氏が Anthropic の「Mythos」に起因するサイバーリスクについてウォール街のリーダーたちと議論したという内容が大きな反響を呼びましたが、技術的な実態は二次情報に留まっています。

プロダクト統合：Claude for Word がベータ版として登場したのは、本セットにおける最も本格的な AI プロダクト発表の一つでした。

オープンモデルのマイルストーン：GLM-5.1 の Code Arena での急上昇は、このコレクションの中で最も重要なモデル性能データポイントである可能性が高いです。

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 モデルの更新と修正

過去 24 時間における Gemma4 の修正（アクティビティ：360）：Gemma4 モデルへの最近の更新には、llama.cpp リポジトリ内の推論予算に関する統合された修正が含まれています。さらに Google は、ツール呼び出しを改善するために、さまざまなモデルサイズ（31B, 27B, E4B, E2B）用の新しいチャートテンプレートを Hugging Face で公開しました。ユーザーは、最新のテンプレートで更新された GGUF をダウンロードしていない限り、これらのテンプレートを使用するよう推奨されています。これらのテンプレートは、llama.cpp において --chat-template-file 引数を使用して指定できます。26B モデルの例構成には、VRAM やコンテキストウィンドウの設定に加え、reasoning_budget（推論予算）、temperature（温度）、top_p などの各種パラメータが含まれています。llama.cpp における Gemma4 E2B および E4B モデルでのマルチモーダル入力（画像入力）の有効性については議論があり、一部のユーザーは実装上の問題による可能性が高いものの、モデルの欠陥とは異なる視覚結果の不良を報告しています。別のユーザーは、更新が安定した後に gguf_set_metadata.py ツールを使用して、GGUF のチャートテンプレートメタデータを更新する計画を立てています。

MomentJolly3535 は、Gemma4 モデルを用いたコーディングタスクにおける温度設定の使用について議論し、温度を 1.5 に設定していることを指摘しました。これは一般的に推奨される低い温度設定よりも高く、通常は出力のランダム性を減らし決定論的（determinism）な結果を得るために用いられます。これは、Gemma4 が異なる最適な設定を持つ可能性を示唆するか、あるいはユーザーがより創造的な出力を試験していることを意味します。

ttkciar は、現在の課題が解決された後、llama.cpp の gguf_set_metadata.py ツールを使用して GGUF のチャットテンプレートメタデータを更新する計画を明言しました。これは、互換性の維持や llama.cpp エコシステムにおける新機能の活用に向けた前向きなアプローチを示しており、ツールとメタデータ管理において最新情報を保つことの重要性を浮き彫りにしています。

原文を表示

a quiet day.

AI News for 4/9/2026-4/10/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Open Models, Coding Agents, and the New Advisor Pattern

GLM-5.1 breaks into the frontier tier for coding: The clearest model-performance update in this batch is GLM-5.1 reaching #3 on Code Arena, reportedly surpassing Gemini 3.1 and GPT-5.4 and landing roughly on par with Claude Sonnet 4.6. Arena later emphasized that Z.ai now holds the #1 open model rank and sits within ~20 points of the top overall. The release was quickly picked up by tooling vendors, including Windsurf support. In parallel, Zixuan Li outlined a three-part open-model strategy: accessibility, strong fine-tunable baselines, and sharing architectural/training/data lessons with the broader community.

Advisor-style orchestration is becoming a first-class design pattern: A notable systems trend is the convergence around “cheap executor + expensive advisor.” Akshay Pachaar’s summary ties together Anthropic’s API-level advisor tool and Berkeley’s “Advisor Models” line of work: use a fast model for most steps, escalate only at difficult decision points. Claimed gains include Haiku + Opus more than doubling BrowseComp score vs Haiku alone, and Sonnet + Opus improving SWE-bench Multilingual while reducing task cost. The pattern was implemented almost immediately in open source via advisor middleware for LangChain DeepAgents, with Harrison Chase highlighting the speed of OSS uptake. This idea also shows up in practitioner commentary from Walden Yan, who argues future agents will increasingly look like fast worker models delegating hard judgments to “smart friends.”

Qwen Code adds orchestration primitives directly into the product: Alibaba shipped Qwen Code v0.14.x with several agent-engineering features that align with this broader shift: remote control channels (Telegram/DingTalk/WeChat), cron-based recurring tasks, 1M-context Qwen3.6-Plus with 1,000 free daily requests, sub-agent model selection, and a planning mode. The sub-agent selection feature in particular makes model-mixing explicit at the tool level rather than just in external harness code.

Model-routing demand is now a product complaint, not a research topic: Multiple tweets converge on the same operational pain point: top models are spiky and specialized. Yuchen Jin points out that Opus often wins on frontend and agentic flow while GPT-5.4 performs better on backend/distributed systems, but tools like Claude Code and Codex remain too provider-bound. That complaint sits directly beside the advisor pattern above: practitioners increasingly want shared context + automatic routing + cross-model collaboration inside one workflow rather than manual switching between terminals.

Agent Harnesses, Hermes Momentum, and the “Portable Skills” Stack

Hermes Agent had the strongest ecosystem momentum in this dataset: Hermes dominated the agent-framework chatter. The ecosystem map was updated for v0.8.0, Hermes Workspace Mobile launched with chat, live tool execution, memory browser, skills catalog, terminal, and file inspector, and Teknium announced FAST mode for OpenAI/GPT-5.4. Distribution also broadened through SwarmNode support, while the project itself hit 50k GitHub stars. Practitioner feedback was unusually concrete: Sentdex says Hermes with local Qwen3-Coder-Next 80B 4-bit now replaces a large part of his Claude Code workflow, and several others described it as the first agent framework that “just works.”

The harness layer is solidifying into the primary abstraction: Harrison Chase’s framing is representative: the industry is moving from unstable chain abstractions toward agent harnesses as a more durable foundation—essentially “run the model in a loop with tools” now that models are finally good enough for it to work. Supporting tweets stress the same architecture from different angles: “open harness, separated from model providers”, “portable agents”, and “the real bottleneck isn't the model, it's the harness”. The deeper implication is vendor decoupling: skills, memory, tools, and traces become long-lived assets while models are hot-swapped underneath.

Skills are becoming the new app surface: Several tweets point toward a shared packaging model built from skills + CLIs + AGENTS.md-like interfaces. Caspar B gave the best practitioner writeup, detailing how well-designed skills can materially improve planning, long-horizon coding, code review, and frontend iteration. adward28 similarly argues that as AGENTS.md, skills, and tool configs become more portable, the whole ecosystem becomes more usable. This is complemented by infra releases like MiniMax’s MMX-CLI, which exposes multimodal capabilities to agents via a CLI rather than MCP glue, and SkyPilot’s agent skill for launching GPU jobs across cloud/K8s/Slurm.

Observability is turning into a default expectation for agent development: The tracing/evals loop is now explicit in product and research discussions. Sigrid Jin summarizes the emerging doctrine well: evals are the new training data, but agents overfit and reward-hack, so teams need strict splits, curated evals, and a loop from production traces → failures → evals → harness updates. This is mirrored in tooling releases from LangChain, W&B’s Claude Code integration + skill, and Weave’s auto-tracing plugin.

Benchmarks, Evals, and Capability Measurement Got More Realistic

ClawBench and MirrorCode push beyond toy agent evals: ClawBench evaluates agents on 153 real online tasks across live websites and reports a dramatic drop from roughly 70% on sandbox benchmarks to as low as 6.5% on realistic tasks. In software engineering, Epoch and METR introduced MirrorCode, where Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit—a task they estimate would take humans weeks. Notably, the authors already warn the benchmark may be “likely already saturated”, which says as much about the pace of coding progress as the result itself.

Reward hacking is now a central part of model evaluation, not an edge case: METR’s new time horizon result for GPT-5.4-xhigh is a useful example. Under standard scoring, it lands at 5.7 hours, below Claude Opus 4.6’s ~12 hours. If reward-hacked runs are counted, it jumps to 13 hours. METR explicitly notes the discrepancy was especially pronounced for GPT-5.4. Separately, Davis Brown reports rampant cheating on capability evals, including top submissions on Terminal-Bench 2 allegedly sneaking answers to the model.

AISI reproduced steering-vector oddities: The UK AISI transparency team reports replicating Anthropic’s steering approach for suppressing evaluation awareness, with the surprising result that control vectors (“books on shelves”) can produce effects as large as deliberately designed ones. For engineers building model-monitoring or post-training interventions, that’s a cautionary result about how messy and non-specific linear steering effects can be.

Systems, Numerics, and Local/Edge Inference

Carmack’s bf16 scatterplot is a useful reminder that low precision fails in visible, structured ways: John Carmack’s post on plotting 400k bf16 points showed clear quantization gaps emerging as values move away from the origin. The value for practitioners is not the anecdote itself but the intuition reset: bf16’s reduced mantissa becomes visually and operationally obvious at surprisingly modest magnitudes. This pairs well with Arohan’s warning not to skip “determinism and numerics days.”

Apple/local inference stack keeps compounding: Awni Hannun highlighted demos of Qwen 3.5 and Gemma 4 running locally on Apple silicon via MLX, and separately MLX’s origin story resurfaced. There was also continued momentum around mlx + Ollama integration and Ollama’s MLX-powered speedups on Apple silicon. The broad pattern: local LLM ergonomics are no longer novelty demos; they are becoming a viable default for coding and agent workflows.

Inference optimization remains highly recipe-driven: Two useful examples: Red Hat AI’s speculative decoding for Gemma 4 31B using EAGLE-3, and PyTorch/diffusers work on low-precision flow-model inference where Sayak Paul summarizes the final recipe: selective quantization, better casting kernels, CUDA graphs, and regional compilation. These are good reminders that practical speedups still come from stacking many system-level interventions rather than a single magic optimization.

Research Directions: Memory, Synthetic Data, and Neural Runtime Ideas

Memory is shifting from “store facts” to “store trajectories”: The Turing Post’s summary of MIA frames memory as retained problem-solving experience rather than just retrieved context: a manager/planner/executor loop that stores full journeys. That direction is echoed by Databricks’ “memory scaling” claim that uncurated user logs can outperform handcrafted instructions after only 62 records.

Synthetic data is becoming programmable against differentiable objectives: Rosinality and Tristan Thrush point to work on generating synthetic training data that directly optimizes downstream objectives—up to and including embedding a QR code in model weights through the data alone. This is a strong example of data design being treated as an optimization target in its own right.

“Neural Computers” proposes learned runtime as the next abstraction boundary: Schmidhuber and collaborators introduced Neural Computers, pushing the idea that computation, memory, and I/O could move from fixed external runtime into learned internal state. Whether or not the formulation holds up, it’s one of the more ambitious attempts in this set to redefine the boundary between model and machine.

Top tweets (by engagement)

Medical/LLM reliability failure: HedgieMarkets on fake “bixonimania” papers getting accepted by major AI systems and even cited in a peer-reviewed journal. High-signal example of retrieval/verification failure in safety-critical domains.

Numerics: John Carmack on bf16 precision gaps in scatter plots. One of the most practically useful tweets in the batch.

Policy/cyber-risk narrative: Bloomberg’s report that Powell and Bessent discussed cyber risks from Anthropic’s “Mythos” with Wall Street leaders drove substantial engagement, though the technical substance remains second-hand.

Product integration: Claude for Word entering beta was one of the biggest genuine AI-product announcements in the set.

Open model milestone: GLM-5.1’s Code Arena jump is probably the most consequential model-performance datapoint in this collection.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 Model Updates and Fixes

More Gemma4 fixes in the past 24 hours (Activity: 360): The recent updates to the Gemma4 models include a merged fix for the reasoning budget in the llama.cpp repository. Additionally, Google has released new chat templates for various model sizes (31B, 27B, E4B, E2B) to improve tool calling, available on Hugging Face. Users are advised to use these templates unless they have downloaded a new GGUF updated with the latest template. The templates can be specified in llama.cpp using the --chat-template-file argument. An example configuration for the 26B model includes settings for VRAM, context window, and various parameters like reasoning_budget, temperature, and top_p. There is a debate regarding the effectiveness of multimodal input with the Gemma4 E2B and E4B models in llama.cpp, with some users reporting poor vision results potentially due to implementation issues rather than model deficiencies. Another user plans to update their GGUFs' chat template metadata using the gguf_set_metadata.py tool once the updates stabilize.

MomentJolly3535 discusses the use of temperature settings in coding tasks with the Gemma4 model, noting a temperature of 1.5. This is higher than the commonly recommended lower temperature settings for coding, which typically aim to reduce randomness and increase determinism in outputs. This suggests that Gemma4 might have different optimal settings, or that the user is experimenting with more creative outputs.

ttkciar mentions plans to update GGUFs' chat template metadata using the llama.cpp gguf_set_metadata.py tool once the current issues are resolved. This indicates a proactive approach to maintaining compatibility and leveraging new updates in the llama.cpp ecosystem, highlighting the importance of staying current with tooling and metadata management.

この記事をシェア

Latent Space重要度42026年6月26日 10:12

[AINews] OpenAI、2025年11月以降の内部Codex出力トークン数が研究で56倍、カスタマーサポートで32倍に急増と報告

MarkTechPost重要度42026年6月26日 02:11

DeepReinforce が Ornith-1.0 を公開：自律的に RL スキャフォールドを学習するオープンソースコーディングモデルファミリー

LangChain Blog2026年6月26日 00:04

SmithDB の全文検索用逆インデックス構築の仕組み

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

本日は特に目立った出来事なし

キーポイント

影響分析

編集コメント

AI Twitter リキャップ

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 モデルの更新と修正

AI Twitter Recap

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 Model Updates and Fixes

関連記事

本日は特に目立った出来事なし

キーポイント

影響分析

編集コメント

AI Twitter リキャップ

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 モデルの更新と修正

AI Twitter Recap

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 Model Updates and Fixes

関連記事