LangChain Blog·2026年5月4日 22:17·約13分

オープンモデルが新たな閾値を突破

#LLM #エージェント #オープンソースモデル #ツール呼び出し #GLM-5 #MiniMax

TL;DR

LangChain の評価により、GLM-5 や MiniMax M2.7 などのオープンモデルが、ファイル操作やツール使用といったコアエージェントタスクにおいてクローズドな最前線モデルと同等の性能を発揮することが確認された。

AI深層分析2026年7月5日 08:10

重要/ 5段階

深度40%

キーポイント

オープンモデルの実用性確立

Deep Agents ハンズオン評価により、GLM-5 や MiniMax M2.7 がファイル操作やツール呼び出しにおいてクローズドモデルと同等のスコアを記録し、実運用への代替候補として確立された。

コストとレイテンシの劇的改善

性能が同等であるにもかかわらず、オープンモデルはクローズドモデルと比較して大幅に低いコストと短いレイテンシを実現しており、開発者にとって魅力的な選択肢となっている。

信頼性の高いツール呼び出し

SWE-Rebench や Terminal Bench 2.0 などのベンチマークでも示唆されている通り、オープンモデルのツール呼び出し機能と指示従順性はすでに安定しており、生産環境での使用が可能である。

影響分析・編集コメントを表示

影響分析

このニュースは、AI エージェント開発における「オープン vs クローズド」の対立構造を再定義する重要な転換点です。高性能かつ低コストなオープンモデルが実用レベルに達したことで、企業や開発者はベンダーロックインのリスクを減らしつつ、スケーラビリティと経済性を両立させる戦略が可能になります。これにより、AI エージェントの普及スピードが加速し、より多様なユースケースでの採用が進むことが予想されます。

編集コメント

クローズドモデルの壁が崩れつつあり、開発現場ではもはや「オープンかクローズドか」ではなく、「タスクとコストに最適なモデルをどう組み合わせるか」という戦略的選択が問われる時代に入りました。

image

主要なポイント

TL;DR: GLM-5 や MiniMax M2.7 といったオープンモデルは、ファイル操作、ツール使用、指示の遵守といったコアエージェントタスクにおいて、クローズドな最前線モデルと同等の性能を発揮します。しかも、そのコストとレイテンシは大幅に低減されています。ここでは評価結果の詳細と、Deep Agents での活用方法について解説します。

ここ数週間にわたり、Deep Agents のハーンネス評価を通じてオープンウェイトの大型言語モデル（LLM）をテストしてきました。その初期結果から、これらはクローズドな最前線モデルに代わる選択肢として、あるいは併用する選択肢として十分に実用的であることが示されました。GLM-5 (z.ai) と MiniMax M2.7 は、ファイル操作、ツール使用、指示の遵守といったコアエージェントタスクにおいて、クローズドな最前線モデルと同等のスコアを記録しています。

SWE-Rebench や Terminal Bench 2.0 といった多数のオープンベンチマークを通じてオープンモデルの進展を追ってきた方々にとっては、これは驚くべきことではありません。ツール呼び出しは信頼性が高く、指示の遵守も一貫しています。生産環境でエージェントをデプロイする開発者にとって、オープンモデルはすでに、現実世界のワークフローをより実現可能にするレベルの一貫性と予測可能性を提供しています。

なぜオープンモデルなのか

オープンモデルを探求する際、ビルダーや顧客は主にいくつかの重要な要素に注目します：コスト、レイテンシ、そしてタスクパフォーマンスです。

理想的には、あらゆるタスクに対して最も賢いフロンティアモデルを最高レベルの推論能力で利用したいものです。しかし実際には、コストとレイテンシという 2 つの制約により、それは現実的ではありません。クローズドなフロンティアモデルは、高スループットワークロードにおいてオープンモデルに比べて 8〜10 倍も高額になりがちです。また、インタラクティブな製品でユーザーが期待する応答速度に対して、しばしば遅すぎます。

Model Type Input ($/M tokens) Output ($/M tokens)

Claude Opus 4.6 (Anthropic) Closed $5.00 $25.00

Claude Sonnet 4.6 (Anthropic) Closed $3.00 $15.00

GPT-5.4 (OpenAI) Closed $2.50 $15.00

GLM-5 (Baseten) Open $0.95 $3.15

MiniMax M2.7 (OpenRouter) Open $0.30 $1.20

*価格の文脈を補足すると：1 日に 1,000 万トークンを出力するアプリケーションの場合、Opus 4.6 では約 250 ドル/日かかるのに対し、MiniMax M2.7 では約 12 ドル/日で済みます。これは年間約 87,000 ドルの差になります。*

オープンモデルはクローズドのフロンティアモデルに比べて小型である傾向があり、Groq、Fireworks、Baseten といったプロバイダーが提供する専用推論インフラ上で高速化が可能です。これらのプロバイダーは、多くのチームが独自で達成できる水準を遥かに超えるレイテンシとスループットを実現するために最適化されています。OpenRouter のデータ [https://openrouter.ai/z-ai/glm-5/performance?ref=blog.langchain.com] によると、Baseten 上で動作する GLM-5 は平均レイテンシが 0.65 秒、スループットが 1 秒あたり 70 トークンであるのに対し、Claude Opus 4.6 はそれぞれ 2.56 秒、34 トークンです。レイテンシに敏感な製品においては、この差を工学的に埋めることは困難です。

評価方法について

Deep Agents 向けの評価（eval）構築方法については、How we build evals for Deep Agents で詳細に解説しています。私たちはホストされた推論プロバイダーを使用して評価を実行していますが、Ollama や vLLM 等を用いることで、Deep Agents は完全にローカルかつプライベートなモデル上でも実行可能です。

オープンモデルについては、ファイル操作、ツール使用、検索（retrieval）、会話、メモリ、要約、そして「ユニットテスト」の 7 つの評価カテゴリを実行しました。これらは、モデルが基本的な機能を果たせるかを検証するタスクを網羅しています。具体的には、モデルが確実にツールを呼び出せるか、構造化された指示に従えるか、ファイルを操作できるかなどです。これらの能力こそが、モデルがエージェントハネス（agentic harness）で実際に使用可能かどうかを決定するゲートキーパーとなります。

各評価ケースは、成功の主張（正しさを決定する厳格な失敗チェック）と効率性の主張（モデルがどのようにそこに到達したかを測定する緩やかなチェック）を定義しています。私たちは4つの指標を報告します:

正確性 — モデルが解決したテストの割合：合格数 / 総数。スコア0.68とは、テストケースの68%が正しく解決されたことを意味します。これは主要な品質シグナルです。
ソルブレート（Solve rate）— 精度と速度を組み合わせた測定値。各テストについて、期待されるステップ数を壁時計秒数で割って計算します（失敗したテストは0として扱います）。最終スコアは全テストにわたる平均値です。高いほど良く、タスクを正しくかつ迅速に解決するモデルが最高スコアを獲得します。
ステップ比（Step ratio）— モデルが実際に要したエージェントステップ数と、私たちが期待していたステップ数の比率を全テストで集約したものです：総実ステップ数 / 総期待ステップ数。値が1.0の場合、モデルは期待された正確な数のステップを使用しました。1.0より大きい場合は、より多くのステップが必要だった（非効率）、1.0より小さい場合は、当初の予想よりも少ないステップで済んだことを意味します。
ツール呼び出し比（Tool call ratio）— ステップ比と同じ考え方ですが、ステップではなく個々のツール呼び出しをカウントします。1.0は予算内、それ以上は予算超過、それ以下は予算未使用となります。

ステップ比とツール呼び出し比は*効率性*指標です。テストの合格・不合格には影響しませんが、モデルがどの程度経済的に回答に到達するかを示すものです。2ステップで解決するモデルは、期待された5ステップと比較して、正確であるだけでなく効率的でもあります。

評価からの知見

これらは初期結果であり、私たちは評価セットの維持と拡張を積極的に進めています。最近の実行状況は、GitHub リポジトリと共有された LangSmith プロジェクトの両方でリアルタイムで確認できます。

オープンモデル

CI 実行の表示（モデル名をクリックすると個別の評価詳細が表示されます）

モデル

正答率

合格数

解決率

ステップ比

ツール呼び出し比

baseten:zai-org/GLM-5

0.64

138 件中 94

1.17

1.02

1.06

ollama:minimax-m2.7

0.57

138 件中 85

0.27

1.02

1.04

image

カテゴリ別正答率:

モデル

会話

ファイル操作

メモリ

検索

要約

ツール使用

ユニットテスト

baseten:zai-org/GLM-5

0.38

0.44

0.6

0.82

ollama:minimax-m2.7:cloud

0.14

0.92

0.38

0.8

0.6

0.87

0.92

フロンティアモデル

CI ランを実行（モデル名をクリックして個別の評価結果を表示）

モデル

正答率

合格数

解決率

ステップ比

ツール呼び出し比

anthropic:claude-opus-4-6

0.68

138件中 100件

0.38

0.99

1.02

google_genai:gemini-3.1-pro-preview

0.65

138件中 96件

0.26

0.99

1.01

openai:gpt-5.4

0.61

138件中 91件

0.61

1.05

1.15

image

カテゴリ別正答率:

モデル

会話

ファイル操作

メモリ

検索

要約

ツール使用

ユニットテスト

anthropic:claude-opus-4-6

0.05

0.67

0.87

google_genai:gemini-3.1-pro-preview

0.24

0.92

0.62

0.8

0.79

0.92

各モデルについては、プロバイダーのデフォルト思考レベルを使用することを選択します。Gemini 3+ の場合、これは「高」です。OpenAI の場合、「中」です。Claude の場合、拡張思考なしです。

DIY: ローカルで Deep Agent 評価を実行する

私たちの CI（継続的インテグレーション）は、52 のモデルにわたる同じ評価スイートを実行します。これらはグループ化されており、その中にはオープングループも含まれています（baseten:zai-org/GLM-5, ollama:minimax-m2.7:cloud, ollama:nemotron-3-super）。このグループはすべての評価ワークフローで実行されます。任意のモデルグループをターゲットにできます：

すべてのオープンモデルに対して評価を実行する

pytest tests/evals --model-group open

特定のモデルに対して実行する

pytest tests/evals --model baseten:zai-org/GLM-5

これにより、同じタスクにおいて、同じ評価基準を用いて、オープンモデル同士やクローズドな最前線モデルとの比較が容易になります。

Deep Agents SDK でオープンモデルを使用する

オープンモデルへの切り替えは、1 行の変更で完了します：

GLM-5:

pip install langchain-baseten

from deepagents import create_deep_agent

agent = create_deep_agent(model="baseten:zai-org/GLM-5")

MiniMax M2.7:

pip install langchain-openrouter

from deepagents import create_deep_agent

agent = create_deep_agent(model="openrouter:minimax/minimax-m2.7")

これで完了です。ハーンレスは残りの処理を担い、モデルのコンテキストウィンドウサイズを検出し、サポートされていないモダリティを無効化し、システムプロンプトに適切なアイデンティティを注入して、エージェントが何を取り扱っているかを認識できるようにします。

同じオープンモデルは、複数のプロバイダーを通じて利用可能な場合がほとんどです。ご自身の制約条件に合致するものをお選びください。例えば、GLM-5 は baseten:zai-org/GLM-5、fireworks:fireworks/glm-5、またはセルフホスト用の ollama:glm-5 として提供されています。同じモデルで、同じハーンレスですが、インフラストラクチャは異なります。

LangChain は、最も人気のあるオープンモデルプロバイダーのサポートを提供しています。今回のリリースでテスト済みのプロバイダーは以下の通りです：Baseten、Fireworks、Groq、OpenRouter、および Ollama（クラウド版）。

モデルごとのハーンレベルでの調整

オープンモデルは、クローズドなフロンティアモデルとは異なり、コンテキストウィンドウ、ツール呼び出しのフォーマット、そして失敗モードが異なります。Deep Agents のハーンレスはこれらの違いを吸収するため、ユーザーが気にする必要はありません：

モデルアイデンティティの注入 — システムプロンプトはランタイムでパッチされ、モデル名、プロバイダー、コンテキスト制限、およびサポートされるモダリティが含まれます。エージェント自身が何者であり、何をできるかを認識します。
コンテキスト管理 — 圧縮、オフローディング、要約の閾値は、ハードコードされたデフォルトではなく、モデルの実際のコンテキストウィンドウに合わせて適応されます。4K のコンテキストを持つモデルは、1M のコンテキストを持つ Opus よりもより積極的な圧縮を行います。

Deep Agents CLI

各モデルは、Deep Agents CLI でも利用可能です。Deep Agents CLI は、LangChain AI が提供するオープンソースのコーディングエージェントであり、Claude Code の代替手段です。

Deep Agents SDK に含まれるすべての機能に加え、CLI ではランタイムでのモデル切り替えをサポートしています。セッション中にエージェントを再起動せずにモデルを切り替えることを可能にする新しいミドルウェア（ConfigurableModelMiddleware）を導入しました。これにより、計画には最先端モデルを、実行にはオープンソースモデルを使用するといったパターンが可能になります。

/model スラッシュコマンドを使用して、セッション中にモデルを切り替えることができます。これにより、計画には最先端モデルでタスクを開始し、実行にはより安価なオープンソースモデルに切り替えるといったパターンを実現できます。

今後の展望

近日公開予定のいくつかの話題をご紹介します：

特定のオープンソースモデルファミリー向けのハッチング調整パターンのドキュメント化
マルチモデルサブエージェント構成のテスト（例：最先端クローズドモデルによるオーケストレーターと、オープンソースモデルによるサブエージェント）

オープンソースモデルは、今日のエージェントでも十分に機能します。私たちは、優れたハッチを構築し、タスクにとって重要なものを測定するターゲット型の評価を構築するために役立つ設計パターンを示すことを目指しています。

Deep Agents はオープンソースです。お好みのオープンモデルでお試しください。私たちと一緒に素晴らしい評価（evals）やエージェントを構築しましょう。

異なるモデルとよく動作するように Deep Agents をチューニングする

image

V. Trivedy,

M. Daugherty

2026 年 4 月 29 日

image

5 分

image.png)

エージェントアーキテクチャ（Agent Architecture）

LangSmith

オープンソース

EU AI 法（EU AI Act）の要件を満たすために LangSmith と LangChain OSS がどのように役立つか

image

J. Talbot,

B. Weng

2026 年 4 月 27 日

image

7 分

image

Conceptual Guide

Deep Agents

本番環境のディープエージェントを支えるランタイム

image

S. Runkle,

V. Trivedy

2026 年 4 月 20 日

image

24 分

image

エージェントが実際に何をしているかを確認する

LangSmith は、開発者がエージェントのすべての意思決定をデバッグし、変更の評価を行い、ワンクリックでデプロイできるためのエージェントエンジニアリングプラットフォームです。

原文を表示

Key Takeaways

TL;DR: Open models like GLM-5 and MiniMax M2.7 now match closed frontier models on core agent tasks — file operations, tool use, and instruction following — at a fraction of the cost and latency. Here's what our evals show and how to start using them in Deep Agents.

Over the past few weeks, we’ve been running open weight Large Language Models through Deep Agents harness evaluations, and the initial results show they are a viable option to use instead of, and alongside, closed frontier models. GLM-5 (z.ai) and MiniMax M2.7 each score similarly to closed frontier models on core agent tasks such as file operations, tool use, and instruction following.

This isn’t surprising if you’ve been following open model progress via the large set of open benchmarks such as SWE-Rebench and Terminal Bench 2.0. Tool calling is reliable and instruction following is consistent. For developers deploying agents in production, open models now offer a level of consistency and predictability that makes real-world workflows much more viable.

Why open models

When exploring open models, builders and customers tend to focus on a few key factors: cost, latency, and task performance.

In the limit, it would be great to use the smartest frontier model at the highest reasoning level for every task. In practice, two constraints make that unworkable: cost and latency. Closed frontier models can run 8–10x more expensive for high-throughput workloads, and they're often too slow for the response times users expect in interactive products.

Model

Type

Input ($/M tokens)

Output ($/M tokens)

Claude Opus 4.6 (Anthropic)

Closed

$5.00

$25.00

Claude Sonnet 4.6 (Anthropic)

Closed

$3.00

$15.00

GPT-5.4 (OpenAI)

Closed

$2.50

$15.00

GLM-5 (Baseten)

Open

$0.95

$3.15

MiniMax M2.7 (OpenRouter)

Open

$0.30

$1.20

*To put the pricing in context: an application outputting 10M tokens/day costs roughly $250/day on Opus 4.6 versus ~$12/day for MiniMax M2.7. That's about a $87k annual difference.*

Open models tend to be smaller than closed frontier models, and can be accelerated on specialized inference infrastructure — providers like Groq, Fireworks, and Baseten optimize for latency and throughput far beyond what most teams could achieve on their own. OpenRouter data show GLM-5 on Baseten averaging 0.65s latency and 70 tokens/second, compared to 2.56s and 34 tokens/second for Claude Opus 4.6. For latency-sensitive products, that gap is hard to engineer around.

How we evaluated

We've written about our eval methodology in depth in How we build evals for Deep Agents. We run evals using hosted inference providers, but Deep Agents can be run using fully local and private models via Ollama, vLLM, etc.

For open models, we ran seven eval categories: file operations, tool use, retrieval, conversation, memory, summarization, and “unit tests”. These cover tasks that exercise fundamentals: can the model reliably call tools, follow structured instructions, and operate on files? These are the capabilities that gate whether a model is usable in an agentic harness at all.

Each eval case defines success assertions (hard-fail checks that determine correctness) and efficiency assertions (soft checks that measure how the model got there). We report four metrics:

Correctness — the fraction of tests the model solved: passed / total. A score of 0.68 means 68% of test cases were solved correctly. This is the primary quality signal.
Solve rate — a combined measure of accuracy and speed. For each test, we compute expected_steps / wall_clock_seconds; failed tests contribute zero. The final score is the average across all tests. Higher is better — a model that solves tasks both correctly and quickly scores highest.
Step ratio — how many agentic steps the model actually took compared to how many we expected, aggregated across all tests: total_actual_steps / total_expected_steps. A value of 1.0 means the model used exactly the expected number of steps. Above 1.0 means it needed more (less efficient); below 1.0 means it needed fewer steps than initially expected.
Tool call ratio — same idea as step ratio, but counting individual tool calls instead of steps. 1.0 is on-budget, above is over-budget, below is under-budget.

Step ratio and tool call ratio are *efficiency* metrics. They don't affect whether a test passes, but they reveal how economically a model reaches the answer. A model that solves a task in 2 steps instead of the expected 5 is both correct *and* efficient.

Findings from our evals

These are early results; we’re actively maintaining and expanding our eval set. You can view recent runs in realtime both in our GitHub repo and at this shared LangSmith project.

Open models

View CI run (click model names to view individual evals)

Model

Correctness

Passed

Solve Rate

Step Ratio

Tool Call Ratio

baseten:zai-org/GLM-5

0.64

94 of 138

1.17

1.02

1.06

ollama:minimax-m2.7

0.57

85 of 138

0.27

1.02

1.04

Per-category correctness:

Model

Conversation

File Ops

Memory

Retrieval

Summarization

Tool Use

Unit Test

baseten:zai-org/GLM-5

0.38

0.44

0.6

0.82

ollama:minimax-m2.7:cloud

0.14

0.92

0.38

0.8

0.6

0.87

0.92

Frontier models

View CI run (click model names to view individual evals)

Model

Correctness

Passed

Solve Rate

Step Ratio

Tool Call Ratio

anthropic:claude-opus-4-6

0.68

100 of 138

0.38

0.99

1.02

google_genai:gemini-3.1-pro-preview

0.65

96 of 138

0.26

0.99

1.01

openai:gpt-5.4

0.61

91 of 138

0.61

1.05

1.15

Per-category correctness:

Model

Conversation

File Ops

Memory

Retrieval

Summarization

Tool Use

Unit Test

anthropic:claude-opus-4-6

0.05

0.67

0.87

google_genai:gemini-3.1-pro-preview

0.24

0.92

0.62

0.8

0.79

0.92

openai:gpt-5.4

0.29

0.44

0.8

0.76

‍*For each model, we opt to use the provider’s default thinking level.*‍*For Gemini 3+, this is high*‍*For OpenAI, this is medium*‍*For Claude, this is without extended thinking*

DIY: Run Deep Agent evals locally

Our CI runs the same evaluation suite across 52 models organized into groups — including an open group (

code

baseten:zai-org/GLM-5

code

ollama:minimax-m2.7:cloud

code

ollama:nemotron-3-super

) that runs on every eval workflow. You can target any model group:

code

Run evals against all open models

pytest tests/evals --model-group open

Run against a specific model

pytest tests/evals --model baseten:zai-org/GLM-5

code

This makes it straightforward to compare open models against each other and against closed frontier models on the same tasks, using the same grading criteria.

Using open models in Deep Agents SDK

Swapping to an open model is a one-line change:

GLM-5:

code

pip install langchain-baseten

from deepagents import create_deep_agent

agent = create_deep_agent(model="baseten:zai-org/GLM-5")

code

MiniMax M2.7:

code

pip install langchain-openrouter

from deepagents import create_deep_agent

agent = create_deep_agent(model="openrouter:minimax/minimax-m2.7")

code

That's it. The harness handles the rest — it detects the model's context window size, disables unsupported modalities, and injects the right identity into the system prompt so the agent knows what it's working with.

The same open model is often available through multiple providers. Pick the one that matches your constraints. For example, GLM-5 is available as

code

baseten:zai-org/GLM-5

code

fireworks:fireworks/glm-5

, or

code

ollama:glm-5

for self-hosted. Same model, same harness, different infrastructure.

LangChain provides support for the most popular open model providers. The providers we have tested for this release are: Baseten, Fireworks, Groq, OpenRouter, and Ollama (cloud).

Harness-level adjustments for your model

Open models have different context windows, different tool-calling formats, and different failure modes than closed frontier models. The Deep Agents harness absorbs these differences so you don't have to:

Model identity injection — the system prompt is patched at runtime with the model's name, provider, context limit, and supported modalities. The agent knows what it is and what it can do.
Context management — compression, offloading, and summarization thresholds adapt to the model's actual context window, not a hardcoded default. A model with a 4K context gets more aggressive compaction than Opus with 1M.

Deep Agents CLI

Each model is also available in the Deep Agents CLI. The Deep Agents CLI is our open-source coding agent and alternative to Claude Code.

In addition to all the capabilities in Deep Agents SDK, the CLI supports Runtime model swapping. We introduced a new middleware (ConfigurableModelMiddleware ) to enable switching models mid-session without restarting the agent. This enables patterns like using a frontier model for planning and an open model for execution.

You can switch models mid-session with the /model slash command. This enables patterns like starting a task with a frontier model for planning, then switching to a cheaper open model for execution.

What’s next

Some things we’re excited to share soon:

Documenting harness tuning patterns for specific open model families
Testing multi-model subagent configurations (ex: frontier closed model orchestrator + open model subagents)

Open models work for agents today. We want to show the design patterns that help us engineer a good harness and build targeted evals that measure what matters for your task.

Deep Agents is open source. Try it with your preferred open model and come build great evals and agents with us.

Tuning Deep Agents to Work Well with Different Models

V. Trivedy,

M. Daugherty

April 29, 2026

min

.png)

Agent Architecture

LangSmith

Open Source

How LangSmith and LangChain OSS Help You Meet EU AI Act Requirements

J. Talbot,

B. Weng

April 27, 2026

min

Conceptual Guide

Deep Agents

The runtime behind production deep agents

S. Runkle,

V. Trivedy

April 20, 2026

min

See what your agent is really doing

LangSmith, our agent engineering platform, helps developers debug every agent decision, eval changes, and deploy in one click.

この記事をシェア

Simon Willison Blog2026年7月3日 23:50

2026年6月ニュースレター：Claude Fable 5、GPT-5.6、輸出規制、GLM-5.2の登場など

TechCrunch AI2026年7月5日 00:51

ミストラル AI とは？OpenAI の競合企業に関する全知識

MarkTechPost重要度52026年7月4日 07:20

Mistral AI、Apache-2.0ライセンスのLean 4用コードエージェント「Leanstral 1.5」を公開しPutnamBenchで672問中587問を解決

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

LangChain Blog·2026年5月4日 22:17·約13分

オープンモデルが新たな閾値を突破

#LLM #エージェント #オープンソースモデル #ツール呼び出し #GLM-5 #MiniMax

TL;DR

AI深層分析2026年7月5日 08:10

重要/ 5段階

深度40%

キーポイント

オープンモデルの実用性確立

コストとレイテンシの劇的改善

信頼性の高いツール呼び出し

影響分析・編集コメントを表示

影響分析

編集コメント

image

主要なポイント

なぜオープンモデルなのか

Model Type Input ($/M tokens) Output ($/M tokens)

Claude Opus 4.6 (Anthropic) Closed $5.00 $25.00

Claude Sonnet 4.6 (Anthropic) Closed $3.00 $15.00

GPT-5.4 (OpenAI) Closed $2.50 $15.00

GLM-5 (Baseten) Open $0.95 $3.15

MiniMax M2.7 (OpenRouter) Open $0.30 $1.20

評価方法について

正確性 — モデルが解決したテストの割合：合格数 / 総数。スコア0.68とは、テストケースの68%が正しく解決されたことを意味します。これは主要な品質シグナルです。
ソルブレート（Solve rate）— 精度と速度を組み合わせた測定値。各テストについて、期待されるステップ数を壁時計秒数で割って計算します（失敗したテストは0として扱います）。最終スコアは全テストにわたる平均値です。高いほど良く、タスクを正しくかつ迅速に解決するモデルが最高スコアを獲得します。
ステップ比（Step ratio）— モデルが実際に要したエージェントステップ数と、私たちが期待していたステップ数の比率を全テストで集約したものです：総実ステップ数 / 総期待ステップ数。値が1.0の場合、モデルは期待された正確な数のステップを使用しました。1.0より大きい場合は、より多くのステップが必要だった（非効率）、1.0より小さい場合は、当初の予想よりも少ないステップで済んだことを意味します。
ツール呼び出し比（Tool call ratio）— ステップ比と同じ考え方ですが、ステップではなく個々のツール呼び出しをカウントします。1.0は予算内、それ以上は予算超過、それ以下は予算未使用となります。

評価からの知見

オープンモデル

CI 実行の表示（モデル名をクリックすると個別の評価詳細が表示されます）

モデル

正答率

合格数

解決率

ステップ比

ツール呼び出し比

baseten:zai-org/GLM-5

0.64

138 件中 94

1.17

1.02

1.06

ollama:minimax-m2.7

0.57

138 件中 85

0.27

1.02

1.04

image

カテゴリ別正答率:

モデル

会話

ファイル操作

メモリ

検索

要約

ツール使用

ユニットテスト

baseten:zai-org/GLM-5

0.38

0.44

0.6

0.82

ollama:minimax-m2.7:cloud

0.14

0.92

0.38

0.8

0.6

0.87

0.92

フロンティアモデル

CI ランを実行（モデル名をクリックして個別の評価結果を表示）

モデル

正答率

合格数

解決率

ステップ比

ツール呼び出し比

anthropic:claude-opus-4-6

0.68

138件中 100件

0.38

0.99

1.02

google_genai:gemini-3.1-pro-preview

0.65

138件中 96件

0.26

0.99

1.01

openai:gpt-5.4

0.61

138件中 91件

0.61

1.05

1.15

image

カテゴリ別正答率:

モデル

会話

ファイル操作

メモリ

検索

要約

ツール使用

ユニットテスト

anthropic:claude-opus-4-6

0.05

0.67

0.87

google_genai:gemini-3.1-pro-preview

0.24

0.92

0.62

0.8

0.79

0.92

DIY: ローカルで Deep Agent 評価を実行する

すべてのオープンモデルに対して評価を実行する

pytest tests/evals --model-group open

特定のモデルに対して実行する

pytest tests/evals --model baseten:zai-org/GLM-5

これにより、同じタスクにおいて、同じ評価基準を用いて、オープンモデル同士やクローズドな最前線モデルとの比較が容易になります。

Deep Agents SDK でオープンモデルを使用する

オープンモデルへの切り替えは、1 行の変更で完了します：

GLM-5:

pip install langchain-baseten

from deepagents import create_deep_agent

agent = create_deep_agent(model="baseten:zai-org/GLM-5")

MiniMax M2.7:

pip install langchain-openrouter

from deepagents import create_deep_agent

agent = create_deep_agent(model="openrouter:minimax/minimax-m2.7")

モデルごとのハーンレベルでの調整

モデルアイデンティティの注入 — システムプロンプトはランタイムでパッチされ、モデル名、プロバイダー、コンテキスト制限、およびサポートされるモダリティが含まれます。エージェント自身が何者であり、何をできるかを認識します。
コンテキスト管理 — 圧縮、オフローディング、要約の閾値は、ハードコードされたデフォルトではなく、モデルの実際のコンテキストウィンドウに合わせて適応されます。4K のコンテキストを持つモデルは、1M のコンテキストを持つ Opus よりもより積極的な圧縮を行います。

Deep Agents CLI

今後の展望

近日公開予定のいくつかの話題をご紹介します：

特定のオープンソースモデルファミリー向けのハッチング調整パターンのドキュメント化
マルチモデルサブエージェント構成のテスト（例：最先端クローズドモデルによるオーケストレーターと、オープンソースモデルによるサブエージェント）

異なるモデルとよく動作するように Deep Agents をチューニングする

image

V. Trivedy,

M. Daugherty

2026 年 4 月 29 日

image

5 分

image.png)

エージェントアーキテクチャ（Agent Architecture）

LangSmith

オープンソース

EU AI 法（EU AI Act）の要件を満たすために LangSmith と LangChain OSS がどのように役立つか

image

J. Talbot,

B. Weng

2026 年 4 月 27 日

image

7 分

image

Conceptual Guide

Deep Agents

本番環境のディープエージェントを支えるランタイム

image

S. Runkle,

V. Trivedy

2026 年 4 月 20 日

image

24 分

image

エージェントが実際に何をしているかを確認する

原文を表示

Key Takeaways

Why open models

When exploring open models, builders and customers tend to focus on a few key factors: cost, latency, and task performance.

Model

Type

Input ($/M tokens)

Output ($/M tokens)

Claude Opus 4.6 (Anthropic)

Closed

$5.00

$25.00

Claude Sonnet 4.6 (Anthropic)

Closed

$3.00

$15.00

GPT-5.4 (OpenAI)

Closed

$2.50

$15.00

GLM-5 (Baseten)

Open

$0.95

$3.15

MiniMax M2.7 (OpenRouter)

Open

$0.30

$1.20

*To put the pricing in context: an application outputting 10M tokens/day costs roughly $250/day on Opus 4.6 versus ~$12/day for MiniMax M2.7. That's about a $87k annual difference.*

How we evaluated

Each eval case defines success assertions (hard-fail checks that determine correctness) and efficiency assertions (soft checks that measure how the model got there). We report four metrics:

Correctness — the fraction of tests the model solved: passed / total. A score of 0.68 means 68% of test cases were solved correctly. This is the primary quality signal.
Solve rate — a combined measure of accuracy and speed. For each test, we compute expected_steps / wall_clock_seconds; failed tests contribute zero. The final score is the average across all tests. Higher is better — a model that solves tasks both correctly and quickly scores highest.
Step ratio — how many agentic steps the model actually took compared to how many we expected, aggregated across all tests: total_actual_steps / total_expected_steps. A value of 1.0 means the model used exactly the expected number of steps. Above 1.0 means it needed more (less efficient); below 1.0 means it needed fewer steps than initially expected.
Tool call ratio — same idea as step ratio, but counting individual tool calls instead of steps. 1.0 is on-budget, above is over-budget, below is under-budget.

Findings from our evals

These are early results; we’re actively maintaining and expanding our eval set. You can view recent runs in realtime both in our GitHub repo and at this shared LangSmith project.

Open models

View CI run (click model names to view individual evals)

Model

Correctness

Passed

Solve Rate

Step Ratio

Tool Call Ratio

baseten:zai-org/GLM-5

0.64

94 of 138

1.17

1.02

1.06

ollama:minimax-m2.7

0.57

85 of 138

0.27

1.02

1.04

Per-category correctness:

Model

Conversation

File Ops

Memory

Retrieval

Summarization

Tool Use

Unit Test

baseten:zai-org/GLM-5

0.38

0.44

0.6

0.82

ollama:minimax-m2.7:cloud

0.14

0.92

0.38

0.8

0.6

0.87

0.92

Frontier models

View CI run (click model names to view individual evals)

Model

Correctness

Passed

Solve Rate

Step Ratio

Tool Call Ratio

anthropic:claude-opus-4-6

0.68

100 of 138

0.38

0.99

1.02

google_genai:gemini-3.1-pro-preview

0.65

96 of 138

0.26

0.99

1.01

openai:gpt-5.4

0.61

91 of 138

0.61

1.05

1.15

Per-category correctness:

Model

Conversation

File Ops

Memory

Retrieval

Summarization

Tool Use

Unit Test

anthropic:claude-opus-4-6

0.05

0.67

0.87

google_genai:gemini-3.1-pro-preview

0.24

0.92

0.62

0.8

0.79

0.92

openai:gpt-5.4

0.29

0.44

0.8

0.76

‍*For each model, we opt to use the provider’s default thinking level.*‍*For Gemini 3+, this is high*‍*For OpenAI, this is medium*‍*For Claude, this is without extended thinking*

DIY: Run Deep Agent evals locally

Our CI runs the same evaluation suite across 52 models organized into groups — including an open group (

code

baseten:zai-org/GLM-5

code

ollama:minimax-m2.7:cloud

code

ollama:nemotron-3-super

) that runs on every eval workflow. You can target any model group:

code

Run evals against all open models

pytest tests/evals --model-group open

Run against a specific model

pytest tests/evals --model baseten:zai-org/GLM-5

code

This makes it straightforward to compare open models against each other and against closed frontier models on the same tasks, using the same grading criteria.

Using open models in Deep Agents SDK

Swapping to an open model is a one-line change:

GLM-5:

code

pip install langchain-baseten

from deepagents import create_deep_agent

agent = create_deep_agent(model="baseten:zai-org/GLM-5")

code

MiniMax M2.7:

code

pip install langchain-openrouter

from deepagents import create_deep_agent

agent = create_deep_agent(model="openrouter:minimax/minimax-m2.7")

code

The same open model is often available through multiple providers. Pick the one that matches your constraints. For example, GLM-5 is available as

code

baseten:zai-org/GLM-5

code

fireworks:fireworks/glm-5

, or

code

ollama:glm-5

for self-hosted. Same model, same harness, different infrastructure.

LangChain provides support for the most popular open model providers. The providers we have tested for this release are: Baseten, Fireworks, Groq, OpenRouter, and Ollama (cloud).

Harness-level adjustments for your model

Model identity injection — the system prompt is patched at runtime with the model's name, provider, context limit, and supported modalities. The agent knows what it is and what it can do.
Context management — compression, offloading, and summarization thresholds adapt to the model's actual context window, not a hardcoded default. A model with a 4K context gets more aggressive compaction than Opus with 1M.

Deep Agents CLI

Each model is also available in the Deep Agents CLI. The Deep Agents CLI is our open-source coding agent and alternative to Claude Code.

You can switch models mid-session with the /model slash command. This enables patterns like starting a task with a frontier model for planning, then switching to a cheaper open model for execution.

What’s next

Some things we’re excited to share soon:

Documenting harness tuning patterns for specific open model families
Testing multi-model subagent configurations (ex: frontier closed model orchestrator + open model subagents)

Open models work for agents today. We want to show the design patterns that help us engineer a good harness and build targeted evals that measure what matters for your task.

Deep Agents is open source. Try it with your preferred open model and come build great evals and agents with us.

Tuning Deep Agents to Work Well with Different Models

V. Trivedy,

M. Daugherty

April 29, 2026

min

.png)

Agent Architecture

LangSmith

Open Source

How LangSmith and LangChain OSS Help You Meet EU AI Act Requirements

J. Talbot,

B. Weng

April 27, 2026

min

Conceptual Guide

Deep Agents

The runtime behind production deep agents

S. Runkle,

V. Trivedy

April 20, 2026

min

See what your agent is really doing

LangSmith, our agent engineering platform, helps developers debug every agent decision, eval changes, and deploy in one click.

この記事をシェア

Simon Willison Blog2026年7月3日 23:50

2026年6月ニュースレター：Claude Fable 5、GPT-5.6、輸出規制、GLM-5.2の登場など

TechCrunch AI2026年7月5日 00:51

ミストラル AI とは？OpenAI の競合企業に関する全知識

MarkTechPost重要度52026年7月4日 07:20

Mistral AI、Apache-2.0ライセンスのLean 4用コードエージェント「Leanstral 1.5」を公開しPutnamBenchで672問中587問を解決

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

キーポイント

影響分析

編集コメント

主要なポイント

なぜオープンモデルなのか

評価方法について

評価からの知見

オープンモデル

フロンティアモデル

DIY: ローカルで Deep Agent 評価を実行する

すべてのオープンモデルに対して評価を実行する

特定のモデルに対して実行する

Deep Agents SDK でオープンモデルを使用する

pip install langchain-baseten

pip install langchain-openrouter

モデルごとのハーンレベルでの調整

Deep Agents CLI

今後の展望

関連コンテンツ

異なるモデルとよく動作するように Deep Agents をチューニングする

EU AI 法（EU AI Act）の要件を満たすために LangSmith と LangChain OSS がどのように役立つか

本番環境のディープエージェントを支えるランタイム

エージェントが実際に何をしているかを確認する

Key Takeaways

Why open models

How we evaluated

Findings from our evals

Open models

Frontier models

DIY: Run Deep Agent evals locally

Run evals against all open models

Run against a specific model

Using open models in Deep Agents SDK

pip install langchain-baseten

pip install langchain-openrouter

Harness-level adjustments for your model

Deep Agents CLI

What’s next

Related content

Tuning Deep Agents to Work Well with Different Models

How LangSmith and LangChain OSS Help You Meet EU AI Act Requirements

The runtime behind production deep agents

See what your agent is really doing

関連記事

キーポイント

影響分析

編集コメント

主要なポイント

なぜオープンモデルなのか

評価方法について

評価からの知見

オープンモデル

フロンティアモデル

DIY: ローカルで Deep Agent 評価を実行する

すべてのオープンモデルに対して評価を実行する

特定のモデルに対して実行する

Deep Agents SDK でオープンモデルを使用する

pip install langchain-baseten

pip install langchain-openrouter

モデルごとのハーンレベルでの調整

Deep Agents CLI

今後の展望

関連コンテンツ

異なるモデルとよく動作するように Deep Agents をチューニングする

EU AI 法（EU AI Act）の要件を満たすために LangSmith と LangChain OSS がどのように役立つか

本番環境のディープエージェントを支えるランタイム

エージェントが実際に何をしているかを確認する

Key Takeaways

Why open models

How we evaluated

Findings from our evals

Open models

Frontier models

DIY: Run Deep Agent evals locally

Run evals against all open models

Run against a specific model

Using open models in Deep Agents SDK

pip install langchain-baseten

pip install langchain-openrouter

Harness-level adjustments for your model