LangChain Blog · July 1, 2026, 00:22 · approx. 12 min

Integrating Harbor and LangChain: a unified stack for agent evaluation

#Agent Evaluation #LangChain #Harbor #LLM Ops #Agentic Workflow

TL;DR

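LangChain has announced an integration with Harbor, a platform for efficiently validating and evaluating agent performance, giving developers a unified evaluation environment.

AI in-depth analysis · July 1, 2026, 01:02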

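Key points

Unified agent evaluation

LangChain has begun integrating with Harbor, a platform that lets developers validate agent performance efficiently.

A validation environment

The integration simplifies a complex evaluation process, so developers can verify agent behavior faster and more accurately.

Stronger developer support

By streamlining the evaluation phase, which is often a bottleneck in agent development, the integration aims to shorten the cycle to production.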

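Impact analysis

This news is an important step toward lowering the hurdle of evaluation and validation, one of the hardest problems in putting generative AI agents into production. Combining LangChain's ecosystem with Harbor's specialization lays the groundwork for developers to build and release more reliable agents faster. That is expected to raise the maturity of the AI agent market as a whole and accelerate adoption.

Editor's comment

Handing the evaluation phase of agent development to a dedicated platform is an important turning point toward industry standardization. Together with a stronger LangChain ecosystem, it should bring production-grade agents within easier reach.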

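Key takeaways

One small entry point connects your agent to Harbor. A langgraph.json registry plus a make_graph factory is the only glue you write, and that factory can stay model-agnostic by reading the model Harbor passes from the command line.
Cloud sandboxes let you scale evals horizontally and run agents in isolation. Each trial gets a fresh LangSmith sandbox, so trials never share state, and you can run hundreds in parallel instead of working through them serially on one machine.
Traces turn scores into explanations. With the langsmith plugin, every job lands as a dataset and experiment with the verifier's reward as feedback, and agent traces attach directly, so you can see why a trial passed or failed, not just whether it did.

As agents become more capable, evaluating them has become harder. Agent harnesses such as Claude Code, Pi, and Deep Agents now give agents access to an entire computer to read files, execute scripts, run code, and more. Every agent now needs to run in its own clean, reproducible environment for a given task.

Evaluating long-running, stateful agents requires a new eval runner. Harbor has emerged as the industry leader in this space. This post first explains why everyone running agent evals should know what Harbor is, and then shows how to integrate Deep Agents, LangSmith Sandboxes, and LangSmith Experiments into Harbor.

Ultimately, agents need to run in a real, reproducible, isolated environment, many times in parallel, with a deterministic check at the end. Harbor solves this problem and is now wired directly into Deep Agents, LangSmith Sandboxes, and LangSmith Observability.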

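How Harbor works

Harbor is an eval harness. You bring three things:

Your agent
Your dataset
Your sandbox

Each dataset contains tasks, and each task consists of:

An environment (Dockerfile / Docker Compose YAML)
An instruction (Markdown)
An evaluation script (test.sh)

Compared to simpler LLM evaluation, there are two main differences:

The environment the agent runs in matters so much that it has to be called out as part of the task. Simpler LLM evals do not need an environment because they just call the LLM. Agents do need one.
The agent is judged by a script. An agent often produces other files or modifies state in some way, so looking at its final response is not enough. You also need to check the artifacts it creates along the way.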
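To make the second point concrete, here is a minimal sketch of artifact-based judging written in Python. It is not Harbor's verifier or the contents of any real test.sh: the function name score_trial, the file name report.txt, and the marker string DONE are hypothetical, chosen only to show a check that inspects what the agent left on disk rather than what it said.

```python
from pathlib import Path


def score_trial(workdir: Path, expected_file: str = "report.txt",
                required_text: str = "DONE") -> float:
    """Return 1.0 if the agent left the expected artifact behind, else 0.0.

    All names here are illustrative assumptions, not part of Harbor.
    """
    artifact = workdir / expected_file
    # A missing artifact is a failure, whatever the agent's final message said.
    if not artifact.is_file():
        return 0.0
    # Judge the contents of the artifact deterministically.
    return 1.0 if required_text in artifact.read_text() else 0.0
```

The check is deterministic: the same files on disk always yield the same reward, which is what makes repeated trials comparable.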

LangChain plugs into Harbor in three places. It integrates with Deep Agents, so any Deep Agent you build can run inside Harbor's sandboxed environment. It integrates with LangSmith Sandboxes, so Harbor can run each task in a LangSmith sandbox, giving each run its own clean machine. And it integrates with LangSmith Observability, the evaluation platform where you view results in detail: every job lands as a dataset and experiment, with agent traces attached when the agent supports tracing.

Unifying LangChain agents with Harbor

You plug a custom agent into Harbor through its built-in LangGraph agent, selected with --agent langgraph. It runs any LangGraph application, including Deep Agents.

Harbor treats langgraph.json as a registry. It lists the dependencies your agent needs and maps a graph name to the function that builds it:

{
  "dependencies": [
    "deepagents>=0.6.10,<0.7.0",
    "langchain-fireworks>=1.3.1,<1.4.0"
  ],
  "graphs": {
    "deep_agent": "./agent.py:make_graph"
  }
}

Here deep_agent resolves to make_graph in agent.py, which builds your Deep Agent and returns the compiled graph Harbor invokes:

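from deepagents import create_deep_agent
from deepagents.backends import LocalShellBackend


def make_graph():
    # Entry point that Harbor calls to build the compiled graph.
    return create_deep_agent(
        model="fireworks:accounts/fireworks/models/glm-5p2",
        # LocalShellBackend gives the agent real file and shell access.
        backend=LocalShellBackend(),
    )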

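This is the only glue you write. Your agent stays your own code; make_graph is just the entry point Harbor calls. By default, create_deep_agent keeps files in an in-memory virtual filesystem that never touches the sandbox, so pair it with a LocalShellBackend to give the agent real file and shell access to the environment Harbor runs it in.

For every trial, Harbor copies this agent into that trial's sandbox, installs the langgraph.json dependencies into a fresh virtual environment there, and runs the graph inside the container. Each sandbox gets its own copy, so trials never share state and your agent runs in full isolation.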
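As an illustration of how a registry entry such as "./agent.py:make_graph" can be read, the sketch below parses a langgraph.json document and splits the entry into a file path and a function name. This is not Harbor's or LangGraph's actual loader; resolve_graph_entry is a hypothetical helper, and it only assumes the "path:attribute" shape shown in the registry above.

```python
import json
from pathlib import Path
from typing import Tuple


def resolve_graph_entry(registry_text: str, graph_name: str) -> Tuple[Path, str]:
    """Split a registry entry like './agent.py:make_graph' into (file, attribute).

    Hypothetical helper for illustration; the real loader may differ.
    """
    registry = json.loads(registry_text)
    spec = registry["graphs"][graph_name]  # KeyError if the graph is not registered
    # Split on the last colon so the file part may itself contain colons.
    file_part, _, attribute = spec.rpartition(":")
    if not file_part or not attribute:
        raise ValueError(f"expected 'path:attribute', got {spec!r}")
    return Path(file_part), attribute


REGISTRY = """
{
  "dependencies": ["deepagents>=0.6.10,<0.7.0"],
  "graphs": {"deep_agent": "./agent.py:make_graph"}
}
"""

path, attribute = resolve_graph_entry(REGISTRY, "deep_agent")
print(path.name, attribute)
```

The graph name passed on the command line (graph=deep_agent) is the key into the "graphs" mapping, which is why the agent code itself never needs to know how it is launched.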

Side note: A graph can hardcode its model, but the entry can also be a factory function that Harbor calls with the run config. Harbor puts the model selected with --model in configurable.model, so the factory below stays model-agnostic and hands whatever you pass on the command line straight to create_deep_agent.

from deepagents import create_deep_agent
from deepagents.backends import LocalShellBackend


def make_graph(config):
    # Harbor passes the run config; --model arrives as configurable.model.
    return create_deep_agent(
        model=config["configurable"]["model"],
        backend=LocalShellBackend(),
    )
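One possible refinement, shown here as a sketch rather than anything the integration requires, is to fall back to a default model when the run config carries none, so the same factory also works when called without --model. The helper pick_model and the DEFAULT_MODEL constant are hypothetical; only the configurable.model key comes from the text above.

```python
from typing import Optional

# Assumed default, reusing the model string from the hardcoded example.
DEFAULT_MODEL = "fireworks:accounts/fireworks/models/glm-5p2"


def pick_model(config: Optional[dict], default: str = DEFAULT_MODEL) -> str:
    """Return configurable.model from the run config, or the default if absent."""
    configurable = (config or {}).get("configurable") or {}
    return configurable.get("model") or default
```

Inside make_graph, the model argument would then become pick_model(config) instead of config["configurable"]["model"], which avoids a KeyError when no model is supplied.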

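Unifying LangSmith sandboxes with Harbor

Running evals in cloud-based sandboxes lets you scale horizontally for much quicker feedback: hundreds of trials at once instead of one machine working through them serially. The sandbox is also a constrained execution environment, which is exactly what a long-running agent that touches its environment needs: a clean, isolated place to act without affecting anything outside it.

Every trial runs in its own cloud sandbox. You bring the LangSmith Sandbox, selected with -e langsmith, but the environment is pluggable. Harbor supports Daytona, Docker, Modal, and E2B too, all interchangeable behind the same -e flag. Switching providers does not touch your agent, dataset, or verifier.

A trial is the atomic unit of work: one run of your agent on one task. Because agents are non-deterministic, you usually run each task more than once. n_attempts is how many times Harbor repeats every task; it averages the scores so that a single lucky or unlucky run does not define the result. Your whole job is therefore n_attempts × tasks: every task, run n_attempts times, each repetition its own trial. Harbor orchestrates all of it.

For each trial, Harbor provisions a fresh sandbox and copies in everything that run needs: your agent code, the task (cached on disk, then loaded into the sandbox VM), and whatever starting files the run begins from. It then runs the agent against the instruction, runs the verifier, and records the result. Harbor averages across trials into a single job result with the metrics you care about.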
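The arithmetic behind a job can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Harbor's aggregation code: summarize_job is a hypothetical function, rewards are assumed to be numbers between 0 and 1, and the job score is taken as the mean of per-task means (which equals the plain mean over trials when every task has the same n_attempts).

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def summarize_job(trials: Iterable[Tuple[str, float]]) -> Tuple[Dict[str, float], float]:
    """Average rewards per task, then average the per-task means into one score."""
    by_task = defaultdict(list)
    for task_id, reward in trials:
        by_task[task_id].append(reward)
    per_task = {task: mean(rewards) for task, rewards in by_task.items()}
    return per_task, mean(list(per_task.values()))


# 2 tasks with n_attempts = 2 gives 2 x 2 = 4 trials.
trials = [("task-a", 1.0), ("task-a", 0.0), ("task-b", 1.0), ("task-b", 1.0)]
per_task, job_score = summarize_job(trials)
print(per_task)   # {'task-a': 0.5, 'task-b': 1.0}
print(job_score)  # 0.75
```

In this example task-a passed only once in two attempts, so a single run would have reported either 0 or 1 for it; averaging over attempts reports 0.5 instead.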

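Unifying LangSmith Observability with Harbor

The harbor-langsmith integration brings first-class support for LangSmith tracing into Harbor, plus logging to datasets and experiments.

Enable it with a single flag, --plugin langsmith. Harbor then records every job to LangSmith: it syncs the dataset, creates an experiment, and logs a run per trial with the verifier's reward as feedback. If the agent under test supports LangSmith tracing, those traces attach directly to the experiment, so you get the full step-by-step trajectory alongside the score. If it does not trace, you still get the dataset, experiment, results, and feedback.

Under Datasets & Experiments you can view all of the active datasets in use.

An experiment is an entire run on a given dataset. To view the individual experiments for a dataset, along with their scores and statistics, click into it.

We believe integrating traces into evals lets you further refine your evals, and in turn better understand and improve your agents. The score tells you whether a trial passed; the trace tells you why.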

The result: a full eval stack for agents

Put together, this is a complete stack for evaluating agents, where each layer does one job well:

Harbor - the eval harness that orchestrates trials.
Deep Agents - for building the agents under test.
LangSmith sandboxes - the isolated cloud execution environment.
LangSmith - the system of record for datasets, experiments, traces, and scores.

And the part you bring stays small:

Your agent, with or without tracing.
Your dataset, remote from a registry or local on disk.
Your cloud sandbox - LangSmith, with -e langsmith.
Your UI view - --plugin langsmith.

If you have a LangSmith account and a dataset, you can try the whole thing by installing Harbor with the langsmith extra, which brings both the LangSmith sandbox environment and the eval plugin. Then set your LangSmith and model credentials, and turn on tracing so the agent's traces attach to the experiment:

pip install "harbor[langsmith]"

export LANGSMITH_API_KEY="<LANGSMITH_API_KEY>"
export LANGSMITH_PROFILE=prod
export LANGSMITH_TRACING=true
export LANGSMITH_PROJECT=harbor-deepagents
export FIREWORKS_API_KEY="<FIREWORKS_API_KEY>"

# --agent / --model / --ak: the agent
# -d: dataset of tasks
# -e: cloud environment
harbor run \
  --agent langgraph \
  --model fireworks:accounts/fireworks/models/glm-5p2 \
  --ak project_path=./deep-agent --ak graph=deep_agent \
  -d terminal-bench@2.0 \
  -e langsmith \
  --plugin langsmith

Read the Harbor integrations docs to get started. For more on running evals in Harbor, see Run evals.


Original text

Key Takeaways

One small entry point connects your agent to Harbor. A langgraph.json registry plus a make_graph factory is the only glue you write, and that factory can stay model-agnostic by reading the model Harbor passes from the command line.
Cloud sandboxes let you scale evals horizontally and run agents in isolation. Each trial gets a fresh LangSmith sandbox, so trials never share state, and you can run hundreds in parallel instead of churning through them serially on one machine.
Traces turn scores into explanations. With the langsmith plugin, every job lands as a dataset and experiment with the verifier's reward as feedback, and agent traces attach directly so you can see why a trial passed or failed, not just whether it did.

As agents increase in capabilities, evaluations have gotten more difficult. Agent harnesses like Claude Code, Pi, and Deep Agents now give agents access to entire computers to read files, execute scripts, run code, and more. Every agent now needs to run in its own clean, reproducible environment for a given task.

Evaluating long-running, stateful agents requires a new eval runner. Harbor has emerged as the industry leader in this space. In this blog, we first explain why everyone running agent evals should know what Harbor is and then show how to integrate Deep Agents, LangSmith Sandboxes, and LangSmith Experiments into Harbor.

We ultimately need to run agents in a real, reproducible, isolated environment, many times in parallel, with a deterministic check at the end. Harbor solves this problem and is now wired directly into Deep Agents, LangSmith Sandboxes, and LangSmith Observability.

How Harbor works

Harbor is an eval harness. You bring three things:

Your agent
Your dataset
Your sandbox

Each dataset has tasks, which consist of:

An Environment (Dockerfile / Docker Compose YAML)
An Instruction (Markdown)
An Evaluation script (test.sh)

Compared to simpler LLM evaluation, there are two main differences:

The environment the agent runs in is very important - so important that it needs to be called out as part of the task! Simpler LLM evals don't need an environment - they just call the LLM. Agents do!
Judging the agent is done with a script. Oftentimes the agent produces other files or modifies state in some way. It's not enough just to look at the agent's final response - you need to look at the artifacts it creates along the way.

LangChain plugs into Harbor in three places. We integrate with Deep Agents so any deep agent you build can run inside Harbor's sandboxed environment. We integrate with LangSmith Sandboxes so Harbor can run each task in a LangSmith sandbox, giving each run its own clean machine. And we integrate with LangSmith Observability, the evaluation platform where you view results in detail: every job lands as a dataset and experiment with agent traces attached when the agent supports them.

Unifying LangChain agents with Harbor

You plug a custom agent into Harbor through its built-in langgraph agent, selected with --agent langgraph. It runs any LangGraph application including Deep Agents.

Harbor treats langgraph.json as a registry. It lists the dependencies your agent needs and maps a graph name to the function that builds it:

code

{
  "dependencies": [
    "deepagents>=0.6.10,<0.7.0",
    "langchain-fireworks>=1.3.1,<1.4.0"
  ],
  "graphs": {
    "deep_agent": "./agent.py:make_graph"
  }
}

code


Here deep_agent resolves to make_graph in agent.py, which builds your Deep Agent and returns the compiled graph Harbor invokes:

code

from deepagents import create_deep_agent
from deepagents.backends import LocalShellBackend
 
 
def make_graph():
    return create_deep_agent(
        model="fireworks:accounts/fireworks/models/glm-5p2",
        backend=LocalShellBackend(),
    )

code


This is the only glue you write. Your agent stays your own code; make_graph is just the entry point Harbor calls. By default create_deep_agent keeps files in an in-memory virtual filesystem that never touches the sandbox, so pair it with a LocalShellBackend to give the agent real file and shell access to the environment Harbor runs it in.

For every [trial](https://www.harborframework.com/docs/core-concepts#trial), Harbor copies this agent into that trial's sandbox, installs the langgraph.json dependencies into a fresh virtual environment there, and runs the graph inside the container. Each sandbox gets its own copy, so trials never share state and your agent runs in full isolation.

Side note: A graph can hardcode its model, but the entry can also be a factory function that Harbor calls with the run config. Harbor puts the model selected with --model in configurable.model, so the factory below stays model-agnostic and hands whatever you pass on the command line straight to create_deep_agent.

code

from deepagents import create_deep_agent
from deepagents.backends import LocalShellBackend
 
 
def make_graph(config):
    return create_deep_agent(
        model=config["configurable"]["model"],
        backend=LocalShellBackend(),
    )

code


Unifying LangSmith sandboxes with Harbor

Running evals in cloud-based sandboxes lets you horizontally scale for much quicker feedback: hundreds of [trials](https://www.harborframework.com/docs/core-concepts#trial) at once instead of one machine churning through them serially. And the sandbox is a constrained execution environment, which is exactly what a long-running agent that touches its environment needs: a clean, isolated place to act without affecting anything outside it.

Every trial runs in its own cloud sandbox. You bring the [LangSmith Sandbox](https://docs.langchain.com/langsmith/sandboxes), selected with -e langsmith, but the environment is pluggable. Harbor supports Daytona, Docker, Modal, and E2B too, all interchangeable behind the same -e flag. Switching providers does not touch your agent, dataset, or verifier.

A trial is the atomic unit of work: one run of your agent on one [task](https://www.harborframework.com/docs/core-concepts#task). Because agents are non-deterministic, you usually run each task more than once. n_attempts is how many times Harbor repeats every task; it averages the scores so a single lucky or unlucky run does not define the result. Your whole [job](https://www.harborframework.com/docs/core-concepts#job) is therefore n_attempts × tasks: every task, run n_attempts times, each repetition its own trial. Harbor orchestrates all of it.

For each trial, Harbor provisions a fresh sandbox and copies in everything that run needs: your agent code, the task (cached on disk, then loaded into the sandbox VM), and whatever starting files the run begins from. It then runs the agent against the instruction, runs the verifier, and records the result. Harbor averages across trials into a single job result with the metrics you care about.

Unifying LangSmith Observability with Harbor

The harbor-langsmith integration brings first-class support for LangSmith tracing into Harbor, plus logging to [datasets](https://www.harborframework.com/docs/core-concepts#dataset) and experiments.

Enable it with a single flag, --plugin langsmith. Harbor then records every job to LangSmith: it syncs the dataset, creates an experiment, and logs a run per trial with the verifier's reward as feedback. If the agent under test supports LangSmith tracing, those traces attach directly to the experiment, so you get the full step-by-step trajectory alongside the score. If it does not trace, you still get the dataset, experiment, results, and feedback.

Under Datasets & Experiments you can view all of the active datasets in use.

![](https://cdn.prod.website-files.com/65c81e88c254bb0f97633a71/6a42c2a4d13c54e4096c8695_Screenshot%202026-06-22%20at%2010.58.47%E2%80%AFAM.png)

An experiment is an entire run on a given dataset. To view the specific experiments and their respective scores and statistics for a given dataset, click into it.

![](https://cdn.prod.website-files.com/65c81e88c254bb0f97633a71/6a42c27efbcc34afc06153ad_Screenshot%202026-06-22%20at%202.38.08%E2%80%AFPM.png)

We believe integrating traces into evals lets you further refine your evals, and in turn better understand and improve your agents. The score tells you whether a trial passed; the trace tells you why.

The result: a full eval stack for agents

Put together, this is a complete stack for evaluating agents, where each layer does one job well:

Harbor - the eval harness that orchestrates trials.
Deep Agents - for building the agents under test.
LangSmith sandboxes - the isolated cloud execution environment.
LangSmith - the system of record for datasets, experiments, traces, and scores.

And the part you bring stays small:

Your agent, with or without tracing.
Your dataset, remote from a registry or local on disk.
Your cloud sandbox - LangSmith, with -e langsmith.
Your UI view - --plugin langsmith.

If you have a LangSmith account and a dataset, you can try the whole thing by installing Harbor with the langsmith extra, which brings both the LangSmith sandbox environment and the eval plugin. Then set your LangSmith and model credentials, and turn on tracing so the agent's traces attach to the experiment:

code

pip install "harbor[langsmith]"
export LANGSMITH_API_KEY="<LANGSMITH_API_KEY>"

export LANGSMITH_PROFILE=prod
export LANGSMITH_TRACING=true
export LANGSMITH_PROJECT=harbor-deepagents
export FIREWORKS_API_KEY="<FIREWORKS_API_KEY>"

code

# --agent / --model / --ak: agent
# -d: dataset of tasks
# -e: cloud environment
harbor run \
  --agent langgraph \
  --model fireworks:accounts/fireworks/models/glm-5p2 \
  --ak project_path=./deep-agent --ak graph=deep_agent \
  -d terminal-bench@2.0 \
  -e langsmith \
  --plugin langsmith

code


[Read the Harbor integrations docs](https://docs.langchain.com/langsmith/harbor-integrations) to get started. For more on running evals in Harbor, see [Run evals](https://www.harborframework.com/docs/run-jobs/run-evals).



