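LangChain Blog · June 5, 2026, 02:35 · about 12 min read

Fault tolerance in LangGraph: retries, timeouts, and error handlers

#LLM agents #LangGraph #Software engineering #Fault tolerance #LangChain

TL;DR

LangChain Blog explains in detail the three fault tolerance features LangGraph provides (retries, timeouts, and error handling) for making agents more reliable in production.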

AI in-depth analysis · June 5, 2026, 09:13

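Key points

Errors are a fact of life in production

Production surfaces errors that prototypes never show, such as network failures and rate limits. The article argues that a design has to go beyond the happy path and be robust enough to keep running in production.

LangGraph's three fault tolerance features

The article covers three main mechanisms that can be applied per node: `RetryPolicy` (automatic retries), `TimeoutPolicy` (time limits), and `error_handler` (handling of the final failure).

Integrated into the workflow engine

Because these features are defined inside LangGraph's execution control rather than in external logic, compensation logic after exhausted retries and preservation of the failure context need no extra plumbing.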

Key quotes

Writing the happy path is usually the easy part. The error handling boilerplate that makes it survive in production (retries, timeouts, fallbacks) is often longer than the business logic itself.

LangGraph models your agent as a set of discrete steps (nodes), organized as a graph.

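Impact analysis

This article offers a concrete answer to one of the biggest challenges developers face when moving LLM agents from experiment to production: reliability. When a workflow engine such as LangGraph provides fault tolerance natively, developers no longer have to hand-write complex error handling logic and can build more robust, sustainable agent systems.

Editor's comment

Error handling is unavoidable when building production-grade AI agents, and the LangGraph features described here can be applied in practice right away.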

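In the real world, agents hit errors that prototypes never see: network failures, tool call errors, LLM rate limits.

Imagine a task that has been running for hours or days and hits an unrecoverable error halfway through. What do you do? Abandon the run and start over from scratch? That is not a sustainable way to run agents in production.

Writing the happy path is usually the easy part. The error handling boilerplate that makes it survive in production (retries, timeouts, fallbacks) is often longer than the business logic itself.

LangGraph models your agent as a set of discrete steps (nodes), organized as a graph. For a typical agent, that is a node that calls the model, a node that runs any tool calls the model returns, and any deterministic logic you want to wrap around that loop. Because LangGraph controls execution, it is also where you handle what happens when any of those steps fails.

This post walks through the three primitives LangGraph gives you for fault tolerance, how they compose, and why having them inside the workflow engine matters once you start thinking about compensation logic.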

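The three primitives are:

RetryPolicy: automatic retries with backoff/jitter for transient errors.
TimeoutPolicy: a wall-clock or progress-based cap on a node attempt.
error_handler: a node that runs after retries are exhausted, with the failure context attached.

In LangGraph, you define your agent by adding nodes and edges to a StateGraph. All three primitives attach directly to a node via add_node, so your fault tolerance config lives right next to the logic it protects. (If you want to configure defaults once, see set_node_defaults.)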

    from langgraph.graph import StateGraph
    from langgraph.types import RetryPolicy, TimeoutPolicy
    from langgraph.errors import NodeError

    (
        StateGraph(State)
        .add_node(
            "call_llm",
            call_llm,
            retry_policy=RetryPolicy(max_attempts=4, backoff_factor=2.0),
            timeout=TimeoutPolicy(run_timeout=30, idle_timeout=5),
            error_handler=handle_model_failure,
        )
        ...
    )

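Starting from retries

Transient failures are the most common kind of failure in any non-trivial graph: an LLM provider returns a 5xx, a vector store hits a connection reset, a downstream HTTP service is briefly unavailable. Every one of these is fundamentally a "try again in a moment and it will probably work" kind of error.

Without first-class support you end up writing the same wrapper inside every node: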

    def call_llm(state):
        # ~25 lines of "retry with backoff, but only on 5xx,
        # don't retry on 4xx, log each attempt, sleep with jitter"
        ...

LangGraph's RetryPolicy removes that boilerplate. It applies *per node attempt*, with exponential backoff, optional jitter, and a configurable predicate for which exceptions count as retryable:

    from langgraph.types import RetryPolicy

    policy = RetryPolicy(
        initial_interval=0.5,
        backoff_factor=2.0,
        max_interval=128.0,
        max_attempts=3,
        jitter=True,
        retry_on=(ConnectionError, TimeoutError),  # or a callable
    )
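
To make the numbers in this policy concrete, the sketch below computes the wait between attempts under a plain exponential-backoff rule. It is an illustration only: `backoff_delays` is a hypothetical helper, not part of LangGraph. It assumes the wait before retry k is initial_interval * backoff_factor**(k-1), capped at max_interval, and it leaves jitter out so the result is deterministic. LangGraph's exact formula may differ.

```python
def backoff_delays(
    initial_interval: float,
    backoff_factor: float,
    max_interval: float,
    max_attempts: int,
) -> list[float]:
    """Return the assumed wait (in seconds) before each retry.

    max_attempts is assumed to count the first try, so there are
    max_attempts - 1 waits in total.
    """
    delays = []
    for retry in range(max_attempts - 1):
        delay = initial_interval * backoff_factor**retry
        delays.append(min(delay, max_interval))
    return delays


# Same values as the policy above: three attempts means two waits.
print(backoff_delays(0.5, 2.0, 128.0, 3))
# A lower cap and more attempts show max_interval taking effect.
print(backoff_delays(0.5, 2.0, 4.0, 6))
```

With jitter enabled, a random offset would be added to each wait so that many clients failing at once do not all retry at the same instant.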

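The default retry_on is intentionally conservative: it retries ConnectionError, 5xx responses from httpx/requests, and a few generic transient categories.

By default it does not retry ValueError, TypeError, RuntimeError, and similar exceptions, which are almost always programming bugs.

The retry_on spec can be either a collection of error types or a callable that inspects an error at runtime and decides whether it matches the retry criteria.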
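
As a sketch of the callable form, the predicate below retries on connection problems and on HTTP-style errors with a 5xx status, and refuses everything else. `HTTPStatusError` is a stand-in exception defined here purely for illustration (real HTTP clients expose the status code in their own ways), and `should_retry` is a hypothetical name.

```python
class HTTPStatusError(Exception):
    """Stand-in for an HTTP client error that carries a status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def should_retry(exc: Exception) -> bool:
    """Return True only for failures that are likely to be transient."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    if isinstance(exc, HTTPStatusError):
        # 5xx: the server failed. 4xx: the request itself is wrong,
        # so sending it again would fail the same way.
        return 500 <= exc.status_code < 600
    return False


for exc in (ConnectionError(), HTTPStatusError(503), HTTPStatusError(404), ValueError()):
    print(type(exc).__name__, should_retry(exc))
```

Going by the description above, a function like this would be passed as retry_on=should_retry in place of the tuple of exception types.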

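Timeout: a special case of "transient failure"

A timeout is really just "the attempt is treated as a transient failure because it has been hanging too long." Without an explicit timeout, a stuck HTTP call or a frozen subprocess can hang a graph run indefinitely.

LangGraph's TimeoutPolicy supports two types of timeouts: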

    from langgraph.types import TimeoutPolicy

    TimeoutPolicy(
        run_timeout=30.0,   # hard wall-clock cap on a single attempt
        idle_timeout=5.0,   # max time without observable progress
        refresh_on="auto",  # or "heartbeat"
    )

run_timeout is a hard wall-clock cap on a single attempt. It is useful when you never want to wait more than N seconds for a node.
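idle_timeout resets on every "progress" signal: channel writes, streamed chunks (emitted automatically by LangChain LLM models), child task events, and LangChain callback events. Long-running but actively streaming work does not trip it, but a truly hung call does. Internally, every one of these signals is treated as a "heartbeat". If you control the work and emit your own progress beats, you can switch to refresh_on="heartbeat" and call runtime.heartbeat() explicitly from inside the node.

When a timeout fires, the node attempt is cancelled and a NodeTimeoutError is raised.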
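
The difference between the two limits is easiest to see in a small model. The sketch below is not LangGraph's implementation. `AttemptClock` is a hypothetical class that takes timestamps as arguments, so the example is deterministic, and it only models the bookkeeping: the run window starts once per attempt, while the idle window restarts on every progress signal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttemptClock:
    """Tracks one node attempt against a run cap and an idle cap."""

    run_timeout: float
    idle_timeout: float
    started_at: float = 0.0
    last_progress_at: float = 0.0

    def start(self, now: float) -> None:
        self.started_at = now
        self.last_progress_at = now

    def heartbeat(self, now: float) -> None:
        # A progress signal resets the idle window, never the run window.
        self.last_progress_at = now

    def expired(self, now: float) -> Optional[str]:
        """Return which limit was hit at time `now`, or None."""
        if now - self.started_at >= self.run_timeout:
            return "run_timeout"
        if now - self.last_progress_at >= self.idle_timeout:
            return "idle_timeout"
        return None


clock = AttemptClock(run_timeout=30.0, idle_timeout=5.0)
clock.start(now=0.0)
clock.heartbeat(now=4.0)        # a streamed chunk arrives
print(clock.expired(now=8.0))   # 4 s since the last progress: still fine
print(clock.expired(now=9.5))   # 5.5 s of silence: the idle cap is hit
clock.heartbeat(now=29.0)
print(clock.expired(now=31.0))  # progress is recent, but the run cap is hit
```

The last call shows why both limits are useful together: steady heartbeats keep the idle timer quiet, yet the attempt is still bounded in total duration.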

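Error handlers: when retries aren't enough

Retries handle "this will probably work in 5 seconds." They do not handle the case where retries are exhausted and you still need to run some logic. For example: "we've tried six times, the payment provider is still down, and now we need to: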

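mark the order as failed and notify the customer, or
roll back the partial side effects we already committed, or
publish a payment.failed event for the rest of the system to react to."

Error handlers that run after retry exhaustion have many use cases: cleanup, alerting, dead-letter writes, fallback paths to a cheaper model, or simply routing to a "we apologize" message.

In LangGraph, this is now supported naturally (docs: Error handling):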

    import logging

    from langgraph.errors import NodeError

    log = logging.getLogger(__name__)

    def on_call_llm_failed(state: State, error: NodeError) -> State:
        # Runs only after call_llm has exhausted its retries.
        log.error("call_llm failed after retries: %s", error.error)
        return {"status": "llm_unavailable"}

    (
        StateGraph(State)
        .add_node(
            "call_llm",
            call_llm,
            retry_policy=RetryPolicy(max_attempts=4),
            error_handler=on_call_llm_failed,
        )
    )

A few things to notice about how this is wired:

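It only fires after retries are exhausted. This is the property that makes the feature actually useful. If you wanted to run on every exception, you would just write a try/except inside the node.

The failure context is injected. The handler takes a NodeError parameter, which gives it the failing node's name and the exception (error.node, error.error).

The transition is atomic. When the original node fails, its ERROR write is committed to the checkpoint, and the handler task is scheduled as a new task in the same step. This matters for critical processes where you cannot go back to the regular steps once you have entered the error-handler steps. If the host process crashes mid-handler, the next resume of the run re-schedules the handler, not the original failing node.

The error handler runs in the same execution cycle.* When a node fails, the error handler is scheduled immediately, alongside any other nodes that were already running in that step. It does not wait for them to finish, and they do not wait for it.

* In LangGraph, an "execution cycle" is called a "superstep", if you are familiar with the runtime.

You can set a default handler for every node. set_node_defaults applies to every regular node that does not specify its own handler, but a per-node error_handler= always wins.

You cannot set another error handler for an error handler, so you do not get infinite-recursion behavior.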
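
The first two properties (the handler fires only after exhaustion, and it receives the failure context) can be modelled in a few lines of plain Python. This is a simplified sketch, not how LangGraph schedules tasks: `run_node` and `NodeFailure` are hypothetical names, there is no backoff or checkpointing, and every exception is treated as retryable.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

State = dict[str, Any]


@dataclass
class NodeFailure:
    """Illustrative stand-in for the failure context (node name + exception)."""

    node: str
    error: Exception


def run_node(
    name: str,
    node: Callable[[State], State],
    state: State,
    max_attempts: int,
    error_handler: Callable[[State, NodeFailure], State],
) -> State:
    """Try the node up to max_attempts times, then fall back to the handler."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return node(state)
        except Exception as exc:  # a real policy would filter with retry_on
            last_error = exc
    assert last_error is not None
    # Reached only when every attempt failed.
    return error_handler(state, NodeFailure(node=name, error=last_error))


calls = {"flaky": 0}


def flaky(state: State) -> State:
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise ConnectionError("temporary outage")
    return {"status": "ok"}


def always_down(state: State) -> State:
    raise ConnectionError("provider unreachable")


def on_failure(state: State, failure: NodeFailure) -> State:
    return {"status": "unavailable", "failed_node": failure.node}


print(run_node("call_llm", flaky, {}, 4, on_failure))        # recovers on try 3
print(run_node("call_llm", always_down, {}, 4, on_failure))  # handler takes over
```

The flaky node never reaches the handler because a retry succeeds. Only the node that fails on every attempt does.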

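Putting it together: fault-tolerant flight booking

The three primitives above compose naturally, but their real power shows up in workflows that involve side effects: operations that change real-world state. Consider a flight booking: it is not one action but a sequence. Reserve a seat, process payment, issue a ticket. Each step talks to an external system. Any of them can fail.

The naive approach (just retry the whole thing) breaks down fast. If the seat reservation went through but the payment or ticket issuance fails, the reservation is stuck in a bad state. What you actually need is to retry each step individually, and if a step exhausts its retries, undo only the steps that already ran (including the failed one, because its outcome is unknown).

This is called the SAGA pattern, and it is a standard way to handle failures in distributed systems where you cannot wrap everything in a single database transaction.

Here is what that looks like in LangGraph: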

    from typing import TypedDict, Annotated, Literal
    import operator

    from langgraph.graph import StateGraph, START, END
    from langgraph.types import Command, RetryPolicy
    from langgraph.errors import NodeError

    class BookingState(TypedDict, total=False):
        booking_id: str
        passenger: str
        flight: str
        seat: str          # assigned once the seat is reserved
        amount: int        # fare to charge, in minor units
        payment_ref: str   # set once payment is captured
        ticket_no: str     # set once the ticket is issued
        completed: Annotated[list[str], operator.add]  # accumulates per-step

    def to_compensate(state: BookingState, error: NodeError) -> Command:
        """Route any retry-exhausted step to the compensation node."""
        return Command(
            # Include the failed node: its outcome is unknown, so it may
            # have partially run and must be compensated as well.
            update={"completed": [f"FAILED:{error.node}"]},
            goto="compensate",
        )

    def reserve_seat(state) -> BookingState:
        # Call the seat-inventory service to hold a seat for this itinerary.
        ...
        return {"seat": "12A", "completed": ["reserve_seat"]}

    def process_payment(state) -> BookingState:
        # Charge the fare via the payment processor while the seat is held.
        ...
        return {"payment_ref": "pay_abc123", "completed": ["process_payment"]}

    def issue_ticket(state) -> BookingState:
        # Confirm the seat and issue the ticket once payment is captured.
        ...
        return {"ticket_no": "TKT-7788", "completed": ["issue_ticket"]}

    def compensate(state) -> Command[Literal["__end__"]]:
        # Inspect state["completed"] and undo only the steps that actually ran,
        # in reverse order, to keep the booking all-or-nothing.
        # Strip the "FAILED:" marker so the failed step is undone too.
        ran = {step.removeprefix("FAILED:") for step in state["completed"]}
        if "issue_ticket" in ran:
            void_ticket(state)
        if "process_payment" in ran:
            refund_payment(state)
        if "reserve_seat" in ran:
            release_seat(state)
        return Command(goto=END)

    # RETRYABLE (a RetryPolicy), checkpointer, and the undo helpers
    # (void_ticket, refund_payment, release_seat) are defined elsewhere.
    graph = (
        StateGraph(BookingState)
        # All steps share the same retry policy and the same fallback target;
        # per-step overrides are still possible.
        .set_node_defaults(retry_policy=RETRYABLE, error_handler=to_compensate)
        .add_node("reserve_seat", reserve_seat)
        .add_node("process_payment", process_payment)
        .add_node("issue_ticket", issue_ticket)
        .add_node("compensate", compensate)
        .add_edge(START, "reserve_seat")
        .add_edge("reserve_seat", "process_payment")
        .add_edge("process_payment", "issue_ticket")
        .add_edge("issue_ticket", END)
        .compile(checkpointer=checkpointer)
    )

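What this gives you:

Per-step backoff retries with the configured policy
An atomic transition into compensate once any step's retries are exhausted
Persistent state tracking which steps actually completed, so compensate only undoes what needs to revert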
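
The compensation order is the part of the pattern that is easiest to get wrong, so here is a framework-free sketch of it. `run_saga`, the `Step` tuple, and the no-op actions are all hypothetical and stand in for real service calls. Retries and persistence are left out to keep the focus on the undo logic.

```python
from typing import Callable

# (name, action, undo): each step pairs an action with its compensation.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, undo attempted steps in reverse.

    Returns a log of what happened so the order is easy to inspect.
    """
    log: list[str] = []
    attempted: list[Step] = []
    for step in steps:
        name, action, _ = step
        # Record the step before running it: if it fails midway, its
        # outcome is unknown, so it must be compensated too.
        attempted.append(step)
        try:
            action()
            log.append(f"done:{name}")
        except Exception:
            log.append(f"failed:{name}")
            for undo_name, _, undo in reversed(attempted):
                undo()
                log.append(f"undone:{undo_name}")
            break
    return log


def fail() -> None:
    raise RuntimeError("ticketing service is down")


def noop() -> None:
    pass


booking = [
    ("reserve_seat", noop, noop),
    ("process_payment", noop, noop),
    ("issue_ticket", fail, noop),
]
print(run_saga(booking))
```

Because the failed step is compensated as well, each undo action has to be safe to call when its step did nothing, for example by treating "no ticket to void" as success.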

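Final words

Agents are taking on more autonomy, and with that comes more power to act. They are booking flights, filing tickets, executing payments, calling internal services. The actions they take are increasingly high-consequence and difficult to reverse.

That raises the bar for reliability. A 1% transient failure rate is a minor inconvenience in a demo. In a production agent with dozens of steps and real-world consequences, it compounds quickly.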
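
A back-of-the-envelope calculation shows how quickly a 1% per-step failure rate compounds. The sketch assumes that steps fail independently and that each retry is an independent draw, which correlated outages in real systems often violate. `run_failure_probability` is a hypothetical helper.

```python
def run_failure_probability(
    step_failure_rate: float, steps: int, attempts: int = 1
) -> float:
    """Probability that at least one step fails on all of its attempts."""
    step_fails = step_failure_rate**attempts
    return 1.0 - (1.0 - step_fails) ** steps


for steps in (1, 10, 50):
    print(steps, round(run_failure_probability(0.01, steps), 3))
print("50 steps, 3 attempts each:", run_failure_probability(0.01, 50, attempts=3))
```

Under these assumptions, a 50-step run with no retries fails roughly 39.5% of the time, while three attempts per step bring that down to about 0.005%.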

RetryPolicy, TimeoutPolicy, and error_handler are built into LangGraph so that it is easy to build an agent that is resilient to all sorts of errors. All you have to do is define policies that make sense for your use case, and the LangGraph agent runtime handles the rest.

Get started: configure per-node retries, timeouts, and error handlers with the official Fault tolerance docs.

この記事をシェア

LangChain Blog2026年7月21日 02:00

LangChain、評価ベンチ「IssueBench」発表

LangChain Blog重要度42026年7月21日 00:46

LangChain、ガバナンス型エージェント枠組みを発表

LangChain Blog2026年7月22日 03:27

Apollo が Deep Agents と LangSmith で GTM AI を実現

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

LangChain Blog·2026年6月5日 02:35·約12分

LangGraph の耐障害性：リトライ、タイムアウト、エラーハンドラー

#LLM エージェント #LangGraph #ソフトウェア工学 #フォールトトレランス #LangChain

TL;DR

AI深層分析2026年6月5日 09:13

重要/ 5段階

深度40%

キーポイント

実運用におけるエラーの現実性

LangGraph の 3 つのフォールトトレランス機能

ワークフローエンジン内での統合的実装

重要な引用

Writing the happy path is usually the easy part. The error handling boilerplate that makes it survive in production (retries, timeouts, fallbacks) is often longer than the business logic itself.

LangGraph models your agent as a set of discrete steps (nodes), organized as a graph.

影響分析・編集コメントを表示

影響分析

編集コメント

3 つのプリミティブとは以下の通りです：

RetryPolicy: 一時的なエラーに対するバックオフ/ジッターを伴う自動リトライ。
TimeoutPolicy: ノード試行に対する壁時計ベースまたは進捗ベースのカットオフ。
error_handler: リトライが尽きた後に実行され、失敗コンテキストが付与されるノード。

from langgraph.graph import StateGraph

from langgraph.types import RetryPolicy, TimeoutPolicy

from langgraph.errors import NodeError

(

StateGraph(State)

.add_node(

"call_llm",

call_llm,

retry_policy=RetryPolicy(max_attempts=4, backoff_factor=2.0),

timeout=TimeoutPolicy(run_timeout=30, idle_timeout=5),

error_handler=handle_model_failure,

)

...

)

リトライから始める

第一級サポートがない場合、結局すべてのノード内で同じラッパーを書き続けることになります：

def call_llm(state):

# ~25 lines of "retry with backoff, but only on 5xx,

# don't retry on 4xx, log each attempt, sleep with jitter"

...

from langgraph.types import RetryPolicy

policy = RetryPolicy(

initial_interval=0.5,

backoff_factor=2.0,

max_interval=128.0,

max_attempts=3,

jitter=True,

retry_on=(ConnectionError, TimeoutError), # または呼び出し可能な関数

)

デフォルトでは、ValueError、TypeError、RuntimeError などには再試行しません。これらはほぼ常にプログラミング上のバグであるためです。

Timeout: 「一時的な失敗」の特殊ケース

LangGraph の TimeoutPolicy は、2 つ種類のタイムアウトをサポートしています：‍

from langgraph.types import TimeoutPolicy

TimeoutPolicy(

run_timeout=30.0, # 単一試行に対する厳格な壁時計による上限

idle_timeout=5.0, # 観測可能な進捗がない場合の最大時間

refresh_on="auto", # または "heartbeat"

)

‍run_timeout は、単一の試行に対する厳格な壁時計による上限です。ノードに対して N 秒以上待機したくない場合に有用です。
idle_timeout は、すべての「進捗」シグナルでリセットされます：チャネルへの書き込み、ストリーミングされたチャンク（LangChain LLM モデルから自動的に発行される）、子タスクイベント、LangChain コールバックイベント。長時間実行されつつも積極的にストリーミングされている作業はトリガーされませんが、完全にフリーズした呼び出しはトリガーします。内部では、すべてのシグナルに対して「ハートビート」に依存しています。もしあなたが作業を制御し、独自の進捗ビートを発行する場合は、refresh_on="heartbeat" に切り替えて、ノードの内部から明示的に runtime.heartbeat() を呼び出すことができます。

タイムアウトが発生すると、ノード試行はキャンセルされ、NodeTimeoutError がスローされます。

エラーハンドラ：リトライでは不十分な場合

{"translation": "翻訳全文"}

注文のステータスを失敗としてマークし、顧客に通知するか、
すでにコミットした部分的な副作用をロールバックするか、
システムの残りの部分が反応できるように「支払い失敗」イベントを公開する」といった対応が可能です。

LangGraph では、これは自然にサポートされています（ドキュメント：エラーハンドリング）:

from langgraph.errors import NodeError

def on_call_llm_failed(state: State, error: NodeError) -> State:

log.error("call_llm failed after retries:%s", error.error)

return {"status": "llm_unavailable"}

StateGraph(State)

.add_node(

"call_llm",

call_llm,

retry_policy=RetryPolicy(max_attempts=4),

error_handler=on_call_llm_failed,

)

‍この実装の wiring（接続）について、いくつか注目すべき点があります。

*LangGraph では、「実行サイクル」は「スーパーステップ」と呼ばれます（ランタイムに精通している方ならご存知でしょう）。

エラーハンドラに対して別のエラーハンドラを設定することはできません。 これにより、無限再帰の動作は防止されます。

組み合わせる：耐障害性の高いフライト予約

LangGraph では以下のように実装されます：

from typing import TypedDict, Annotated, Literal

import operator

from langgraph.graph import StateGraph, START, END

from langgraph.types import Command, RetryPolicy

from langgraph.errors import NodeError

class BookingState(TypedDict, total=False):

booking_id: str

passenger: str

flight: str

seat: str # assigned once the seat is reserved

amount: int # fare to charge, in minor units

payment_ref: str # set once payment is captured

ticket_no: str # set once the ticket is issued

completed: Annotated[list[str], operator.add] # accumulates per-step

def to_compensate(state: BookingState, error: NodeError) -> Command:

"""Route any retry-exhausted step to the compensation node."""

return Command(

"""include the failed node"""

update={"completed": [f"FAILED:{error.node}"]},

goto="compensate",

)

def reserve_seat(state) -> BookingState:

# Call the seat-inventory service to hold a seat for this itinerary.

...

return {"seat": "12A", "completed": ["reserve_seat"]}

def process_payment(state) -> BookingState:

# Charge the fare via the payment processor while the seat is held.

...

return {"payment_ref": "pay_abc123", "completed": ["process_payment"]}

def issue_ticket(state) -> BookingState:

# Confirm the seat and issue the ticket once payment is captured.

...

return {"ticket_no": "TKT-7788", "completed": ["issue_ticket"]}

def compensate(state) -> Command[Literal["__end__"]]:

# Inspect state["completed"] and undo only the steps that actually ran,

# in reverse order, to keep the booking all-or-nothing.

if "issue_ticket" in state["completed"]:

void_ticket(state)

if "process_payment" in state["completed"]:

refund_payment(state)

if "reserve_seat" in state["completed"]:

release_seat(state)

return Command(goto=END)

graph = (

StateGraph(BookingState)

# All steps share the same retry policy and the same fallback target;

# per-step overrides are still possible.

.set_node_defaults(retry_policy=RETRYABLE, error_handler=to_compensate)

.add_node("reserve_seat", reserve_seat)

.add_node("process_payment", process_payment)

.add_node("issue_ticket", issue_ticket)

.add_node("compensate", compensate)

.add_edge(START, "reserve_seat")

.add_edge("reserve_seat", "process_payment")

.add_edge("process_payment", "issue_ticket")

.add_edge("issue_ticket", END)

.compile(checkpointer=checkpointer)

)

これにより得られるもの:

設定されたポリシーに基づくステップごとのバックオフ再試行
いずれかのステップの再試行が尽きた場合、補償への原子的な遷移
実際に完了したステップを追跡する永続的な状態管理。これにより、補償は戻す必要がある部分のみを元に戻します

最後の言葉

始め方: 公式のフォールトトレランスドキュメントで、ノードごとの再試行、タイムアウト、エラーハンドラーを設定してください。

原文を表示

In the real world, agents hit errors that prototypes never see: network failures, tool call errors, LLM rate limits.

Writing the happy path is usually the easy part. The error handling boilerplate that makes it survive in production (retries, timeouts, fallbacks) is often longer than the business logic itself.

The three primitives are:

RetryPolicy: automatic retries with backoff/jitter for transient errors.
TimeoutPolicy: a wall-clock or progress-based cap on a node attempt.
error_handler: a node that runs after retries are exhausted, with the failure context attached.

code

from langgraph.graph import StateGraph

from langgraph.types import RetryPolicy, TimeoutPolicy

from langgraph.errors import NodeError

(

StateGraph(State)

.add_node(

"call_llm",

call_llm,

retry_policy=RetryPolicy(max_attempts=4, backoff_factor=2.0),

timeout=TimeoutPolicy(run_timeout=30, idle_timeout=5),

error_handler=handle_model_failure,

)

...

)

code

Starting from retries

Without first-class support you end up writing the same wrapper inside every node:

code

def call_llm(state):

~25 lines of "retry with backoff, but only on 5xx,

don't retry on 4xx, log each attempt, sleep with jitter"

...

code

LangGraph’s RetryPolicy removes that boilerplate. It applies *per node attempt*, with exponential backoff, optional jitter, and a configurable predicate for which exceptions count as retryable:

code

from langgraph.types import RetryPolicy

policy = RetryPolicy(

initial_interval=0.5,

backoff_factor=2.0,

max_interval=128.0,

max_attempts=3,

jitter=True,

retry_on=(ConnectionError, TimeoutError), # or a callable

)

code

The default retry_on is intentionally conservative: it retries ConnectionError, 5xx responses from httpx/requests, and a few generic transient categories.

By default it does not retry ValueError, TypeError, RuntimeError, etc., which are almost always programming bugs.

The retry_on spec can be a collection of error types or a callable that checks an error at runtime to see if it matches retry criteria.

Timeout: a special case of “transient failure”

LangGraph’s TimeoutPolicy supports two types of timeouts:‍

code

from langgraph.types import TimeoutPolicy

TimeoutPolicy(

run_timeout=30.0, # hard wall-clock cap on a single attempt

idle_timeout=5.0, # max time without observable progress

refresh_on="auto", # or "heartbeat"

)

code

‍run_timeout is a hard wall-clock cap on a single attempt. Useful when you simply do not care to ever wait more than N seconds for a node.

idle_timeout resets on every “progress” signal: channel writes, streamed chunks (automatically emitted from LangChain LLM models), child task events, LangChain callback events. Long-running but actively-streaming work doesn’t trip it, but a truly hung call does.Internally, it relies on “heartbeat” for every signal. If you control the work and emit your own progress beats, you can switch to refresh_on="heartbeat" and explicitly call runtime.heartbeat() from inside the node.

When a timeout fires, the node attempt is cancelled and a NodeTimeoutError is raised.

Error handlers: when retries aren’t enough

mark the order as failed and notify the customer, or
roll back the partial side effects we already committed, or
publish a payment.failed event for the rest of the system to react to.”

In LangGraph, this is now supported naturally (docs: Error handling):‍

code

from langgraph.errors import NodeError

def on_call_llm_failed(state: State, error: NodeError) -> State:

log.error("call_llm failed after retries:%s", error.error)

return {"status": "llm_unavailable"}

StateGraph(State)

.add_node(

"call_llm",

call_llm,

retry_policy=RetryPolicy(max_attempts=4),

error_handler=on_call_llm_failed,

)

code

‍A few things to notice about how this is wired:

The failure context is injected. The handler can use parameter as NodeError to get the failing node’s name plus the exception (error.node, error.error).

*in LangGraph, we call an “execution cycle” a “superstep”, if you’re familiar with the runtime.

You can set a default handler for every node. set_node_defaults applies to every regular node that doesn’t specify its own, but a per-node error_handler= always wins.

You can’t set another error handler for an error handler. So you don’t get infinite-recursion behavior.

Putting it together: fault tolerant flight booking

This is called the SAGA pattern, and it's a standard way to handle failures in distributed systems where you can't wrap everything in a single database transaction.

Here's what that looks like in LangGraph:

code

import operator
from typing import Annotated, Literal, TypedDict

from langgraph.errors import NodeError
from langgraph.graph import StateGraph, START, END
from langgraph.types import Command, RetryPolicy


class BookingState(TypedDict, total=False):
    booking_id: str
    passenger: str
    flight: str
    seat: str  # assigned once the seat is reserved
    amount: int  # fare to charge, in minor units
    payment_ref: str  # set once payment is captured
    ticket_no: str  # set once the ticket is issued
    completed: Annotated[list[str], operator.add]  # accumulates per-step


def to_compensate(state: BookingState, error: NodeError) -> Command:
    """Route any retry-exhausted step to the compensation node."""
    return Command(
        # include the failed node
        update={"completed": [f"FAILED:{error.node}"]},
        goto="compensate",
    )


def reserve_seat(state) -> BookingState:
    # Call the seat-inventory service to hold a seat for this itinerary.
    ...
    return {"seat": "12A", "completed": ["reserve_seat"]}


def process_payment(state) -> BookingState:
    # Charge the fare via the payment processor while the seat is held.
    ...
    return {"payment_ref": "pay_abc123", "completed": ["process_payment"]}


def issue_ticket(state) -> BookingState:
    # Confirm the seat and issue the ticket once payment is captured.
    ...
    return {"ticket_no": "TKT-7788", "completed": ["issue_ticket"]}


def compensate(state) -> Command[Literal["__end__"]]:
    # Inspect `state["completed"]` and undo only the steps that actually ran,
    # in reverse order, to keep the booking all-or-nothing.
    # void_ticket, refund_payment, and release_seat are assumed to be defined elsewhere.
    if "issue_ticket" in state["completed"]:
        void_ticket(state)
    if "process_payment" in state["completed"]:
        refund_payment(state)
    if "reserve_seat" in state["completed"]:
        release_seat(state)
    return Command(goto=END)


# RETRYABLE (a RetryPolicy) and checkpointer are assumed to be defined elsewhere.
graph = (
    StateGraph(BookingState)
    # All steps share the same retry policy and the same fallback target;
    # per-step overrides are still possible.
    .set_node_defaults(retry_policy=RETRYABLE, error_handler=to_compensate)
    .add_node("reserve_seat", reserve_seat)
    .add_node("process_payment", process_payment)
    .add_node("issue_ticket", issue_ticket)
    .add_node("compensate", compensate)
    .add_edge(START, "reserve_seat")
    .add_edge("reserve_seat", "process_payment")
    .add_edge("process_payment", "issue_ticket")
    .add_edge("issue_ticket", END)
    .compile(checkpointer=checkpointer)
)

code

What this gives you:

Per-step backoff retries with the configured policy
An atomic transition into compensate once any step's retries are exhausted
Persistent state tracking which steps actually completed, so compensate only undoes what needs to revert
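For readers who want to see the compensation idea without any framework, the following is a minimal plain-Python sketch of the same pattern. `run_saga` and the in-memory step functions are hypothetical stand-ins for real service calls, and the sketch leaves out retries and persistence entirely; it only shows that completed steps are undone in reverse order when a later step fails.

```python
from typing import Callable

# Each step is (name, action, undo); both callables mutate the booking dict.
Step = tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: list[Step], state: dict) -> list[str]:
    """Run steps in order; on failure, undo the completed ones in reverse."""
    completed: list[tuple[str, Callable[[dict], None]]] = []
    events: list[str] = []
    for name, action, undo in steps:
        try:
            action(state)
        except Exception:
            events.append(f"FAILED:{name}")
            # Only steps that actually ran are compensated, newest first.
            for done_name, done_undo in reversed(completed):
                done_undo(state)
                events.append(f"undo:{done_name}")
            return events
        completed.append((name, undo))
        events.append(name)
    return events


def reserve_seat(state: dict) -> None:
    state["seat"] = "12A"


def release_seat(state: dict) -> None:
    state.pop("seat", None)


def process_payment(state: dict) -> None:
    state["payment_ref"] = "pay_abc123"


def refund_payment(state: dict) -> None:
    state.pop("payment_ref", None)


def issue_ticket(state: dict) -> None:
    # Simulated outage in the final step.
    raise RuntimeError("ticketing service down")


def void_ticket(state: dict) -> None:
    state.pop("ticket_no", None)


booking = {"booking_id": "b-1"}
events = run_saga(
    [
        ("reserve_seat", reserve_seat, release_seat),
        ("process_payment", process_payment, refund_payment),
        ("issue_ticket", issue_ticket, void_ticket),
    ],
    booking,
)
```

In this run the ticketing step fails, so the payment is refunded first and the seat released second, leaving the booking with no partial side effects.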

Final words

That raises the bar for reliability. A 1% transient failure rate is a minor inconvenience in a demo. In a production agent with dozens of steps and real-world consequences, it compounds quickly.
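A quick back-of-the-envelope calculation shows how fast this compounds. The sketch below assumes each attempt fails independently with the same probability, which real systems rarely satisfy exactly; the step count of 30 and the four attempts per step are illustrative numbers, not figures from a benchmark.

```python
def run_failure_probability(
    per_attempt_failure: float, steps: int, max_attempts: int = 1
) -> float:
    """Probability that at least one step fails after exhausting its attempts.

    Assumes independent attempts with identical failure probability.
    """
    # A step fails only if every one of its attempts fails.
    per_step_failure = per_attempt_failure ** max_attempts
    # The run succeeds only if every step succeeds.
    return 1.0 - (1.0 - per_step_failure) ** steps


# 1% transient failure rate, 30 steps, no retries.
no_retries = run_failure_probability(0.01, steps=30)
# Same run, but each step may be attempted up to 4 times.
with_retries = run_failure_probability(0.01, steps=30, max_attempts=4)
```

Under these assumptions, a 1% per-step failure rate means roughly 26% of 30-step runs hit at least one failure without retries, while four attempts per step push that below one in a million.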

Get started: configure per-node retries, timeouts, and error handlers with the official Fault tolerance docs.