Anthropic Engineering·2025年3月20日 09:00·約5分

「考える」ツール：Claudeが複雑なツール使用状況で立ち止まって思考できるようにする

#LLM #エージェントシステム #ツール使用 #意思決定 #Anthropic #Claude

TL;DR

AnthropicはClaudeの複雑なツール使用シナリオ向けに「think」ツールを紹介し、拡張思考機能との使い分けを明確にし、実装ガイダンスを提供している。

AI深層分析2026年3月1日 11:45

注目/ 5段階

深度40%

キーポイント

「think」ツールの定義と目的

Claudeが応答生成中に一時停止して思考する専用スペースを提供するツールで、長いツール呼び出しチェーンや多段階会話で外部情報を処理する際に特に有効。

拡張思考機能との明確な区別

拡張思考は応答生成前の計画立案に焦点を当てるのに対し、「think」ツールは応答生成中の情報不足時の思考に特化しており、用途が異なる。

推奨使用シナリオの明確化

拡張思考は単純なツール使用やコーディング・数学に適し、「think」ツールは複雑なツール連鎖、ポリシー重視環境、コストの高い逐次的意思決定に適している。

実装ガイダンスの提供

検証済みベンチマーク結果に基づく実践的な開発者向けガイダンスを提供し、異なるアプリケーションでの実装方法を探求している。

τ-Benchでの評価結果

「think」ツールはClaude 3.7のパフォーマンスを大幅に向上させ、航空会社ドメインではベースライン比54%の相対改善を示した。

最適化されたプロンプトの効果

航空会社ドメインでは「think」ツールに最適化されたプロンプトを組み合わせることで最高のパフォーマンスが達成された。

thinkツールの最適化プロンプトの効果

最適化されたプロンプトと組み合わせたthinkツールは、航空会社ポリシーなどの複雑な領域で特に高い性能を発揮し、他のアプローチを大きく上回った。

影響分析・編集コメントを表示

影響分析

この記事は、大規模言語モデルの実用的なエージェント能力向上に焦点を当てた技術的進歩を示しており、特に複雑なツール使用シナリオでの意思決定精度と信頼性向上に寄与する。開発者コミュニティに対して具体的な実装ガイダンスを提供することで、産業応用の加速が期待される。

編集コメント

技術的な詳細よりも実装ガイダンスに重点を置いた実用的な記事で、既存の拡張思考機能との明確な使い分けを提示している点が開発者にとって価値が高い。

私たちの実験（「think」ツール使用サンプル30件、非使用サンプル144件）では、このツールを含めることの独立した効果が、平均1.6%のパフォーマンス向上をもたらすことが示されました（Welchのt検定: t(38.89) = 6.71, p < .001, d = 1.47）。

「think」ツールを使用すべきタイミング

これらの評価結果に基づき、Claudeが「think」ツールから最も恩恵を受ける特定のシナリオを特定しました:

ツール出力分析: Claudeが行動する前に以前のツール呼び出しの出力を慎重に処理する必要があり、アプローチをバックトラックする必要がある可能性がある場合。
ポリシー重視の環境: Claudeが詳細なガイドラインに従い、コンプライアンスを検証する必要がある場合。
逐次的意思決定: 各アクションが以前のものに基づいて構築され、ミスが高くつく場合（多段階ドメインでよく見られます）。

実装のベストプラクティス

Claudeで「think」ツールを最大限に活用するために、私たちのτ-bench実験に基づいて以下の実装プラクティスを推奨します。

1. ドメイン固有の例を用いた戦略的プロンプティング

最も効果的なアプローチは、τ-bench航空会社ドメインで使用されたような、「think」ツールをいつ、どのように使用するかについて明確な指示を提供することです。特定のユースケースに合わせた例を提供することは、モデルが「think」ツールをどの程度効果的に使用するかを大幅に改善します:

推論プロセスで期待される詳細レベル
複雑な指示を実行可能なステップに分解する方法
一般的なシナリオを処理するための決定木
必要な情報がすべて収集されたかどうかを確認する方法

2. 複雑なガイダンスはシステムプロンプトに配置する

長くかつ/または複雑な場合、「think」ツールに関する指示をシステムプロンプトに含めることは、ツール記述自体に配置するよりも効果的であることがわかりました。このアプローチはより広範なコンテキストを提供し、モデルが思考プロセスを全体的な動作により良く統合するのに役立ちます。

「think」ツールを使用すべきでない場合

「think」ツールは大幅な改善を提供できますが、すべてのツール使用ユースケースに適用できるわけではなく、プロンプトの長さと出力トークンの増加というコストが伴います。具体的には、「think」ツールが以下のユースケースでは改善を提供しないことがわかりました:

並列呼び出し: Claudeがタスクを完了するために単一のツール呼び出しまたは複数の並列呼び出しのみを必要とする場合、「think」を追加しても改善は見込めません。
単純な指示の遵守: Claudeが従う必要がある制約が多くなく、そのデフォルトの動作が十分に良い場合、追加の「think」による利得は見込めません。

始め方

「think」ツールは、Claude実装への簡単な追加であり、わずかなステップで意味のある改善をもたらすことができます:

エージェンシックなツール使用シナリオでテストする: Claudeが現在ポリシーコンプライアンスや長いツール呼び出しチェーンでの複雑な推論に苦労している挑戦的なユースケースから始めます。
ツール定義を追加する: ドメインにカスタマイズされた「think」ツールを実装します。最小限のコードで済みますが、より構造化された推論を可能にします。また、システムプロンプトに、ツールをいつ、どのように使用するかについての指示と、ドメインに関連する例を含めることを検討してください。
監視と改善: Claudeが実際にツールをどのように使用するかを観察し、より効果的な思考パターンを促すようにプロンプトを調整します。

最も良い点は、このツールを追加することによるパフォーマンス結果へのデメリットが最小限であることです。Claudeが使用することを決定しない限り外部動作は変更されず、既存のツールやワークフローを妨げません。

私たちの研究は、「think」ツールが、ポリシー遵守と長いツール呼び出しチェーンでの推論を必要とする複雑なタスクにおいて、Claude 3.7 Sonnetのパフォーマンス<sup>1</sup>を大幅に向上させることができることを実証しました。「think」は万能の解決策ではありませんが、適切なユースケースでは実質的な利点を提供し、実装の複雑さは最小限です。

皆さんが「think」ツールを使用して、Claudeでより能力が高く、信頼性が高く、透明性のあるAIシステムを構築されることを楽しみにしています。

<sup>1</sup> 私たちのτ-Bench結果は「think」ツールによるClaude 3.7 Sonnetの改善に焦点を当てましたが、実験ではClaude 3.5 Sonnet（New）も3.7 Sonnetと同じ構成でパフォーマンス向上を達成できることが示されており、この改善が他のClaudeモデルにも一般化することを示しています。

Tau-Bench評価の「航空会社」ドメインにおけるClaude 3.7 Sonnetのパフォーマンス（4つの異なる構成下）。

Tau-Bench評価の「小売」ドメインにおけるClaude 3.7 Sonnetのパフォーマンス（3つの異なる構成下）。

原文を表示

Extended thinking update

Extended thinking capabilities have improved since its initial release, such that we recommend using that feature instead of a dedicated think tool in most cases. Extended thinking provides similar benefits—giving Claude space to reason through complex problems—with better integration and performance. See our extended thinking documentation for implementation details.

As we continue to enhance Claude's complex problem-solving abilities, we've discovered a particularly effective approach: a "think" tool that creates dedicated space for structured thinking during complex tasks.

This simple yet powerful technique—which, as we’ll explain below, is different from Claude’s new “extended thinking” capability (see here for extended thinking implementation details)—has resulted in remarkable improvements in Claude's agentic tool use ability. This includes following policies, making consistent decisions, and handling multi-step problems, all with minimal implementation overhead.

In this post, we'll explore how to implement the “think” tool on different applications, sharing practical guidance for developers based on verified benchmark results.

What is the "think" tool?

With the "think" tool, we're giving Claude the ability to include an additional thinking step—complete with its own designated space—as part of getting to its final answer.

While it sounds similar to extended thinking, it's a different concept. Extended thinking is all about what Claude does before it starts generating a response. With extended thinking, Claude deeply considers and iterates on its plan before taking action. The "think" tool is for Claude, once it starts generating a response, to add a step to stop and think about whether it has all the information it needs to move forward. This is particularly helpful when performing long chains of tool calls or in long multi-step conversations with the user.

This makes the “think” tool more suitable for cases where Claude does not have all the information needed to formulate its response from the user query alone, and where it needs to process external information (e.g. information in tool call results). The reasoning Claude performs with the “think” tool is less comprehensive than what can be obtained with extended thinking, and is more focused on new information that the model discovers.

We recommend using extended thinking for simpler tool use scenarios like non-sequential tool calls or straightforward instruction following. Extended thinking is also useful for use cases, like coding, math, and physics, when you don’t need Claude to call tools. The “think” tool is better suited for when Claude needs to call complex tools, analyze tool outputs carefully in long chains of tool calls, navigate policy-heavy environments with detailed guidelines, or make sequential decisions where each step builds on previous ones and mistakes are costly.

Here's a sample implementation using the standard tool specification format that comes from τ-Bench:

{ "name": "think", "description": "Use the tool to think about something. It will not obtain new information or change the database, but just append the thought to the log. Use it when complex reasoning or some cache memory is needed.", "input_schema": { "type": "object", "properties": { "thought": { "type": "string", "description": "A thought to think about." } }, "required": ["thought"] } }

CopyPerformance on τ-Bench

We evaluated the "think" tool using τ-bench (tau-bench), a comprehensive benchmark designed to test a model’s ability to use tools in realistic customer service scenarios, where the "think" tool is part of the evaluation’s standard environment.

τ-bench evaluates Claude's ability to:

Navigate realistic conversations with simulated users

Follow complex customer service agent policy guidelines consistently

Use a variety of tools to access and manipulate the environment database

The primary evaluation metric used in τ-bench is pass^k, which measures the probability that all k independent task trials are successful for a given task, averaged across all tasks. Unlike the pass@k metric that is common for other LLM evaluations (which measures if at least one of k trials succeeds), pass^k evaluates consistency and reliability—critical qualities for customer service applications where consistent adherence to policies is essential.

Performance Analysis

Our evaluation compared several different configurations:

Baseline (no "think" tool, no extended thinking mode)

Extended thinking mode alone

"Think" tool alone

"Think" tool with optimized prompt (for airline domain)

The results showed dramatic improvements when Claude 3.7 effectively used the "think" tool in both the “airline” and “retail” customer service domains of the benchmark:

Airline domain: The "think" tool with an optimized prompt achieved 0.570 on the pass^1 metric, compared to just 0.370 for the baseline—a 54% relative improvement;

Retail domain: The "think" tool alone achieves 0.812, compared to 0.783 for the baseline.

Claude 3.7 Sonnet's performance on the "Airline" domain of the Tau-Bench eval

"Think" + Prompt

Extended thinking

Evaluation results across four different configurations. Scores are proportions.

The best performance in the airline domain was achieved by pairing the “think” tool with an optimized prompt that gives examples of the type of reasoning approaches to use when analyzing customer requests. Below is an example of the optimized prompt:

Using the think tool Before taking any action or responding to the user after receiving tool results, use the think tool as a scratchpad to: - List the specific rules that apply to the current request - Check if all required information is collected - Verify that the planned action complies with all policies - Iterate over tool results for correctness Here are some examples of what to iterate over inside the think tool: <think_tool_example_1> User wants to cancel flight ABC123 - Need to verify: user ID, reservation ID, reason - Check cancellation rules: * Is it within 24h of booking? * If not, check ticket class and insurance - Verify no segments flown or are in the past - Plan: collect missing info, verify rules, get confirmation </think_tool_example_1> <think_tool_example_2> User wants to book 3 tickets to NYC with 2 checked bags each - Need user ID to check: * Membership tier for baggage allowance * Which payments methods exist in profile - Baggage calculation: * Economy class × 3 passengers * If regular member: 1 free bag each → 3 extra bags = $150 * If silver member: 2 free bags each → 0 extra bags = $0 * If gold member: 3 free bags each → 0 extra bags = $0 - Payment rules to verify: * Max 1 travel certificate, 1 credit card, 3 gift cards * All payment methods must be in profile * Travel certificate remainder goes to waste - Plan: 1. Get user ID 2. Verify membership level for bag fees 3. Check which payment methods in profile and if their combination is allowed 4. Calculate total: ticket price + any bag fees 5. Get explicit confirmation for booking </think_tool_example_2>

CopyWhat's particularly interesting is how the different approaches compared. Using the “think” tool with the optimized prompt achieved significantly better results over extended thinking mode (which showed similar performance to the unprompted “think” tool). Using the "think" tool alone (without prompting) improved performance over baseline, but still fell short of the optimized approach.

The combination of the "think" tool with optimized prompting delivered the strongest performance by a significant margin, likely due to the high complexity of the airline policy part of the benchmark, where the model benefitted the most from being given examples of how to “think.”

In the retail domain, we also tested various configurations to understand the specific impact of each approach

Claude 3.7 Sonnet's performance on the "Retail" domain of the Tau-Bench eval

"Think" + no prompt

Extended thinking

Evaluation results across three different configurations. Scores are proportions.

The "think" tool achieved the highest pass^1 score of 0.812 even without additional prompting. The retail policy is noticeably easier to navigate compared to the airline domain, and Claude was able to improve just by having a space to think without further guidance.

Key Insights from τ-Bench Analysis

Our detailed analysis revealed several patterns that can help you implement the "think" tool effectively:

Prompting matters significantly on difficult domains. Simply making the "think" tool available might improve performance somewhat, but pairing it with optimized prompting yielded dramatically better results for difficult domains. However, easier domains may benefit from simply having access to “think.”

Improved consistency across trials. The improvements from using “think” were maintained for pass^k up to k=5, indicating that the tool helped Claude handle edge cases and unusual scenarios more effectively.

Performance on SWE-Bench

A similar “think” tool was added to our SWE-bench setup when evaluating Claude 3.7 Sonnet, contributing to the achieved state-of-the-art score of 0.623. The adapted “think” tool definition is given below:

{ "name": "think", "description": "Use the tool to think about something. It will not obtain new information or make any changes to the repository, but just log the thought. Use it when complex reasoning or brainstorming is needed. For example, if you explore the repo and discover the source of a bug, call this tool to brainstorm several unique ways of fixing the bug, and assess which change(s) are likely to be simplest and most effective. Alternatively, if you receive some test results, call this tool to brainstorm ways to fix the failing tests.", "input_schema": { "type": "object", "properties": { "thought": { "type": "string", "description": "Your thoughts." } }, "required": ["thought"] } }

CopyOur experiments (n=30 samples with "think" tool, n=144 samples without) showed the isolated effects of including this tool improved performance by 1.6% on average (Welch's t-test: t(38.89) = 6.71, p < .001, d = 1.47).

When to use the "think" tool

Based on these evaluation results, we've identified specific scenarios where Claude benefits most from the "think" tool:

Tool output analysis. When Claude needs to carefully process the output of previous tool calls before acting and might need to backtrack in its approach;

Policy-heavy environments. When Claude needs to follow detailed guidelines and verify compliance; and

Sequential decision making. When each action builds on previous ones and mistakes are costly (often found in multi-step domains).

Implementation best practices

To get the most out of the "think" tool with Claude, we recommend the following implementation practices based on our τ-bench experiments.

Strategic prompting with domain-specific examples

The most effective approach is to provide clear instructions on when and how to use the "think" tool, such as the one used for the τ-bench airline domain. Providing examples tailored to your specific use case significantly improves how effectively the model uses the "think" tool:

The level of detail expected in the reasoning process;

How to break down complex instructions into actionable steps;

Decision trees for handling common scenarios; and

How to check if all necessary information has been collected.

Place complex guidance in the system prompt

We found that, when they were long and/or complex, including instructions about the "think" tool in the system prompt was more effective than placing them in the tool description itself. This approach provides broader context and helps the model better integrate the thinking process into its overall behavior.

When not to use the "think" tool

Whereas the “think” tool can offer substantial improvements, it is not applicable to all tool use use cases, and does come at the cost of increased prompt length and output tokens. Specifically, we have found the “think” tool does not offer any improvements in the following use cases:

Non-sequential tool calls. If Claude only needs to make a single tool call or multiple parallel calls to complete a task, there is unlikely to be any improvements from adding in “think.”

Simple instruction following. When there are not many constraints to which Claude needs to adhere, and its default behaviour is good enough, there are unlikely to be gains from additional “think”-ing.

Getting started

The "think" tool is a straightforward addition to your Claude implementation that can yield meaningful improvements in just a few steps:

Test with agentic tool use scenarios. Start with challenging use cases—ones where Claude currently struggles with policy compliance or complex reasoning in long tool call chains.

Add the tool definition. Implement a "think" tool customized to your domain. It requires minimal code but enables more structured reasoning. Also consider including instructions on when and how to use the tool, with examples relevant to your domain to the system prompt.

Monitor and refine. Watch how Claude uses the tool in practice, and adjust your prompts to encourage more effective thinking patterns.

The best part is that adding this tool has minimal downside in terms of performance outcomes. It doesn't change external behavior unless Claude decides to use it, and doesn't interfere with your existing tools or workflows.

Our research has demonstrated that the "think" tool can significantly enhance Claude 3.7 Sonnet's performance1 on complex tasks requiring policy adherence and reasoning in long chains of tool calls. “Think” is not a one-size-fits-all solution, but it offers substantial benefits for the correct use cases, all with minimal implementation complexity.

We look forward to seeing how you'll use the "think" tool to build more capable, reliable, and transparent AI systems with Claude.

While our τ-Bench results focused on the improvement of Claude 3.7 Sonnet with the “think” tool, our experiments show Claude 3.5 Sonnet (New) is also able to achieve performance gains with the same configuration as 3.7 Sonnet, indicating that this improvement generalizes to other Claude models as well.

Claude 3.7 Sonnet's performance on the "airline" domain of the Tau-Bench eval under four different configurations.

Performance of Claude 3.7 Sonnet on the "retail" domain of the Tau-Bench eval under three different configurations.

この記事をシェア

TechCrunch AI2026年3月1日 06:05

アンソロピックのClaude、ペンタゴンとの紛争後にApp Storeで第2位に上昇

TechCrunch AI2026年3月1日 23:48

ペンタゴンとの論争後、AnthropicのClaudeがApp Storeで1位に上昇

KDnuggets2026年7月3日 21:00

Python で Claude API を使い始めるガイド

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Anthropic Engineering·2025年3月20日 09:00·約5分

「考える」ツール：Claudeが複雑なツール使用状況で立ち止まって思考できるようにする

#LLM #エージェントシステム #ツール使用 #意思決定 #Anthropic #Claude

TL;DR

AnthropicはClaudeの複雑なツール使用シナリオ向けに「think」ツールを紹介し、拡張思考機能との使い分けを明確にし、実装ガイダンスを提供している。

AI深層分析2026年3月1日 11:45

注目/ 5段階

深度40%

キーポイント

「think」ツールの定義と目的

拡張思考機能との明確な区別

拡張思考は応答生成前の計画立案に焦点を当てるのに対し、「think」ツールは応答生成中の情報不足時の思考に特化しており、用途が異なる。

推奨使用シナリオの明確化

実装ガイダンスの提供

検証済みベンチマーク結果に基づく実践的な開発者向けガイダンスを提供し、異なるアプリケーションでの実装方法を探求している。

τ-Benchでの評価結果

「think」ツールはClaude 3.7のパフォーマンスを大幅に向上させ、航空会社ドメインではベースライン比54%の相対改善を示した。

最適化されたプロンプトの効果

航空会社ドメインでは「think」ツールに最適化されたプロンプトを組み合わせることで最高のパフォーマンスが達成された。

thinkツールの最適化プロンプトの効果

影響分析・編集コメントを表示

影響分析

編集コメント

「think」ツールを使用すべきタイミング

これらの評価結果に基づき、Claudeが「think」ツールから最も恩恵を受ける特定のシナリオを特定しました:

ツール出力分析: Claudeが行動する前に以前のツール呼び出しの出力を慎重に処理する必要があり、アプローチをバックトラックする必要がある可能性がある場合。
ポリシー重視の環境: Claudeが詳細なガイドラインに従い、コンプライアンスを検証する必要がある場合。
逐次的意思決定: 各アクションが以前のものに基づいて構築され、ミスが高くつく場合（多段階ドメインでよく見られます）。

実装のベストプラクティス

Claudeで「think」ツールを最大限に活用するために、私たちのτ-bench実験に基づいて以下の実装プラクティスを推奨します。

1. ドメイン固有の例を用いた戦略的プロンプティング

推論プロセスで期待される詳細レベル
複雑な指示を実行可能なステップに分解する方法
一般的なシナリオを処理するための決定木
必要な情報がすべて収集されたかどうかを確認する方法

2. 複雑なガイダンスはシステムプロンプトに配置する

「think」ツールを使用すべきでない場合

並列呼び出し: Claudeがタスクを完了するために単一のツール呼び出しまたは複数の並列呼び出しのみを必要とする場合、「think」を追加しても改善は見込めません。
単純な指示の遵守: Claudeが従う必要がある制約が多くなく、そのデフォルトの動作が十分に良い場合、追加の「think」による利得は見込めません。

始め方

「think」ツールは、Claude実装への簡単な追加であり、わずかなステップで意味のある改善をもたらすことができます:

エージェンシックなツール使用シナリオでテストする: Claudeが現在ポリシーコンプライアンスや長いツール呼び出しチェーンでの複雑な推論に苦労している挑戦的なユースケースから始めます。
ツール定義を追加する: ドメインにカスタマイズされた「think」ツールを実装します。最小限のコードで済みますが、より構造化された推論を可能にします。また、システムプロンプトに、ツールをいつ、どのように使用するかについての指示と、ドメインに関連する例を含めることを検討してください。
監視と改善: Claudeが実際にツールをどのように使用するかを観察し、より効果的な思考パターンを促すようにプロンプトを調整します。

皆さんが「think」ツールを使用して、Claudeでより能力が高く、信頼性が高く、透明性のあるAIシステムを構築されることを楽しみにしています。

Tau-Bench評価の「航空会社」ドメインにおけるClaude 3.7 Sonnetのパフォーマンス（4つの異なる構成下）。

Tau-Bench評価の「小売」ドメインにおけるClaude 3.7 Sonnetのパフォーマンス（3つの異なる構成下）。

原文を表示

Extended thinking update

In this post, we'll explore how to implement the “think” tool on different applications, sharing practical guidance for developers based on verified benchmark results.

What is the "think" tool?

With the "think" tool, we're giving Claude the ability to include an additional thinking step—complete with its own designated space—as part of getting to its final answer.

Here's a sample implementation using the standard tool specification format that comes from τ-Bench:

CopyPerformance on τ-Bench

τ-bench evaluates Claude's ability to:

Navigate realistic conversations with simulated users

Follow complex customer service agent policy guidelines consistently

Use a variety of tools to access and manipulate the environment database

Performance Analysis

Our evaluation compared several different configurations:

Baseline (no "think" tool, no extended thinking mode)

Extended thinking mode alone

"Think" tool alone

"Think" tool with optimized prompt (for airline domain)

The results showed dramatic improvements when Claude 3.7 effectively used the "think" tool in both the “airline” and “retail” customer service domains of the benchmark:

Airline domain: The "think" tool with an optimized prompt achieved 0.570 on the pass^1 metric, compared to just 0.370 for the baseline—a 54% relative improvement;

Retail domain: The "think" tool alone achieves 0.812, compared to 0.783 for the baseline.

Claude 3.7 Sonnet's performance on the "Airline" domain of the Tau-Bench eval

"Think" + Prompt

Extended thinking

Evaluation results across four different configurations. Scores are proportions.

Using the think tool Before taking any action or responding to the user after receiving tool results, use the think tool as a scratchpad to: - List the specific rules that apply to the current request - Check if all required information is collected - Verify that the planned action complies with all policies - Iterate over tool results for correctness Here are some examples of what to iterate over inside the think tool: <think_tool_example_1> User wants to cancel flight ABC123 - Need to verify: user ID, reservation ID, reason - Check cancellation rules: * Is it within 24h of booking? * If not, check ticket class and insurance - Verify no segments flown or are in the past - Plan: collect missing info, verify rules, get confirmation </think_tool_example_1> <think_tool_example_2> User wants to book 3 tickets to NYC with 2 checked bags each - Need user ID to check: * Membership tier for baggage allowance * Which payments methods exist in profile - Baggage calculation: * Economy class × 3 passengers * If regular member: 1 free bag each → 3 extra bags = $150 * If silver member: 2 free bags each → 0 extra bags = $0 * If gold member: 3 free bags each → 0 extra bags = $0 - Payment rules to verify: * Max 1 travel certificate, 1 credit card, 3 gift cards * All payment methods must be in profile * Travel certificate remainder goes to waste - Plan: 1. Get user ID 2. Verify membership level for bag fees 3. Check which payment methods in profile and if their combination is allowed 4. Calculate total: ticket price + any bag fees 5. Get explicit confirmation for booking </think_tool_example_2>

In the retail domain, we also tested various configurations to understand the specific impact of each approach

Claude 3.7 Sonnet's performance on the "Retail" domain of the Tau-Bench eval

"Think" + no prompt

Extended thinking

Evaluation results across three different configurations. Scores are proportions.

Key Insights from τ-Bench Analysis

Our detailed analysis revealed several patterns that can help you implement the "think" tool effectively:

Performance on SWE-Bench

When to use the "think" tool

Based on these evaluation results, we've identified specific scenarios where Claude benefits most from the "think" tool:

Tool output analysis. When Claude needs to carefully process the output of previous tool calls before acting and might need to backtrack in its approach;

Policy-heavy environments. When Claude needs to follow detailed guidelines and verify compliance; and

Sequential decision making. When each action builds on previous ones and mistakes are costly (often found in multi-step domains).

Implementation best practices

To get the most out of the "think" tool with Claude, we recommend the following implementation practices based on our τ-bench experiments.

Strategic prompting with domain-specific examples

The level of detail expected in the reasoning process;

How to break down complex instructions into actionable steps;

Decision trees for handling common scenarios; and

How to check if all necessary information has been collected.

Place complex guidance in the system prompt

When not to use the "think" tool

Non-sequential tool calls. If Claude only needs to make a single tool call or multiple parallel calls to complete a task, there is unlikely to be any improvements from adding in “think.”

Getting started

The "think" tool is a straightforward addition to your Claude implementation that can yield meaningful improvements in just a few steps:

Test with agentic tool use scenarios. Start with challenging use cases—ones where Claude currently struggles with policy compliance or complex reasoning in long tool call chains.

Monitor and refine. Watch how Claude uses the tool in practice, and adjust your prompts to encourage more effective thinking patterns.

We look forward to seeing how you'll use the "think" tool to build more capable, reliable, and transparent AI systems with Claude.

While our τ-Bench results focused on the improvement of Claude 3.7 Sonnet with the “think” tool, our experiments show Claude 3.5 Sonnet (New) is also able to achieve performance gains with the same configuration as 3.7 Sonnet, indicating that this improvement generalizes to other Claude models as well.

この記事をシェア

TechCrunch AI2026年3月1日 06:05

アンソロピックのClaude、ペンタゴンとの紛争後にApp Storeで第2位に上昇

TechCrunch AI2026年3月1日 23:48

ペンタゴンとの論争後、AnthropicのClaudeがApp Storeで1位に上昇

KDnuggets2026年7月3日 21:00

Python で Claude API を使い始めるガイド

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む