Amazon Science · June 9, 2026, 02:00 · approx. 22 min read

Bridging intent and execution in agentic systems

#AI Agent #LLM #System Architecture #Reasoning #Benchmarking

TL;DR

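Amazon Science argues that the performance bottleneck of AI agents lies not in the model itself but in the harness that connects intent to execution, and identifies closing the gap between the two as the most important challenge.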

AI deep analysis · June 9, 2026, 03:02

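Key points

Defining the intent-execution gap

As models' reasoning improves, the bottleneck shifts to the harness: intent is mistranslated on the way to execution, and execution results are fed back incompletely. Minimizing this bidirectional mismatch is the key to state-of-the-art (SOTA) performance.

Impact of infrastructure factors on benchmark evaluation

Basic infrastructure parameters such as timeout settings and resource constraints strongly affect results, so a naive comparison of benchmark numbers may not reflect real capability and should be treated with caution.

Introducing Simple Strands Agent (SSA)

To close the gap between the performance reported in agent documentation and the performance of open-source implementations, the authors introduce SSA, a lightweight, customizable single-agent harness that delivers consistent gains across multiple models and benchmarks.

Model-specific characteristics and the importance of codesign

Agent design is not fully model agnostic: different model families show distinct preferences in tool usage and context sensitivity, so model-harness codesign is essential for optimal performance.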

Key quotes

A modern agent combines an LLM with a harness... you can think of the harness as the operating system around the model.

We formalize this bottleneck as the intent-execution gap: the mismatch between what the model intends and what the harness executes, and vice versa.

Benchmaxing... may not necessarily quantify underlying model/harness capability, as it is additionally influenced by the basic infrastructure parameters used during evaluations.

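Impact analysis

This analysis carries an important implication: the focus of AI agent development should shift from "smarter models" to "more robust system architecture." In particular, its point about infrastructure factors in benchmark evaluation could prompt the industry to revisit how it measures real capability. Its discussion of model-specific characteristics may also encourage a move from general-purpose framework development toward custom designs optimized for specific models.

Editorial comment

As model reasoning improves rapidly, this piece highlights how urgent it has become to solve the system-side problems that stand between a model and its full performance. Developers now need to shift their perspective from prompt engineering alone to the codesign of harness and model.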

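AI agent performance is not just a modeling problem; it is fundamentally a systems problem. A modern agent combines an LLM with a harness, software that mediates the LLM's interaction with tools and manages the cycle of reasoning and feedback. You can think of the harness as the operating system around the model. As models improve, the performance bottleneck shifts from the model's ability to reason to the harness's ability to translate model intent into actions and to reflect execution outcomes back to the model. We formalize this bottleneck as the intent-execution gap: the mismatch between what the model intends and what the harness executes, and vice versa. For example, in trying to revise code, a model may intend to edit a single instance of a function, while the harness accidentally modifies multiple instances.

We show that minimizing this bidirectional gap, without any task-specific tuning, is sufficient to achieve state-of-the-art performance across diverse agentic benchmarks, including datasets that test real-world repository patching (SWE-Pro, SWE-Verified) and interactive terminal environments (Terminal-Bench2). While the most visible components of the harness, such as the execution graph (which controls iterations over the thought-action-observation process) and the tools, are natural candidates for improvement, we highlight that seemingly trivial implementation details lead to nontrivial fluctuations in performance. Factors such as environment interaction timeouts, infrastructure stability, and resource constraints also materially affect performance. Thus, benchmaxing, or reporting higher numbers on benchmarks, may not necessarily quantify underlying model/harness capability, as it is additionally influenced by the basic infrastructure parameters used during evaluations.

We also introduce Simple Strands Agent (SSA), a lightweight and customizable single-agent harness designed to close the gap between the performance reported in agent documentation and the performance seen in open-source implementations. SSA achieves consistent gains in performance across multiple models and benchmarks. Finally, we show that effective agent design is not entirely model agnostic. While many principles generalize, different model families exhibit distinct preferences in tool usage, feedback interpretation, and context sensitivity, making model-harness codesign a critical factor in achieving optimal performance.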

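Motivations

It is well established that problem-specific customizations such as tuned prompts, tailored tools, and specialized execution graphs can improve AI models' performance in a controlled setting (fixing all other factors, such as evaluation infrastructure). However, we observed that many such optimizations fail to transfer between models. Improvements that work for one model or version often degrade, disappear, or even regress with newer models. This lack of transferability exposes a deeper issue: many optimizations implicitly overfit the behavior of a specific model. As models improve, these behaviors change, making such gains brittle and noncompounding.

In the context of agents, this suggests a shift in focus: rather than optimizing for current model behavior, we should identify invariant components, that is, design principles that remain effective across model upgrades, benchmarks, and environments. To identify such invariants, we focus on the model-harness interface: the boundary where model outputs are interpreted and executed and where execution outcomes are communicated back to the model. This interface is the primary locus of failure when agent performance degrades across settings. From this perspective, two fundamental questions emerge: Does the harness understand what the model intends to do? Is the model clear about how the harness interpreted its actions? These questions define the core alignment problem between model and harness and characterize the failure modes we analyze in the following sections.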

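Tool-interface failures

We consider the case in which the agent's goal is code generation. Our agent primarily uses a bash tool, which provides access to the computer terminal (for example, to execute code), and a file editor to revise code. The bash tool is extremely powerful and can subsume all the atomic operations of reading, searching, and editing. We make a simple enhancement to manage its outputs when they get too long. Naively truncating the output does not work well because the end of a command's output carries useful information such as job status and command success/failure. Instead, we contain the response length by condensing content in the middle and keeping only a limited number of lines at the beginning and the end.

For reasons of efficiency and better corner-case handling in editing, we use file-editing tools in addition to bash. Our file editor is based on a string-replace mechanism that replaces existing file content with new (model-provided) content to produce edits. While string-replace works well in many cases, we repeatedly observed failure modes that expose the intent-execution gap: the model may have a clear intention, but the harness may not have enough information to execute that intention safely. In these cases, a naive editor does not merely underperform; it can actively damage the working state by applying the wrong edit with high confidence.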
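The output-condensing idea can be sketched as follows. This is a minimal illustration, not SSA's implementation: the function name `condense_output`, the default line counts, and the omission marker are assumptions, and the marker simply drops the middle lines rather than summarizing them.

```python
def condense_output(text: str, head: int = 20, tail: int = 20) -> str:
    """Keep the first `head` and last `tail` lines of a tool output.

    Hypothetical helper: the middle is replaced by a one-line marker so the
    model still sees the start of the output and the final status lines.
    """
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text  # short enough, return unchanged
    omitted = len(lines) - head - tail
    marker = f"[... {omitted} lines omitted ...]"
    # Slice from an explicit index so that tail=0 keeps no trailing lines.
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])
```

Keeping the tail matters because exit status and error summaries usually appear at the end of a command's output.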

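The first failure mode arises when the context of the model's proposed edit appears at multiple locations in the codebase. From the model's perspective, the requested edit may be unambiguous, because it is reasoning about a specific function, block, or error location. But if the harness receives only a raw "replace old text with new text" request, and the old text occurs several times, it cannot reliably infer which occurrence was intended. Naively replacing all matches is dangerous. In practice, the safer behavior is for the harness to alert the model to the ambiguity and request clarification, for example by asking it to expand the current context so that the text to be replaced is unique. This is a small implementation detail, but it sharply improves faithfulness between intended and executed edits.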
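A minimal sketch of that behavior, assuming a string-replace editor that refuses to act unless the old text matches exactly once. The names `EditResult` and `replace_unique` and the message wording are hypothetical, not taken from SSA.

```python
from dataclasses import dataclass


@dataclass
class EditResult:
    ok: bool        # True only if the edit was applied
    content: str    # file content after the call (unchanged on failure)
    message: str    # feedback returned to the model


def replace_unique(content: str, old: str, new: str) -> EditResult:
    """Apply a string replacement only when `old` occurs exactly once."""
    if not old:
        return EditResult(False, content, "Old text is empty; provide the text to replace.")
    count = content.count(old)  # non-overlapping occurrences
    if count == 0:
        return EditResult(False, content, "Old text not found; re-read the file and retry.")
    if count > 1:
        return EditResult(
            False,
            content,
            f"Old text matches {count} locations; expand the surrounding "
            "context so that it is unique.",
        )
    return EditResult(True, content.replace(old, new, 1), "Edit applied.")
```

On an ambiguous request the file is left untouched and the model receives an actionable message instead of a silent multi-site edit.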

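A second failure mode appears when the model proposes only partial lines or short fragments for replacement. Partial-text matching is attractive because it is flexible, but it is also brittle: the same fragment may appear inside comments, string literals, neighboring expressions, or unrelated code paths. Even when the fragment is unique, replacing text that does not constitute a full logical unit (a complete line or well-bounded span) can produce malformed edits. These may be syntactically correct from the editor's point of view but semantically unintended from the model's point of view. We found that requiring stronger text anchors, such as exact line spans, richer surrounding context, or line-aware matching, substantially reduces these accidental edits. Put differently, the harness should not execute underspecified edit requests by guessing.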
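One way to read "line-aware matching" is sketched below: the old text must equal one or more complete, consecutive lines, and that span must occur exactly once. The function name `replace_line_span`, the exact-whitespace comparison, and the return shape are assumptions made for illustration.

```python
def replace_line_span(content: str, old_block: str, new_block: str) -> tuple[bool, str, str]:
    """Replace `old_block` only if it matches whole consecutive lines exactly once.

    Returns (ok, content_after, message). Fragments that cover only part of a
    line never match, so they are rejected instead of being guessed at.
    """
    lines = content.splitlines()
    old_lines = old_block.splitlines()
    if not old_lines:
        return False, content, "Old block is empty."
    n = len(old_lines)
    # Start indices where n consecutive full lines equal the old block.
    starts = [i for i in range(len(lines) - n + 1) if lines[i:i + n] == old_lines]
    if len(starts) != 1:
        return False, content, (
            f"Expected exactly one whole-line match, found {len(starts)}; "
            "provide complete lines with more surrounding context."
        )
    i = starts[0]
    updated = lines[:i] + new_block.splitlines() + lines[i + n:]
    trailing = "\n" if content.endswith("\n") else ""
    return True, "\n".join(updated) + trailing, "Edit applied."
```

Strict whole-line comparison is deliberately conservative; a real editor might relax whitespace handling, at the cost of more guessing.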

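Third, even when an edit is applied successfully, simply returning "edit succeeded" leaves the model underinformed about what the harness changed. This weakens the reverse side of the interaction loop: not only should the model express intent clearly, but it should also be able to verify how that intent was interpreted. To close this loop, we found it useful, after every successful edit, to supply the model with a diff: text indicating what additions and deletions were made and what stayed the same. A diff serves as an immediate confirmation channel: the model can inspect whether the replacement landed in the correct location, whether collateral lines changed, and whether follow-up edits are needed. This seemingly minor feedback mechanism improves reliability because it converts editing from a fire-and-forget action into an observable state transition.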
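A small sketch of diff feedback using Python's standard `difflib`. The function name `render_edit_diff` and the `a/` and `b/` path prefixes are illustrative choices; the source does not say which diff format SSA returns.

```python
import difflib


def render_edit_diff(path: str, before: str, after: str, context: int = 3) -> str:
    """Return a unified diff of an edit, to be sent back to the model
    in place of a bare "edit succeeded" message."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        n=context,  # unchanged context lines shown around each change
    )
    return "".join(diff)
```

For example, `render_edit_diff("m.py", "x = 1\ny = 1\n", "x = 1\ny = 2\n")` yields a hunk that shows `x = 1` as unchanged context, `y = 1` as removed, and `y = 2` as added, which lets the model confirm where the edit landed.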

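A natural question arises: if the diff is provided after a successful edit, why do the first two failure modes require special handling? While the diff does expose unintended changes, it does so after the mistake has already been applied. At that point, the model must decide whether to roll back, repair the unintended edits, or continue execution with a potentially corrupted state. This introduces additional branching in the agent's trajectory and forces it to spend tokens and reasoning effort correcting avoidable errors rather than progressing toward the solution. In other words, every correction step injects additional information into the model's context window, and every piece of information competes for the agent's attention when it generates the next action. Unrelated or unintended edits do not just waste tokens; they actively degrade performance by introducing spurious patterns and relationships, increasing the likelihood that the model forms incorrect associations and drifts away from the original goal.

In contrast, addressing ambiguity and weak anchoring before execution ensures that edits are applied correctly in the first place. This reduces unnecessary exploration, prevents cascading errors, and keeps the context focused on task-relevant signals. In effect, handling the first two failure modes improves correctness at the point of action, while diff feedback improves observability after action. Both are necessary, but they operate at fundamentally different stages of the interaction loop.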

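Reasoning

A less obvious but equally important design consideration is how agents balance internal reasoning with external interactions. Chain-of-thought reasoning is clearly valuable. It allows the model to decompose a problem, plan next steps, and decide which tool to invoke. Without sufficient reasoning, tool usage becomes reactive, leading to shallow exploration, redundant calls, or poor sequencing of actions. However, excessive thinking introduces its own failure mode. When the model spends too long reasoning internally, it begins to form assumptions about the environment rather than verifying them. These assumptions may appear coherent within the model's internal state, but they are often misaligned with the actual system state. As a result, the agent may issue poorly grounded tool calls or skip necessary validation steps altogether. The two demands are in fundamental tension.

Effective agents must continuously reconcile these two demands, and we refer to this balance as tool calling with a reasoning nudge. The idea is to encourage the model to perform just enough reasoning to decide the next action and then prioritize evidence-gathering interactions with the environment over further reasoning. Rather than extending internal chains of thought, the agent is nudged toward validating its hypotheses through tool outputs.

In practice, we did not find a single "golden prompt" that reliably balances reasoning and tool interaction across all model families. For the Claude variants, we found that introducing quantitative guidance (e.g., "make 50+ tool calls" or "ideal tool call count is 100") helps break long reasoning chains and pushes the model toward interacting with the environment. While the exact number of target tool calls is not important, it serves as a useful north star that biases the model toward action. However, in our experiments, this strong nudge was ineffective for other families, such as Gemini and Grok, which often interpret such instructions literally and make empty tool calls in order to meet the target. Such behavior reduces agent quality. Here, we find that a flexible nudge like "You should use tools as much as possible" is sufficient. The principle remains the same: we need to nudge the model to proactively use tools along with the right amount of reasoning.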
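The per-family nudges could be wired into prompt construction roughly as below. The nudge strings paraphrase the examples quoted in the text; the dictionary layout, the family keys, the fallback for families the text does not discuss, and the function `build_system_prompt` are assumptions for illustration.

```python
# Nudge text per model family. Wording follows the examples quoted in the article.
REASONING_NUDGES = {
    "claude": "Make 50+ tool calls. Ideal tool call count is 100.",
    "gemini": "You should use tools as much as possible.",
    "grok": "You should use tools as much as possible.",
}

# Assumption: families not discussed in the text get the flexible nudge.
DEFAULT_NUDGE = "You should use tools as much as possible."


def build_system_prompt(base_prompt: str, model_family: str) -> str:
    """Append the family-specific reasoning nudge to a shared base prompt."""
    nudge = REASONING_NUDGES.get(model_family.lower(), DEFAULT_NUDGE)
    return f"{base_prompt.rstrip()}\n\n{nudge}"
```

The base prompt stays shared across families; only the final nudge line varies, which keeps the adaptation small and easy to audit.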

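Tool use preferences

Across agents, tools function in exactly the same way, but models tend to exhibit distinct preferences in how they invoke them. For example, GPT models prefer to update code by using an apply_patch command to splice in text from a separate file, formatted in a particular way; denying them their formatting preferences hurts performance. Similarly, for Grok-4.20, a single monolithic tool for editing and viewing creates confusion, which leads to incorrect tool calls. Splitting functionality into atomic operations yields better results, even when the functionality remains unchanged. Additionally, viewing line numbers in a file helps most models, but Grok's tokenizer and attention mechanism appeared less robust at separating prefixes from line numbers, and disabling this feature improves its use of the view tool.

These preferences are a by-product of training. This reinforces a broader design principle: agent performance is a function of not only what tools are available but how naturally those tools align with the model's learned behaviors. A well-designed harness meets the model where it is, adapting interfaces, feedback, and interaction patterns to its strengths while still enforcing the invariants needed for reliable execution.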
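As a rough sketch, such preferences can be captured in a small per-family tool profile while the tool logic itself stays shared. `ToolProfile`, `PROFILES`, the field names, and the line-number format are hypothetical; only the Grok-related settings reflect what the text reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolProfile:
    show_line_numbers: bool = True      # helps most models, per the text
    split_view_and_edit: bool = False   # expose atomic tools instead of one monolith


# Hypothetical registry: only the differences from the default are listed.
PROFILES = {
    "grok": ToolProfile(show_line_numbers=False, split_view_and_edit=True),
}


def view_file(text: str, profile: ToolProfile) -> str:
    """Render file content for the model, with optional line-number prefixes."""
    lines = text.splitlines()
    if not profile.show_line_numbers:
        return "\n".join(lines)
    width = len(str(len(lines)))  # pad numbers so the prefixes align
    return "\n".join(f"{i:>{width}} | {line}" for i, line in enumerate(lines, 1))
```

Here `view_file("a\nb", ToolProfile())` prefixes each line with its number, while `view_file("a\nb", PROFILES["grok"])` returns the text unchanged.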

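Benchmarking study

SSA is a simple harness that implements many of the principles we describe above. We evaluated it on three agentic benchmarks: SWE-Bench-Verified (n = 500), SWE-Bench-Pro (public set, n = 731), and Terminal-Bench-2 (n = 89). Each example in SWE-Bench-Verified and SWE-Bench-Pro is an open-source code repository and an "issue" to be fixed by making a code change. Terminal-Bench-2 tackles a range of programming tasks (software engineering, machine learning, security, etc.) but is not tied to a code repository. All three benchmarks have individual, static, prewritten tests for evaluating generated code. In SWE-Bench-Verified and SWE-Bench-Pro, the runs and evaluations occur in separate container images, meaning changes must be transferred into a different evaluation environment; in Terminal-Bench-2, the evaluation happens in the same container. Therefore, in SWE problems, it may be necessary to exclude irrelevant artifacts so that the diff patch is not bloated. Additionally, Terminal-Bench-2 imposes computational and agent-runtime limits that the SWE benchmarks do not.

We evaluate our SSA agents using metrics standard in the field. Note that the mini-swe-agent results reported above in the SWE-Bench-Verified graph and the Terminus results reported in the Terminal-Bench-2 graph correspond to a fixed agent configuration per benchmark: the exact same prompts, tool specifications, and structured-output instructions. As we discuss above, however, different model families require different reasoning nudges and exhibit distinct preferences for tool use. As a result, while SSA's core harness remains identical, there are minimal but nonzero differences in prompts and tool specifications across model families (e.g., Claude, Gemini, GPT, Grok).

Our goal in building SSA was not to optimize separate agents per model but to identify minimal, orthogonal adaptations that allow different model families to express their strongest capabilities within a shared harness framework.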

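Terminal-Bench-2

Unlike SWE-Bench-Verified and SWE-Bench-Pro, the Terminal-Bench-2 dataset restricts the agent's environment by limiting computational capacity (memory, storage, number of CPUs) and time (both agent and verifier run times) per project. While this is effective in limiting disproportionate use of computational resources to boost benchmark scores, it has the unintended side effect of making the benchmark more sensitive to infrastructure choices.

We observed that, given those restrictions, the following system characteristics have the most impact:

Reliability of the inference backend. The inference backend's capacity (tokens per minute and requests per minute) should be able to support all concurrently run projects for the full duration of the evaluation. High variance in invocation latency, frequent API timeouts, and retries eat into the allowed time budget, leading to more timeouts and a lower resolution rate.

The number of concurrent projects run on a single node. This affects the network bandwidth available to each project. One of the first steps for an agent in Terminal-Bench-2 is to install dependencies (popular libraries like pip, torch, transformers, etc.). If the evaluation infrastructure is set up so that multiple projects run on a single node (e.g., Harbor with n_concurrent > 1), the node's available network bandwidth is shared across all the concurrent projects. This increases the download times for dependencies, leaving the agent with less time for problem solving and a higher risk of being interrupted before it is done.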
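The bandwidth effect can be made concrete with a back-of-the-envelope model. Everything here is an assumption for illustration: the function name, the equal-sharing worst case in which all projects download at once, and the sample numbers, none of which come from the article.

```python
def remaining_agent_time(
    timeout_s: float,
    download_mb: float,
    node_bandwidth_mbps: float,
    n_concurrent: int,
) -> float:
    """Seconds left for problem solving after dependency downloads.

    Simplified model: node bandwidth is split evenly across concurrent
    projects, and every project downloads at the same time.
    """
    per_project_mbps = node_bandwidth_mbps / n_concurrent
    download_s = download_mb * 8 / per_project_mbps  # megabytes -> megabits
    return max(0.0, timeout_s - download_s)


# Illustrative numbers only: 900 s budget, 2000 MB of dependencies, 1000 Mbps node.
print(remaining_agent_time(900, 2000, 1000, n_concurrent=1))   # 884.0
print(remaining_agent_time(900, 2000, 1000, n_concurrent=10))  # 740.0
```

Under this toy model, moving from 1 to 10 concurrent projects turns a 16-second download into a 160-second one, all of it taken out of the agent's fixed time budget.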

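Since the majority of tool calls involve command-line instructions, a natural way to address timeouts is to introduce a batch interface, allowing the agent to execute multiple commands in a single turn rather than executing them sequentially. In our experiments, however, the results of this approach were mixed, and they trace back to one of the failure modes we describe above: the balance between reasoning and tool interaction. While batching reduces interaction overhead, it also requires the model to maintain a coherent terminal state across multiple steps, which increases reasoning complexity. For Claude models, the time taken by additional autoregressive reasoning tends to offset the gains from batching. In contrast, for other model families (such as Gemini and Grok), batch execution was beneficial, as it did not trigger additional reasoning. Overall, under constrained settings, batching commands does not consistently improve performance across all models.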
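A batch interface might look like the following minimal sketch, which runs all commands in one POSIX shell so that state such as the working directory carries over, and stops at the first failure. The function name `run_batch`, the use of `sh` with `set -e`, and the result shape are assumptions; the article does not describe SSA's actual batch tool.

```python
import subprocess


def run_batch(commands: list[str], timeout_s: float = 60.0) -> dict:
    """Run several shell commands in a single turn.

    All commands share one `sh` process, so `cd` and variables persist
    between them; `set -e` aborts the batch at the first failing command.
    """
    script = "set -e\n" + "\n".join(commands)
    try:
        proc = subprocess.run(
            ["sh", "-c", script],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave stderr with stdout
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"exit_code": None, "output": f"Batch timed out after {timeout_s} s."}
    return {"exit_code": proc.returncode, "output": proc.stdout}
```

For instance, `run_batch(["cd /", "pwd"])` reports `/` because both commands run in the same shell, whereas `run_batch(["false", "echo unreachable"])` stops before the second command. The trade-off noted in the text still applies: the model has to predict the terminal state across the whole batch before seeing any output.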

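Given that evaluations are sensitive to such confounding factors, we next assess the upper-bound potential of the agent-model combination by relaxing time constraints. Specifically, we compare SSA's performance on Terminal-Bench-2 under constrained settings (as shown above) and unconstrained settings, where memory and agent timeouts are removed. The unconstrained setup serves as an estimate of the achievable performance ceiling. The gap in accuracy between the constrained and unconstrained evaluations is typically 5-10%. We note that in our experiments, out of the 89 total projects in Terminal-Bench-2, a few consistently have a high timeout rate in the constrained evaluation but a high solve rate in the unconstrained setting. Those projects are make-doom-for-mips, torch-pipeline-parallelism, gpt2-codegolf, caffe-cifar-10, and train-fasttext.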

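Experimental methodology

We evaluate SSA across multiple agent benchmarks under a controlled and reproducible setup. All experiments were conducted on an AWS PCS cluster using c7.48xlarge instances, with maximum concurrency set to 10 to balance throughput and system stability. For model access, Claude models were served via Amazon Bedrock (production capacity), while OpenAI, Gemini, and Grok models were accessed through their respective commercial APIs.

We enforced strict evaluation hygiene. Internet access was disabled for SWE-Bench-Verified and SWE-Bench-Pro runs, while it was enabled for Terminal-Bench-2 due to its benchmark design. For SWE-Bench-Verified and SWE-Bench-Pro, we used the standard benchmarking Docker environments, which include repository state up to the point of the current code revision. This allows agents access to the relevant history of the codebase while ensuring no access to future revisions.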

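Evaluation-specific issues

In SWE-Bench-Verified, instances such as astropy-8872 and astropy-8707 fail even with flawless code patches due to setup inconsistencies and require fixes in the evaluation environment. Additionally, some psf_requests instances can fail intermittently due to external test dependencies (e.g., nonresponsive URLs), requiring manual patching for reliable evaluation.

For SWE-Bench-Pro, evaluations were executed on Amazon ECS. Due to environment-specific assumptions, a small subset of tests (3 out of 731 instances) consistently fail when run on AWS infrastructure, result…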
