AIニュース最前線

Amazon Science · June 9, 2026, 02:00 · approx. 15 min read

Bridging intent and execution in agentic systems

#AI Agent #LLM #System Architecture #Reasoning #Benchmarking
TL;DR

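Amazon Science argues that the performance bottleneck in AI agents lies not in the model itself but in the harness that connects intent to execution, and it identifies closing the gap between the two as the most important challenge.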

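AI deep analysis · June 9, 2026, 03:02
Importance: 4 on a 5-point scale
Depth (weight 40%): 5
Relevance (weight 30%): 5
Practicality (weight 20%): 4
Novelty (weight 10%): 4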

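Key points

1. Defining the intent-execution gap

As model reasoning improves, the bottleneck shifts to the harness: intent gets mistranslated into actions, and execution results are fed back poorly. Minimizing this bidirectional mismatch is the key to state-of-the-art performance.

2. Infrastructure effects on benchmark evaluation

Basic infrastructure parameters such as timeout settings and resource constraints strongly affect results, so a direct comparison of benchmark numbers may not reflect real capability.

3. Introducing Simple Strands Agent (SSA)

To close the gap between the performance reported in agent documentation and the performance seen in open-source implementations, the authors introduce SSA, a lightweight and customizable single-agent harness that delivers consistent gains across multiple models and benchmarks.

4. Model-specific traits and the importance of codesign

Agent design is not entirely model agnostic. Different model families show distinct preferences in tool usage and context sensitivity, so model-harness codesign is essential for optimal performance.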


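Impact analysis

This analysis suggests that AI agent development should shift its focus from "smarter models" to "more robust system architecture." In particular, by pointing out how infrastructure factors affect benchmark results, it could prompt the industry to revisit how real capability is measured. Its discussion of model-specific traits may also encourage a move away from general-purpose frameworks toward custom designs optimized for particular models.

Editorial comment

As model reasoning improves rapidly, this is a sharp observation that the pressing problems now sit on the system side, in drawing out that capability in full. Developers need to move their attention from prompt engineering alone to the codesign of harness and model.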

AI agent performance is not just a modeling problem; it is fundamentally a systems problem. A modern agent combines an LLM with a harness, software that mediates the LLM’s interaction with tools and manages the cycle of reasoning and feedback: you can think of the harness as the operating system around the model. As models improve, the performance bottleneck shifts from the model’s ability to reason to the harness’s ability to translate model intent into actions and reflect execution outcomes back to the model.

We formalize this bottleneck as the intent-execution gap: the mismatch between what the model intends and what the harness executes, and vice versa. For example, in trying to revise code, a model may intend to edit a single instance of a function, while the harness accidentally modifies multiple instances. We show that minimizing this bidirectional gap, without any task-specific tuning, is sufficient to achieve state-of-the-art performance across diverse agentic benchmarks, including datasets that test real-world repository patching (SWE-Bench-Pro, SWE-Bench-Verified) and interactive terminal environments (Terminal-Bench-2).

While the most visible components of the harness, such as tools and the execution graph (which controls iterations over the thought-action-observation process), are natural candidates for improvement, we highlight that seemingly trivial implementation details lead to nontrivial fluctuations in performance. Factors such as environment interaction timeouts, infrastructure stability, and resource constraints also materially affect performance. Thus, benchmaxing, or reporting higher numbers on benchmarks, may not necessarily quantify underlying model/harness capability, as it is additionally influenced by the basic infrastructure parameters used during evaluations.

We also introduce Simple Strands Agent (SSA), a lightweight and customizable single-agent harness designed to close the gap between the performance reported in agent documentation and the performance seen in open-source implementations. SSA achieves consistent gains in performance across multiple models and benchmarks. Finally, we show that effective agent design is not entirely model agnostic. While many principles generalize, different model families exhibit distinct preferences in tool usage, feedback interpretation, and context sensitivity, making model-harness codesign a critical factor in achieving optimal performance.

Motivations

It is well established that problem-specific customizations such as tuned prompts, tailored tools, and specialized execution graphs can improve AI models’ performance in a controlled setting (fixing all other factors, such as evaluation infrastructure). However, we observed that many such optimizations fail to transfer between models. Improvements that work for one model or version often degrade, disappear, or even regress with newer models. This lack of transferability exposes a deeper issue: many optimizations implicitly overfit the behavior of a specific model. As models improve, these behaviors change, making such gains brittle and noncompounding.

In the context of agents, this suggests a shift in focus: rather than optimizing for current model behavior, we should identify invariant components, that is, design principles that remain effective across model upgrades, benchmarks, and environments. To identify such invariants, we focus on the model-harness interface: the boundary where model outputs are interpreted and executed and where execution outcomes are communicated back to the model. This interface is the primary locus of failure when agent performance degrades across settings. From this perspective, two fundamental questions emerge:

Does the harness understand what the model intends to do?
Is the model clear about how the harness interpreted its actions?

These questions define the core alignment problem between model and harness and characterize the failure modes we analyze in the following sections.

Tool-interface failures

We consider the case in which the agent’s goal is code generation. Our agent primarily uses a bash tool, which provides access to the computer terminal (for example, to execute code), and a file editor to revise code. The bash tool is extremely powerful and can subsume all the atomic operations of reading, searching, and editing. We make a simple enhancement to manage its outputs when they get too long. Naïvely truncating the output does not work well because the end of a command execution confirmation carries useful information such as job status and command success/failure. Instead, we contain the response length by condensing content in the middle and keeping only a limited number of lines at the beginning and the end.

For reasons of efficiency and better corner-case handling in editing, we use file-editing tools in addition to bash. Our file editor is based on a string-replace mechanism that replaces existing file content with new (model-provided) content to produce edits. While string-replace works well in many cases, we repeatedly observed failure modes that expose the intent-execution gap: the model may have a clear intention, but the harness may not have enough information to execute that intention safely. In these cases, a naïve editor does not merely underperform; it can actively damage the working state by applying the wrong edit with high confidence.

The first failure mode arises when the context of the model’s proposed edit appears at multiple locations in the codebase. From the model’s perspective, the requested edit may be unambiguous, because it is reasoning about a specific function, block, or error location.
But if the harness receives only a raw “replace old text with new text” request, and the old text occurs several times, it cannot reliably infer which occurrence was intended. Naïvely replacing all matches is dangerous. In practice, the safer behavior is for the harness to alert the model of the ambiguity and request clarification, for example by asking it to expand the current context such that the text to be replaced is unique. This is a small implementation detail, but it sharply improves faithfulness between intended and executed edits.

A second failure mode appears when the model proposes only partial lines or short fragments for replacement. Partial-text matching is attractive because it is flexible, but it is also brittle: the same fragment may appear inside comments, string literals, neighboring expressions, or unrelated code paths. Even when the fragment is unique, replacing text that does not constitute a full logical unit (a complete line or well-bounded span) can produce malformed edits. These may be syntactically correct from the editor’s point of view but semantically unintended from the model’s point of view. We found that requiring stronger text anchors, such as exact line spans, richer surrounding context, or line-aware matching, substantially reduces these accidental edits. Put differently, the harness should not execute underspecified edit requests by guessing.

Third, even when an edit is applied successfully, simply returning “edit succeeded” leaves the model underinformed about what the harness changed. This weakens the reverse side of the interaction loop: not only should the model express intent clearly, but it should also be able to verify how that intent was interpreted. To close this loop, we found it useful, after every successful edit, to supply the model with a diff file: a text file indicating what additions and deletions had been made and what text stayed the same.
A diff serves as an immediate confirmation channel: the model can inspect whether the replacement landed in the correct location, whether collateral lines changed, and whether follow-up edits are needed. This seemingly minor feedback mechanism improves reliability because it converts editing from a fire-and-forget action into an observable state transition.

A natural question arises: if the diff is provided after a successful edit, why do the first two failure modes require special handling? While the diff does expose unintended changes, it does so after the mistake has already been applied. At that point, the model must decide whether to roll back, repair the unintended edits, or continue execution with a potentially corrupted state. This introduces additional branching in the agent’s trajectory and forces it to spend tokens and reasoning effort correcting avoidable errors, rather than progressing toward the solution. In other words, every correction step injects additional information into the model’s context window. Note that every piece of information competes for the agent’s attention for next-action generation. Unrelated or unintended edits do not just waste tokens; they actively degrade performance by introducing spurious patterns and relationships, increasing the likelihood that the model forms incorrect associations and drifts away from the original goal.

In contrast, addressing ambiguity and weak anchoring before execution ensures that edits are applied correctly in the first place. This reduces unnecessary exploration, prevents cascading errors, and keeps the context focused on task-relevant signals. In effect, handling the first two failure modes improves correctness at the point of action, while diff feedback improves observability after action. Both are necessary, but they operate at fundamentally different stages of the interaction loop.
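
The mechanisms described in this section can be sketched in a few lines of Python. This is a minimal illustration and not SSA's actual code: the names `condense_output`, `apply_edit`, and `EditRejected`, the default line limits, and the wording of the rejection messages are all assumptions made for the example. It shows middle-condensing of long command output, rejection of ambiguous or partial-line edit requests before anything is written, and a unified diff returned after a successful edit.

```python
import difflib
from typing import Tuple


class EditRejected(Exception):
    """Raised instead of guessing when an edit request is underspecified."""


def condense_output(text: str, head: int = 20, tail: int = 20) -> str:
    """Keep the first `head` and last `tail` lines; condense the middle."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text
    omitted = len(lines) - head - tail
    marker = f"... [{omitted} lines omitted] ..."
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])


def apply_edit(content: str, old: str, new: str) -> Tuple[str, str]:
    """Replace `old` with `new` only if the request is unambiguous.

    Returns the updated content and a unified diff to show the model.
    """
    count = content.count(old) if old else 0
    if count == 0:
        raise EditRejected("old text not found; re-read the file and retry")
    if count > 1:
        # Failure mode 1: several matches, so ask for more context.
        raise EditRejected(
            f"old text matches {count} locations; expand the context until it is unique"
        )
    start = content.index(old)
    end = start + len(old)
    starts_line = start == 0 or content[start - 1] == "\n"
    ends_line = end == len(content) or old.endswith("\n") or content[end] == "\n"
    if not (starts_line and ends_line):
        # Failure mode 2: a fragment inside a line is a weak anchor.
        raise EditRejected("old text must cover whole lines, not a fragment")
    updated = content[:start] + new + content[end:]
    # Failure mode 3: report exactly what changed, not just "edit succeeded".
    diff = "".join(
        difflib.unified_diff(
            content.splitlines(keepends=True),
            updated.splitlines(keepends=True),
            fromfile="before",
            tofile="after",
        )
    )
    return updated, diff
```

In a harness built this way, the message carried by `EditRejected` would be returned to the model as the tool result, so the model can resubmit the edit with a wider, unique context; the diff string plays the same role after a successful edit.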
Reasoning

A less obvious but equally important design consideration is how agents balance internal reasoning with external interactions. Chain-of-thought reasoning is clearly valuable. It allows the model to decompose a problem, plan next steps, and decide which tool to invoke. Without sufficient reasoning, tool usage becomes reactive, leading to shallow exploration, redundant calls, or poor sequencing of actions.

However, excessive thinking introduces its own failure mode. When the model spends too long reasoning internally, it begins to form assumptions about the environment rather than verifying them. These assumptions may appear coherent within the model’s internal state, but they are often misaligned with the actual system state. As a result, the agent may issue poorly grounded tool calls or skip necessary validation steps altogether, creating a fundamental tension.

Effective agents must continuously reconcile these two demands, and we refer to this balance as tool calling with a reasoning nudge. The idea is to encourage the model to perform just enough reasoning to decide the next action and then prioritize evidence-gathering interactions with the environment over further reasoning. Rather than extending internal chains of thought, the agent is nudged toward validating its hypotheses through tool outputs.

In practice, we did not find a single “golden prompt” that reliably balances reasoning and tool interaction across all model families. For the Claude variants, we found that introducing quantitative guidance (e.g., “make 50+ tool calls” or “ideal tool call count is 100”) helps break long reasoning chains and pushes the model toward interacting with the environment. While the exact number of target tool calls is not important, it serves as a useful north star that biases the model toward action.
However, in our experiments, this strong nudge was ineffective for other families, such as Gemini and Grok, which often interpret such instructions literally and make empty tool calls in order to meet the target. Such behavior reduces agent quality. Here, we find that using a flexible nudge like “You should use tools as much as possible” works just fine. The principle remains the same: we need to nudge the model to proactively use tools along with the right amount of reasoning.

Tool use preferences

Across agents, tools function in exactly the same way, but models tend to exhibit distinct preferences in how they invoke them. For example, GPT models prefer to update code by using an apply_patch command to splice in text from a separate file, formatted in a particular way; denying them their formatting preferences hurts performance. Similarly, for Grok-4.20, a single monolithic tool for editing and viewing creates confusion, which leads to incorrect tool calls. Splitting functionality into atomic operations yields better results, even when the functionality remains unchanged. Additionally, viewing line numbers in a file helps most models, but Grok’s tokenizer and attention mechanism appeared less robust at separating prefixes from line numbers, and disabling this feature improves its use of the view tool. These preferences are a by-product of training.

This reinforces a broader design principle: agent performance is a function of not only what tools are available but how naturally those tools align with the model’s learned behaviors. A well-designed harness meets the model where it is, adapting interfaces, feedback, and interaction patterns to its strengths while still enforcing the invariants needed for reliable execution.

Benchmarking study

SSA is a simple harness that implements many of the principles we describe above. We evaluated it on three agentic benchmarks: SWE-Bench-Verified (n = 500), SWE-Bench-Pro (public set, n = 731), and Terminal-Bench-2 (n = 89).
Each example in SWE-Bench-Verified and SWE-Bench-Pro is an open-source code repository and an “issue” to be fixed by making a code change. Terminal-Bench-2 tackles a range of programming tasks (software engineering, machine learning, security, etc.) but is not tied to a code repository. All three benchmarks have individual, static, prewritten tests for evaluating generated code.

In SWE-Bench-Verified and SWE-Bench-Pro, the runs and evaluations occur in separate container images, meaning changes must be transferred into a different evaluation environment; in Terminal-Bench-2, the evaluation happens in the same container. Therefore, in SWE problems, it may be necessary to exclude irrelevant artifacts so as not to bloat the diff patch. Additionally, Terminal-Bench-2 imposes computational and agent-runtime limits that the SWE benchmarks do not. We evaluate our SSA agents using metrics standard in the field.

Note that the mini-swe-agent results reported above in the SWE-Bench-Verified graph and the Terminus results reported in the Terminal-Bench-2 graph correspond to a fixed agent configuration per benchmark: the exact same prompts, tool specifications, and structural output instructions. As we discuss above, however, different model families require different reasoning nudges and exhibit distinct preferences for tool use. As a result, while SSA’s core harness remains identical, there are minimal but nonzero differences in prompts and tool specifications across model families (e.g., Claude, Gemini, GPT, Grok). Our goal in building SSA was not to optimize separate agents per model but to identify minimal, orthogonal adaptations that allow different model families to express their strongest capabilities within a shared harness framework.
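
One way to picture "a shared harness with minimal per-family adaptations" is a single default configuration plus a small table of overrides. The sketch below is purely illustrative: `FamilyConfig`, its field names, and `build_system_prompt` are hypothetical and do not describe SSA's real configuration format. The override values only restate preferences reported in the text (a quantitative nudge for Claude, a flexible nudge for Gemini and Grok, `apply_patch` for GPT, split tools and no line numbers for Grok); where the article is silent, such as the nudge used for GPT, the field is left unset.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FamilyConfig:
    """Per-family overrides on one shared harness (all names hypothetical)."""
    reasoning_nudge: Optional[str] = None  # None: not stated in the article
    edit_tool: str = "str_replace"         # which edit interface to expose
    split_edit_and_view: bool = False      # atomic tools instead of one tool
    show_line_numbers: bool = True         # line-number prefixes in the view tool


BASE = FamilyConfig()
FLEXIBLE_NUDGE = "You should use tools as much as possible."

# Only the fields that differ from BASE are overridden for each family.
FAMILY_OVERRIDES = {
    "claude": replace(BASE, reasoning_nudge="Ideal tool call count is 100."),
    "gemini": replace(BASE, reasoning_nudge=FLEXIBLE_NUDGE),
    "gpt": replace(BASE, edit_tool="apply_patch"),
    "grok": replace(
        BASE,
        reasoning_nudge=FLEXIBLE_NUDGE,
        split_edit_and_view=True,
        show_line_numbers=False,
    ),
}


def config_for(family: str) -> FamilyConfig:
    """Return the overrides for a family, falling back to the shared default."""
    return FAMILY_OVERRIDES.get(family.lower(), BASE)


def build_system_prompt(base_prompt: str, family: str) -> str:
    """Append the family's reasoning nudge, if any, to the shared prompt."""
    nudge = config_for(family).reasoning_nudge
    return base_prompt if nudge is None else f"{base_prompt}\n\n{nudge}"
```

Keeping the overrides this small makes it easy to see which differences are per-family adaptations and which behavior belongs to the shared harness.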
Terminal-Bench-2

Unlike SWE-Bench-Verified and SWE-Bench-Pro, the Terminal-Bench-2 dataset restricts the agent’s environment by limiting computational capacity (memory, storage, number of CPUs) and time (both agent and verifier run times) per project. While this is effective in limiting disproportionate use of computational resources to boost benchmark scores, it does have the unintended side effect of making the benchmark more sensitive to infrastructure choices. We observed that, given those restrictions, the following system characteristics have the most impact:

Reliability of the inference backend. The inference backend’s capacity (tokens per minute and requests per minute) should be able to support all concurrently run projects for the full duration of the evaluation. High variance in invoker latency, frequent API timeouts, and retries eat into the allowed time budget, leading to more timeouts and a lower resolution rate.

The number of concurrent projects run on a single node. This affects the network bandwidth available to each project. One of the first steps for an agent in Terminal-Bench-2 is to install dependencies (popular libraries like pip, torch, transformers, etc.). If the evaluation infrastructure is set up in such a way that multiple projects are run on a single node (e.g., Harbor with n_concurrent > 1), the available network bandwidth for each node is shared across all the concurrent projects. This increases the download times for dependencies, leaving the agent with less time for problem solving and a higher risk of getting interrupted before it’s done.

Since the majority of tool calls involve command-line instructions, a natural way to address timeouts is to introduce a batch interface, allowing the agent to execute multiple commands in a single turn, rather than executing them sequentially.
In our experiments, however, the results of this approach were mixed and correspond to one of the failure modes we describe above: the balance between reasoning and tool interaction. While batching reduces interaction overhead, it also requires the model to maintain a coherent terminal state across multiple steps, which increases reasoning complexity. For Claude models, the time taken by additional autoregressive reasoning tends to offset the gains from batching. In contrast, for other model families (such as Gemini and Grok), batch execution was beneficial, as it did not trigger additional reasoning. Overall, under constrained settings, batching commands does not consistently improve performance across all models.

Given that evaluations are sensitive to such confounding factors, we next assess the upper-bound potential of the agent-model combination by relaxing time constraints. Specifically, we compare SSA’s performance on Terminal-Bench-2 under constrained settings (as shown above) and unconstrained settings, where memory and agent timeouts are removed. The unconstrained setup serves as an estimate of the achievable performance ceiling. The gap in accuracy between the constrained and unconstrained evaluations is typically 5-10%. We note that in our experiments, out of the 89 total projects in Terminal-Bench-2, a few consistently have a high timeout rate in the constrained evaluation but a high solve rate in the unconstrained setting. Those projects are make-doom-for-mips, torch-pipeline-parallelism, gpt2-codegolf, caffe-cifar-10, and train-fasttext.

Experimental methodology

We evaluate SSA across multiple agent benchmarks under a controlled and reproducible setup. All experiments were conducted on an AWS PCS cluster using c7.48xlarge instances, with maximum concurrency set to 10 to balance throughput and system stability.
For model access, Claude models were served via Amazon Bedrock (production capacity), while OpenAI, Gemini, and Grok models were accessed through their respective commercial APIs. We enforced strict evaluation hygiene. Internet access was disabled for SWE-Bench-Verified and SWE-Bench-Pro runs, while it was enabled for Terminal-Bench-2 due to its benchmark design. For SWE-Bench-Verified and SWE-Bench-Pro, we used the standard benchmarking Docker environments, which include repository state up to the point of the current code revision. This allows agents access to the relevant history of the codebase while ensuring no access to future revisions.

Evaluation-specific issues

In SWE-Bench-Verified, instances such as astropy-8872 and astropy-8707 fail even with flawless code patches due to setup inconsistencies and require fixes in the evaluation environment. Additionally, some psf_requests instances can fail intermittently due to external test dependencies (e.g., nonresponsive URLs), requiring manual patching for reliable evaluation. For SWE-Bench-Pro, evaluations were executed on Amazon ECS. Due to environment-specific assumptions, a small subset of tests (3 out of 731 instances) consistently fail when run on AWS infrastructure, result
