AWS Machine Learning Blog · May 5, 2026, 02:13 · approx. 12 min read

Introducing the agent performance loop: AgentCore Optimization now in preview

#AgentCore #LLM #A/B testing #RAG #AWS

TL;DR

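AWS has added two capabilities to Amazon Bedrock AgentCore: recommendations derived from production trace analysis, and validation of those recommendations through batch evaluation and A/B testing. Together they start a new loop that helps automate AI agent performance improvement.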

AI in-depth analysis · May 5, 2026, 03:03

Importance (out of 5)

Depth: 40%

Key points

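An automated optimization loop for agent quality

Instead of relying on manual debugging and intuition, the new capability analyzes production traces and recommends changes to system prompts and tool descriptions.

Multiple validation mechanisms

Batch evaluation against a pre-defined test dataset is combined with validation through LLM-driven end-user simulation and A/B testing on production traffic.

Large-scale deployment with enforced security

As a platform for building and connecting, at scale, agents that reason, plan, and act across complex workflows, AgentCore speeds up the improvement process while keeping security enforced at the infrastructure layer.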


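Impact analysis

This announcement addresses the "operate and improve" bottleneck in AI agent development and could mark a turning point in making large-scale agent systems practical. By reducing the burden of reading traces by hand and enabling faster, data-backed decisions, it lets product teams respond to market changes much more quickly.

Editorial comment

As an answer to the problems of the agent operations phase, the comprehensive optimization toolset AWS is offering is a notable step forward. In particular, LLM-based simulation lets teams iterate on improvements while lowering the risk to the live environment, which is of immediate value to development teams.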


*Generate recommendations from production traces, validate them with batch evaluation and A/B testing, and ship with confidence.*

AI agents that perform well at launch don’t stay that way. As models evolve, user behavior shifts, and prompts get reused in new contexts they were never designed for, agent quality quietly degrades. In most teams, the improvement process still looks the same: without automatic feedback loops, when a user complains, a developer reads through traces, forms a hypothesis, rewrites the prompt, tests a handful of cases, and ships the fix. Then the cycle repeats, often introducing a new issue for a different user. Up until today, Amazon Bedrock AgentCore provided the pieces for you to debug manually or build custom implementations: check the evaluation scores to detect a quality drop, dive into the traces to determine the root cause, and update the agent with an improved configuration. The developer is the performance engine, relying on intuition rather than on systematic, data-backed evidence. Dedicated science teams and large centralized benchmarks help, but they are neither a practical nor a timely solution for most product teams. Even when you have that machinery, it tends to move on weekly or monthly cycles, while agents drift in production every day.

AgentCore is the platform to build, connect, and optimize agents at scale, with security enforced at the infrastructure layer. Thousands of developers already use AgentCore to build agents that reason, plan, and act across complex workflows. Today we are announcing new capabilities in AgentCore that complete the observe, evaluate, improve loop for agent performance and quality: recommendations and two ways to validate them.

Recommendations analyze production traces and evaluation outputs to optimize your system prompt or tool descriptions for the evaluator you specify. Batch evaluation helps test the recommendation against a pre-defined test dataset and reports aggregate scores, catching regressions on cases you know matter. When hand-authored scenarios aren’t enough, you can also simulate a dataset using an LLM-backed actor to play the role of an end user. A/B testing runs a controlled comparison between versions of an agent through AgentCore Gateway, splitting live production traffic at the percentage you configure and reporting results with confidence intervals and statistical significance. Recommendations propose changes, batch evaluation and A/B testing validate them, and together they replace the manual cycle of reading traces, guessing at fixes, and deploying blind.
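Splitting traffic "at the percentage you configure" can be pictured with a small sketch. The source does not describe how AgentCore Gateway assigns sessions to variants, so the function below is only an assumption: it uses a common technique, hashing a session ID into a bucket, so that the same session always lands on the same variant. The name `assign_variant` and its parameters are hypothetical.

```python
import hashlib


def assign_variant(session_id: str, treatment_percent: float) -> str:
    """Deterministically map a session to 'control' or 'treatment'.

    Illustrative only: this is not the AgentCore Gateway implementation.
    """
    if not 0 <= treatment_percent <= 100:
        raise ValueError("treatment_percent must be between 0 and 100")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Reduce the first 8 bytes of the hash to a bucket in [0, 10000),
    # which gives a resolution of 0.01 percentage points.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "treatment" if bucket < treatment_percent * 100 else "control"


# Route roughly 20% of sessions to the candidate configuration.
sessions = [f"session-{i}" for i in range(10_000)]
treated = sum(assign_variant(s, 20) == "treatment" for s in sessions)
print(f"treatment share: {treated / len(sessions):.3f}")
```

Hashing rather than drawing a random number per request keeps a multi-turn session on one variant, which matters when evaluators score whole sessions.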

“Continuously evaluating and improving agents is essential for driving data-driven value creation. Processes that traditionally required weeks of manual prompt tuning have evolved into rapid, repeatable cycles through the use of AgentCore. By deriving improvement recommendations from production trace data and validating their impact through A/B testing, organizations can optimize performance while ensuring accuracy and effectiveness. This approach enables continuous, highly efficient improvement at scale.” Yoshiharu Okuda, Head of Generative AI Business Strategy Department, NTT DATA

How the loop runs in practice

Here is how the loop runs for a model upgrade scenario. The pattern is the same for any change: a prompt refactor, a tool set update, a framework upgrade.

End-to-end traceability in AgentCore captures every model call, tool invocation, and reasoning step as OpenTelemetry-compatible traces managed using AgentCore Observability. Evaluations score those traces automatically across dimensions like goal success rate, tool selection accuracy, helpfulness, and safety, using built-in evaluators, ground-truth comparisons, or custom LLM-as-judge scoring.

Generate a recommendation. Point the Recommendations API at the CloudWatch Log group where your agent writes traces. Pick the reward signal as the evaluator you want to optimize for, either a built-in evaluator from AgentCore or a custom evaluator you’ve built, and choose what to optimize: the system prompt or the tool descriptions. AgentCore reflects on the traces, considering the provided reward signal, and generates a recommendation aimed at improving the performance on that reward signal. For tool description recommendations, it only sharpens the tool description without touching the tool implementation. The service proposes, and you decide what to take forward into the validation steps.

Package the change as a configuration bundle. Configurations ship as bundles: immutable, versioned snapshots of your agent’s configuration (model ID, system prompt, and tool descriptions), keyed by runtime ARN. Your agent reads its active configuration dynamically at runtime through the AgentCore SDK, so swapping a prompt or a model is a configuration change, not a code change. Create one bundle for your current configuration and another for the recommendation. Bundles are optional. For changes that include code, deploy to a separate runtime endpoint instead.
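The idea of an immutable, versioned snapshot that the agent looks up at runtime can be sketched in plain Python. This is not the AgentCore SDK: `ConfigBundle`, `BundleStore`, the content-hash version scheme, and the placeholder ARN are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ConfigBundle:
    """Immutable snapshot of one agent configuration (hypothetical shape)."""
    runtime_arn: str
    model_id: str
    system_prompt: str
    tool_descriptions: tuple  # ((tool_name, description), ...)

    @property
    def version(self) -> str:
        # Content-derived version: identical settings give the same ID.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class BundleStore:
    """In-memory stand-in for a bundle registry keyed by runtime ARN."""

    def __init__(self) -> None:
        self._bundles = {}  # (runtime_arn, version) -> ConfigBundle
        self._active = {}   # runtime_arn -> active version

    def register(self, bundle: ConfigBundle) -> str:
        self._bundles[(bundle.runtime_arn, bundle.version)] = bundle
        return bundle.version

    def activate(self, runtime_arn: str, version: str) -> None:
        if (runtime_arn, version) not in self._bundles:
            raise KeyError(f"unknown version {version!r} for {runtime_arn}")
        self._active[runtime_arn] = version

    def active(self, runtime_arn: str) -> ConfigBundle:
        return self._bundles[(runtime_arn, self._active[runtime_arn])]


ARN = "arn:aws:example:runtime/market-trends"  # placeholder, not a real ARN
current = ConfigBundle(
    ARN, "model-a", "You are a market analyst.",
    (("get_quote", "Fetch a stock quote."),),
)
# The candidate differs only in its system prompt.
candidate = replace(
    current,
    system_prompt="You are a market analyst. Follow the broker's stated strategy.",
)

store = BundleStore()
v_current = store.register(current)
v_candidate = store.register(candidate)
store.activate(ARN, v_current)
store.activate(ARN, v_candidate)  # promotion is a lookup change, not a deploy
print(store.active(ARN).system_prompt)
```

Because the agent code only ever calls `store.active(...)`, switching or reverting a prompt touches no application code, which is the property the paragraph above describes.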

Validate offline: batch evaluation. Run your agent against a curated dataset using the new bundle, then evaluate the resulting sessions in batch and compare aggregate scores to your baseline. This catches regressions on use cases you have already defined. Teams typically wire batch evaluation into their CI/CD pipelines so no configuration change reaches production without passing their known-good cases.
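A CI gate of the kind described here might look like the following sketch. It assumes each evaluated session has already been reduced to a dictionary of evaluator scores; the evaluator names, the score scale, and the `tolerance` parameter are hypothetical, and the source does not specify how AgentCore aggregates scores.

```python
from statistics import mean


def aggregate(sessions):
    """Average each evaluator's score across all evaluated sessions."""
    evaluators = sorted({name for scores in sessions for name in scores})
    return {
        name: mean(scores[name] for scores in sessions if name in scores)
        for name in evaluators
    }


def find_regressions(baseline, candidate, tolerance=0.0):
    """Return evaluators whose candidate score fell below the baseline."""
    regressions = {}
    for name, base_score in baseline.items():
        cand_score = candidate.get(name)
        if cand_score is None or cand_score < base_score - tolerance:
            regressions[name] = (base_score, cand_score)
    return regressions


def gate_exit_code(regressions):
    """Exit code for a CI step: non-zero blocks the configuration change."""
    return 1 if regressions else 0


# Per-session scores from two batch runs over the same curated dataset.
baseline_sessions = [
    {"goal_success": 1.0, "tool_selection": 1.0},
    {"goal_success": 1.0, "tool_selection": 1.0},
    {"goal_success": 0.0, "tool_selection": 1.0},
    {"goal_success": 1.0, "tool_selection": 1.0},
]
candidate_sessions = [
    {"goal_success": 1.0, "tool_selection": 1.0},
    {"goal_success": 1.0, "tool_selection": 0.0},
    {"goal_success": 1.0, "tool_selection": 1.0},
    {"goal_success": 1.0, "tool_selection": 1.0},
]

baseline = aggregate(baseline_sessions)
candidate = aggregate(candidate_sessions)
regressions = find_regressions(baseline, candidate)
print(regressions, gate_exit_code(regressions))
```

In this made-up data the candidate improves goal success but regresses on tool selection, so the gate fails; a pipeline step would pass the result of `gate_exit_code` to `sys.exit`.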

Validate against live traffic: A/B testing. Configure AgentCore Gateway to split live production traffic between two variants, with the current version as the control and the candidate as the treatment. Variants can be different bundle versions on the same runtime for configuration-only changes, or different gateway targets pointing to separate runtime endpoints for changes that include code. Online evaluation scores every session with your specified evaluators. The A/B test results include confidence intervals and p-values. When you have adequate data to give you confidence in the new version’s performance, stop the test and promote the new variant by setting it as the default. To roll back, pause the test and the agent reverts to its existing configuration.
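To make "confidence intervals and p-values" concrete, here is a minimal sketch of one standard way to compare mean evaluator scores between two variants: a two-sample z-test using a normal approximation. The source does not say which statistical method AgentCore applies, so treat the method, the function name `compare_variants`, and the sample data as assumptions.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def compare_variants(control, treatment, confidence=0.95):
    """Compare mean scores of two variants.

    Returns (difference, confidence interval, two-sided p-value) using a
    normal approximation, which is reasonable for large samples only.
    """
    if len(control) < 2 or len(treatment) < 2:
        raise ValueError("each variant needs at least two scored sessions")
    diff = mean(treatment) - mean(control)
    # Standard error of the difference, allowing unequal variances.
    std_err = sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    if std_err == 0:
        raise ValueError("scores have zero variance; the test is undefined")
    z = diff / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    interval = (diff - z_crit * std_err, diff + z_crit * std_err)
    return diff, interval, p_value


# Made-up per-session goal-success scores (1.0 = success, 0.0 = failure).
control_scores = [0.0, 1.0] * 100               # 50% success, 200 sessions
treatment_scores = [1.0] * 150 + [0.0] * 50     # 75% success, 200 sessions

diff, interval, p_value = compare_variants(control_scores, treatment_scores)
print(f"lift={diff:.2f} ci=({interval[0]:.3f}, {interval[1]:.3f}) p={p_value:.2g}")
```

A confidence interval that excludes zero together with a small p-value is the kind of evidence the paragraph above refers to when it says to stop the test once you have adequate data.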

*“What took weeks of manual prompt iteration is now a repeatable cycle with AgentCore: generate a recommendation from production traces, validate it against live traffic with statistical significance, and deploy the winning configuration. Each cycle produces the baseline data for the next; the improvement process compounds.”* Masashi Shimizu, Senior Managing Director, Nomura Research Institute, Ltd.

Where we’re headed

Today’s preview is developer-triggered by design. You choose when to generate a recommendation, which evaluator to target, and whether to promote the result. Our vision is a flywheel where traces feed evaluations, evaluations surface drift, recommendations turn that signal into a concrete change, and A/B testing proves it works. The winning configuration becomes the new baseline, and the traces it produces are the input for the next cycle.

Over time, the flywheel spins with less effort. Recommendations weigh multiple evaluators together, surfacing trade-offs with evidence. They also expand the optimization surface to skills, proposing new ones or refining existing ones based on production usage. Trace analysis clusters production failures into patterns you can address before they multiply. Monitor alarms launch a recommendation and validation on their own when an evaluator drops below a threshold, landing the result in a review queue. You decide what ships, and the system can do the heavy lifting to get there.

See it in action

The Market Trends Agent sample on GitHub is a market intelligence agent built for investment brokers. It covers real-time stock data, sector analysis, news search, and personalized broker profiles. For an agent serving brokers with different risk profiles, sector interests, and conversational styles, quality degradation is hard to spot and harder to fix without the right tooling.

Walk through the full improvement loop: generate a recommendation that surfaces where the agent fails to personalize advice to a broker’s stated strategy or selects the wrong tool when a query spans multiple sectors. Package the change as a configuration bundle version. Validate the fix with batch evaluation across a curated set of broker conversations. Then A/B test the configuration against real broker sessions with statistical confidence before promoting it to production.

Get started

These capabilities are available in preview today through Amazon Bedrock AgentCore in AWS Regions where AgentCore Evaluations is available. During preview, AgentCore Optimization targets system prompts and tool descriptions for agents deployed on AgentCore Runtime and using AgentCore Observability and Evaluations.

Get started through the AgentCore Console or CLI. Read the documentation and follow the step-by-step tutorials.

About the authors

Amandeep Khurana

Amandeep Khurana is a Principal Product Manager, working on Amazon Bedrock AgentCore, focusing on agent operations and performance tooling. He’s passionate about building products at the cutting edge of technology and helping customers adopt them to solve their business problems.

Nikhil Kandoi

Nikhil Kandoi is a Principal Engineer on the AgentCore team. Nikhil brings deep expertise in building and scaling intelligent systems spanning multiple AI services like AWS Lex, Panorama, and Amazon Q. Today, he focuses on the challenges of deploying and managing AI agents at enterprise scale, making large-scale agent deployments reliable and secure.

Bharathi Srinivasan

Bharathi Srinivasan is a Senior Generative AI Data Scientist at AWS. Bharathi works with enterprise customers on large‑scale generative AI challenges, including robustness and verification of non‑deterministic systems, governance of GenAI and agentic AI platforms, and the quality of dynamic agentic AI systems.
