AWS Machine Learning Blog · May 6, 2026, 01:54 · about 15 min

OS-level actions added to Amazon Bedrock AgentCore Browser

#AgentCore #Vision AI #Automation #AWS #OS Integration

TL;DR

AWS has added OS Level Actions to Amazon Bedrock AgentCore Browser, enabling interaction with OS-specific dialogs and keyboard shortcuts that DOM-based automation could not reach.

AI in-depth analysis · May 6, 2026, 02:04

Importance: / 5

Depth: 40%

Key points

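Limits of the DOM and the OS-layer problem

Playwright and CDP work only inside the web layer. They have a fundamental limitation: they cannot reach native UI that the OS renders, such as print dialogs, security prompts, and context menus.

Introduction of OS Level Actions

Through the InvokeBrowser API, agents can drive the OS mouse and keyboard directly and act on native UI that is visible on the screen.

Stronger vision-based agents

In the loop where an AI model decides what to do from screenshots, the agent can act directly on OS-level elements the model recognizes without going through the DOM, which is expected to raise automation success rates in production.

Implementation and mechanics

The feature works with existing configurations without additional setup and is addressed by session ID. It establishes an action-screenshot-reaction loop in which the screen is captured after each action to check the result.

Required IAM permissions and resource setup

Starting a browser session requires an execution role with specific permissions such as bedrock-agentcore:InvokeBrowser, plus a custom browser resource created in advance.

Parameters at session start

viewPort sets the screen resolution and so defines the coordinate space for mouse events; sessionTimeoutSeconds sets when the session terminates automatically.

Coordinate range and validation

The valid coordinate range follows from the screen resolution (for example, x runs from 0 to 1919 at 1920×1080), and out-of-range values return a ValidationException.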

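Impact analysis

This announcement marks an important turning point: it shows that AI agents can act autonomously at the OS level, not only inside the web browser. It removes a bottleneck where automation fails on production-specific OS constraints and security prompts that are hard to reproduce in test environments, and it makes automating complex real-world workflows considerably more practical.

Editorial comment

A significant update that breaks through the limits of conventional web automation tools and lets AI agents act autonomously beyond the boundary of the browser.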


AI agents that automate web workflows operate within the browser’s web layer, the DOM that Playwright and the Chrome DevTools Protocol (CDP) expose. AgentCore Browser provides a secure, isolated browser environment for this, and it works well for the vast majority of automation: navigating pages, filling forms, clicking elements, extracting content. But the web layer has a hard boundary. Anything that the operating system renders (native dialogs, security prompts, certificate choosers, context menus, even Chrome settings) sits outside the DOM entirely. CDP can’t see it, and Playwright can’t interact with it.

When a web application calls window.print() and a system print dialog appears, Playwright has no DOM to interact with. When a workflow requires a keyboard shortcut or a right-click context menu, CDP has no mechanism to issue those commands at the OS level. When a browser session encounters a macOS privacy dialog, a Windows Security prompt, or a certificate chooser, they’re invisible to the web automation layer. These scenarios tend to surface in production, where they’re triggered by specific application states, OS configurations, or user permissions. They rarely appear in testing, where web content is predictable enough to validate against.

The challenge compounds for vision-enabled agents. A common architecture is to capture a screenshot, send it to a model, receive back coordinates or instructions, and execute. This loop works well for web content, but breaks the moment that native UI appears. The screenshot captures it, the model reasons about it, and then there’s nothing to act with. CDP can’t reach what the OS rendered. The agent sees exactly what to do and has no way to do it.

We’re announcing OS Level Actions for AgentCore Browser. This new capability unblocks these scenarios by exposing direct OS control through the InvokeBrowser API, so agents can interact with content visible on the screen, not only what’s accessible through the browser’s web layer. By combining full-desktop screenshots with mouse and keyboard control at the OS level, agents can observe native UI, reason about it, and act on it within the same session. This post walks through how OS Level Actions work, what actions are supported, and how to get started.

How OS Level Actions work

OS Level Actions are available for new and existing browser configurations without further setup. After a session is active, you dispatch actions through the InvokeBrowser API. Each call carries exactly one action, identified by its type and arguments, and returns a SUCCESS or FAILED status. The active session is identified using the x-amzn-browser-session-id header, which ties each OS-level action to the correct browser session.
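To make the request shape concrete, here is a minimal sketch of assembling one InvokeBrowser call. The header name and the `{"action": {...}}` body come from this post; the function names, the `Content-Type` header, and the assumption that the status string sits under a `status` key are illustrative and not confirmed by the source.

```python
import json

# Header name taken from the text above.
SESSION_HEADER = "x-amzn-browser-session-id"


def build_invoke_request(session_id: str, action: dict) -> tuple[dict, str]:
    """Return (headers, JSON body) for a single InvokeBrowser call.

    Hypothetical helper: it only assembles data and sends nothing.
    """
    if len(action) != 1:
        raise ValueError("each call must carry exactly one action")
    headers = {SESSION_HEADER: session_id, "Content-Type": "application/json"}
    body = json.dumps({"action": action})
    return headers, body


def succeeded(result: dict) -> bool:
    # Assumption: the SUCCESS/FAILED string is stored under a "status" key.
    return result.get("status") == "SUCCESS"
```

Rejecting a payload with zero or several actions on the client side keeps the one-action-per-call rule visible in agent code before a request is signed and sent.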

The expected interaction pattern is an action-screenshot-reaction loop. The agent takes an action (click, type, shortcut), captures a screenshot to observe the current state of the screen, and then decides the next action based on what it sees. This loop allows the agent to react to dynamic UI. This includes native dialogs and OS prompts that might appear mid-workflow.

1. Agent sends an action through InvokeBrowser. This can be a mouse click, key press, or shortcut.
2. AgentCore executes the action on the full OS desktop and returns SUCCESS or FAILED.
3. Agent requests a screenshot to observe the current screen state.
4. AgentCore captures the full desktop, including native dialogs, OS modals, and UI outside the browser window, and returns a base64-encoded PNG.
5. Agent reasons about the screenshot by sending it to a vision model to determine what happened and what to do next.
6. Agent sends the next action based on what it observed, continuing the loop.
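The loop described above can be sketched independently of any transport. In this sketch, `send_action`, `take_screenshot`, and `decide` are hypothetical callables that you would wire to InvokeBrowser and to your vision model; the `max_steps` guard is an added safety assumption, not part of the service.

```python
from typing import Callable, Optional


def run_loop(
    send_action: Callable[[dict], dict],
    take_screenshot: Callable[[], bytes],
    decide: Callable[[bytes], Optional[dict]],
    first_action: dict,
    max_steps: int = 20,
) -> int:
    """Act, observe, decide until `decide` returns None.

    Returns the number of actions that were sent.
    """
    action: Optional[dict] = first_action
    steps = 0
    while action is not None and steps < max_steps:
        send_action(action)        # act: dispatch exactly one action
        image = take_screenshot()  # observe: capture the full desktop
        action = decide(image)     # decide: the model picks the next action
        steps += 1
    return steps
```

Passing the three callables in keeps the control flow testable without a live session, and the step cap stops an agent that never converges.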

Supported actions

OS Level Actions are organized into three categories: mouse control, keyboard input, and visual capture. The following table summarizes eight actions with their fields and constraints.

| Action | Required fields | Optional fields | Notes |
| --- | --- | --- | --- |
| mouseClick | none | x, y, button, clickCount | Defaults to current position, LEFT, single click. clickCount: 1–10. |
| mouseMove | x, y | none | Moves cursor to coordinates. |
| mouseDrag | endX, endY | startX, startY, button | Drags from start to end. button defaults to LEFT. |
| mouseScroll | none | x, y, deltaX, deltaY | deltaY negative = scroll down. Range: -1000 to 1000. |
| keyType | text | none | Types a string. Max 10,000 characters. |
| keyPress | key | presses | Presses a key N times. presses: 1–100, defaults to 1. |
| keyShortcut | keys | none | Key combination array. Up to five keys, for example, ["ctrl", "a"]. |
| screenshot | none | format | Captures full OS desktop. Returns base64-encoded PNG. |
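The constraints in the table lend themselves to a client-side pre-check before a call is dispatched. The following is a sketch under stated assumptions: `validate_action` is a hypothetical helper, the service's own validation remains authoritative, and applying the -1000 to 1000 range to deltaX as well as deltaY is an assumption.

```python
def validate_action(action: dict) -> list[str]:
    """Return the constraint violations for one action payload (empty if none)."""
    if len(action) != 1:
        return ["exactly one action per call"]
    name, args = next(iter(action.items()))
    errors: list[str] = []

    def in_range(field: str, lo: int, hi: int) -> None:
        value = args.get(field)
        if value is not None and not lo <= value <= hi:
            errors.append(f"{name}.{field} must be in [{lo}, {hi}]")

    def require(*fields: str) -> None:
        for field in fields:
            if field not in args:
                errors.append(f"{name}.{field} is required")

    if name == "mouseClick":
        in_range("clickCount", 1, 10)
    elif name == "mouseMove":
        require("x", "y")
    elif name == "mouseDrag":
        require("endX", "endY")
    elif name == "mouseScroll":
        in_range("deltaX", -1000, 1000)  # assumption: same range as deltaY
        in_range("deltaY", -1000, 1000)
    elif name == "keyType":
        require("text")
        if len(args.get("text", "")) > 10_000:
            errors.append("keyType.text exceeds 10,000 characters")
    elif name == "keyPress":
        require("key")
        in_range("presses", 1, 100)
    elif name == "keyShortcut":
        require("keys")
        if len(args.get("keys", [])) > 5:
            errors.append("keyShortcut.keys allows at most 5 keys")
    elif name != "screenshot":
        errors.append(f"unknown action {name!r}")
    return errors
```

Returning a list instead of raising lets an agent log every problem with a model-proposed action at once and ask the model for a corrected one.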

Mouse actions

Mouse actions cover the full range of pointer interactions: clicking, moving, dragging, and scrolling. Coordinate fields are optional for mouseClick. If omitted, the click lands at the current cursor position with a left button single click. This is useful when a prior mouseMove has already positioned the cursor. mouseDrag requires the four coordinates, start and end positions. mouseScroll accepts a position and delta values for both axes—negative deltaY scrolls down, positive scrolls up. A right-click context menu, for example, is a single mouseClick with button set to RIGHT at the target coordinates. Note that some context menu items might not function as expected because of the virtualized environment in which the browser session runs.
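As an illustration of the mouse payloads described here, the following sketch builds action dictionaries in the `{"mouseClick": {...}}` shape that the notebook's invoke helper receives later in this post. The builder functions themselves are hypothetical conveniences, and the unit of the scroll delta is not stated in the source.

```python
def right_click(x: int, y: int) -> dict:
    """A context-menu click: one mouseClick with button set to RIGHT."""
    return {"mouseClick": {"x": x, "y": y, "button": "RIGHT"}}


def scroll_down(x: int, y: int, amount: int) -> dict:
    """Scroll down at (x, y); negative deltaY means down."""
    return {"mouseScroll": {"x": x, "y": y, "deltaX": 0, "deltaY": -abs(amount)}}


def drag(start: tuple[int, int], end: tuple[int, int]) -> dict:
    """Drag with all four coordinates given explicitly."""
    return {
        "mouseDrag": {
            "startX": start[0], "startY": start[1],
            "endX": end[0], "endY": end[1],
        }
    }


def move_then_click(x: int, y: int) -> list[dict]:
    """Position the cursor first, then click at the current position."""
    return [{"mouseMove": {"x": x, "y": y}}, {"mouseClick": {}}]
```

Each element of the list returned by `move_then_click` would be sent as its own call, since a call carries exactly one action.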

Keyboard actions

The three keyboard actions cover different levels of input. keyType is for typing text. It sends characters directly and handles strings up to 10,000 characters. keyPress is for individual keys that must be pressed repeatedly, such as tab to advance through form fields or escape to dismiss a modal. keyShortcut is for combinations—pass an array of key names and AgentCore presses them simultaneously.

Key names for keyPress and keyShortcut must be lowercase. Supported keys include single characters (a–z, 0–9) and named keys such as enter, tab, space, backspace, delete, escape, ctrl, alt, and shift.

To select the entire text, for example, you would use keyShortcut with ["ctrl", "a"].

code

{
  "action": {
    "keyShortcut": {
      "keys": ["ctrl", "a"]
    }
  }
}
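To show how the three keyboard actions divide the work, here is a small sketch that composes them into a form-filling sequence. The builders and `fill_form` are hypothetical; only the action names, field names, and the lowercase rule for key names come from the text.

```python
def key_type(text: str) -> dict:
    """Type literal text."""
    return {"keyType": {"text": text}}


def key_press(key: str, presses: int = 1) -> dict:
    """Press one named key N times; key names must be lowercase."""
    return {"keyPress": {"key": key.lower(), "presses": presses}}


def key_shortcut(*keys: str) -> dict:
    """Press a combination of keys together."""
    return {"keyShortcut": {"keys": [k.lower() for k in keys]}}


def fill_form(values: list[str]) -> list[dict]:
    """Type each value, tab between fields, and press enter at the end."""
    actions: list[dict] = []
    for index, value in enumerate(values):
        if index:
            actions.append(key_press("tab"))
        actions.append(key_type(value))
    actions.append(key_press("enter"))
    return actions
```

Lowercasing inside the builders means a model reply such as "Ctrl" or "Tab" still yields a key name in the form the API expects.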

Screenshot

The screenshot action captures the full OS desktop and returns a base64-encoded PNG in the response. It’s the only action that returns data. The other actions return only a status (SUCCESS or FAILED) and an error field on failure.

code

{
   "action":{
      "screenshot":{
         "format":"PNG"
      }
   }
}
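Because the screenshot arrives as base64 text, an agent usually decodes it before passing it to a model or saving it. This sketch assumes the response nesting `result` → `screenshot` → `data`, which is the path the companion notebook code in this post reads; `extract_png` and `save_screenshot` are hypothetical names.

```python
import base64
from pathlib import Path

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_png(response_json: dict) -> bytes:
    """Decode the screenshot payload and check that it looks like a PNG."""
    data = response_json["result"]["screenshot"]["data"]
    png = base64.b64decode(data)
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("screenshot payload is not a PNG")
    return png


def save_screenshot(response_json: dict, path: str) -> Path:
    """Write the decoded screenshot to disk and return its path."""
    target = Path(path)
    target.write_bytes(extract_png(response_json))
    return target
```

The signature check is a cheap guard against handing an error body or truncated payload to the vision model.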

Getting started

The following examples walk through the action-screenshot-reaction loop, matching the companion notebook. For the full working notebook with eight actions demonstrated end to end, start there.

Set up clients and create a browser

You need two clients: a control plane client (bedrock-agentcore-control) for managing browser resources, and a data plane client (bedrock-agentcore) for dispatching actions during a session.

code

import boto3
import time

browser_boto3 = boto3.client('bedrock-agentcore-control', region_name='us-west-2')

BROWSER_NAME = "browser_with_os_actions"

Before starting a session, you need an AWS Identity and Access Management (IAM) execution role and a browser resource. The execution role requires bedrock-agentcore:InvokeBrowser, bedrock-agentcore:StartBrowserSession, and bedrock-agentcore:StopBrowserSession permissions. The companion notebook includes a helper that creates this role for you:

code

from helpers.utils import create_agentcore_execution_role, SAMPLE_ROLE_NAME

execution_role_arn = create_agentcore_execution_role(SAMPLE_ROLE_NAME)
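If you prefer to see what such a role grants, here is a rough sketch of a permissions policy covering the three actions named above. The helper's actual policy is not shown in this post, so treat this as an assumption: `Resource` is left as `"*"` for illustration and should be narrowed to your browser resources, and the role's trust policy is out of scope here.

```python
import json

# The three actions listed in the text above.
REQUIRED_ACTIONS = [
    "bedrock-agentcore:InvokeBrowser",
    "bedrock-agentcore:StartBrowserSession",
    "bedrock-agentcore:StopBrowserSession",
]


def execution_policy(resource: str = "*") -> dict:
    """Build an IAM policy document allowing the browser session actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(REQUIRED_ACTIONS),
                "Resource": resource,  # assumption: scope this down in practice
            }
        ],
    }


def policy_json(resource: str = "*") -> str:
    """Serialize the policy for use as a policy document string."""
    return json.dumps(execution_policy(resource), indent=2)
```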

With the role created, create a custom browser:

code

created_browser = browser_boto3.create_browser(
    name=BROWSER_NAME,
    executionRoleArn=execution_role_arn,
    networkConfiguration={
        'networkMode': 'PUBLIC'
    }
)

browser_id = created_browser['browserId']
print(f"Browser ID: {browser_id}")

Start a browser session

With the browser resource created, start a session. The viewPort sets the screen resolution. This determines the coordinate space for mouse actions and the dimensions of captured screenshots. The sessionTimeoutSeconds controls how long the session stays alive before it’s automatically terminated.
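One way to keep these two parameters together in agent code is a small configuration object that also tracks how much session time remains. This is a sketch with assumptions: the `width` and `height` keys nested under `viewPort` and the default values are not stated in this post, and the class is hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SessionConfig:
    width: int = 1920            # assumed default, for illustration only
    height: int = 1080
    timeout_seconds: int = 900   # assumed default, for illustration only
    started_at: float = field(default_factory=time.monotonic)

    def request_fields(self) -> dict:
        """Fields to include when starting the session (nesting is assumed)."""
        return {
            "viewPort": {"width": self.width, "height": self.height},
            "sessionTimeoutSeconds": self.timeout_seconds,
        }

    def seconds_left(self, now: Optional[float] = None) -> float:
        """Local estimate of the time before the session is terminated."""
        now = time.monotonic() if now is None else now
        return max(0.0, self.timeout_seconds - (now - self.started_at))
```

An agent can check `seconds_left()` before starting a long sequence of actions and start a fresh session if too little time remains.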

code

# These helpers are included in the companion notebook repository
from helpers.browser import get_credentials, invoke, start_session, stop_session

creds, default_region = get_credentials()
BEDROCK_AGENTCORE_DP_ENDPOINT = f"https://bedrock-agentcore.{default_region}.amazonaws.com/"

sid = start_session(BEDROCK_AGENTCORE_DP_ENDPOINT, browser_id, region=default_region, credentials=creds)

# Wait for session to initialize — adjust if needed for your environment
time.sleep(3)

The start_session helper sends a SigV4-signed PUT request to create the session and returns the sessionId. The invoke helper handles signing and dispatching individual actions.
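The fixed `time.sleep(3)` is a simple way to wait, but a bounded poll adapts better to slow or fast startups. The sketch below assumes you supply a `probe` callable that returns True once the session responds (for example, a screenshot call that succeeds); that probe is hypothetical and not part of the notebook helpers.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the function testable without real delays.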

Invoke an OS-level action

With the session running, you can dispatch OS-level actions through the invoke helper. Each call takes a single action — in this case, a left click at coordinates (600, 370) on the screen:

code

r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"mouseClick": {"x": 600, "y": 370, "button": "LEFT"}},
    region=default_region, credentials=creds, browser_id=browser_id
)

print(f"Mouse click status: {r.status_code}, action: {r.json()['result']}")

The response tells you whether the action succeeded or failed. Coordinates map to screen pixels: if the session viewport is 1920×1080, valid x values range from 0 to 1919 and y values from 0 to 1079. Coordinates outside the screen dimensions return a ValidationException.
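The coordinate rule can be checked locally before a click is sent, which avoids a round trip that would end in a ValidationException. This is a minimal sketch; `in_viewport` and `checked_click` are hypothetical helpers, and the default size matches the 1920×1080 example above.

```python
def in_viewport(x: int, y: int, width: int, height: int) -> bool:
    """True if (x, y) is a valid pixel: 0 <= x < width and 0 <= y < height."""
    return 0 <= x < width and 0 <= y < height


def checked_click(
    x: int, y: int, width: int = 1920, height: int = 1080, button: str = "LEFT"
) -> dict:
    """Build a mouseClick payload, rejecting out-of-range coordinates early."""
    if not in_viewport(x, y, width, height):
        raise ValueError(f"({x}, {y}) is outside the {width}x{height} viewport")
    return {"mouseClick": {"x": x, "y": y, "button": button}}
```

This matters most when coordinates come from a vision model that saw a resized screenshot, because scaled coordinates can land just past the edge.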

Capture a screenshot

After each action, the agent must observe what happened. The screenshot action captures the full desktop and returns the image as a base64-encoded PNG:

code

import base64
from IPython.display import Image, display

r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"screenshot": {"format": "PNG"}},
    region=default_region, credentials=creds, browser_id=browser_id
)

img_bytes = base64.b64decode(r.json()['result']['screenshot']['data'])
display(Image(img_bytes))

This is the observation step in the loop. The agent sends the screenshot to a vision model, which reasons about what’s on screen and returns the next action to take. The cycle repeats until the workflow is complete.

Putting it together: dismissing a print dialog

Here is the action-screenshot-reaction loop in practice. Suppose the agent navigates to a page that triggers window.print(), and a native print dialog appears. The agent can’t interact with it through CDP, but it can with OS Level Actions. First, the agent captures a screenshot to see the current state of the screen:

code

r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"screenshot": {"format": "PNG"}},
    region=default_region, credentials=creds, browser_id=browser_id
)

# Send the screenshot to a vision model to identify the dialog and locate the Cancel button.
# The vision model integration depends on your agent architecture — see the Bedrock
# InvokeModel API for how to send images to Claude or other models.
# The model returns coordinates, e.g.: {"x": 410, "y": 535}

The vision model identifies the print dialog and returns the coordinates of the Cancel button. The agent selects it:

code

r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"mouseClick": {"x": 410, "y": 535, "button": "LEFT"}},
    region=default_region, credentials=creds, browser_id=browser_id
)

print(f"Click status: {r.status_code}, action: {r.json()['result']}")

The agent takes another screenshot to confirm that the dialog was dismissed, and the workflow continues.
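In the walkthrough the model's answer is a small JSON object of coordinates. A sketch of turning such a reply into a click payload follows; it assumes the model may wrap the JSON in prose and that the object is flat with `x` and `y` keys, and `click_from_model_reply` is a hypothetical helper.

```python
import json
import re


def click_from_model_reply(reply: str, button: str = "LEFT") -> dict:
    """Extract the first flat JSON object from a reply and build a mouseClick."""
    match = re.search(r"\{[^{}]*\}", reply)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    coords = json.loads(match.group(0))
    return {
        "mouseClick": {
            "x": int(coords["x"]),
            "y": int(coords["y"]),
            "button": button,
        }
    }
```

Parsing strictly and failing loudly is deliberate: a click at guessed coordinates on a native dialog can confirm the wrong choice.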

Stop the session and clean up

When the workflow is done, stop the session and clean up resources:

code

stop_session(BEDROCK_AGENTCORE_DP_ENDPOINT, sid, browser_id, region=default_region, credentials=creds)
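If the workflow raises partway through, the session keeps running until its timeout. One way to guard against that is a context manager that always stops the session. In this sketch `start` and `stop` are hypothetical callables that you would bind to the notebook's start_session and stop_session helpers.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def browser_session(
    start: Callable[[], str], stop: Callable[[str], None]
) -> Iterator[str]:
    """Yield a session ID and stop the session even if the body raises."""
    sid = start()
    try:
        yield sid
    finally:
        stop(sid)
```

For example, `with browser_session(start, stop) as sid:` would wrap the whole action-screenshot-reaction loop, so cleanup does not depend on the happy path.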

To delete the browser resource and IAM role:

code

browser_boto3.delete_browser(browserId=browser_id)
print(f"Browser {browser_id} deleted")

from helpers.utils import delete_agentcore_execution_role, SAMPLE_ROLE_NAME
delete_agentcore_execution_role(SAMPLE_ROLE_NAME)

These steps (act, observe, decide) form the core of the action-screenshot-reaction pattern. The companion notebook walks through all eight supported actions with a live browser session, including mouse drag, scroll, keyboard input, and shortcut combinations.

Conclusion

When we launched Amazon Bedrock AgentCore Browser, it gave AI agents a fully managed, cloud-based browser environment to interact with websites. It navigated pages, extracted content, and automated workflows at scale through Playwright and CDP. OS Level Actions extend that capability beyond the web layer to UI elements visible on the screen. Native dialogs, security prompts, keyboard shortcuts, and browser chrome are no longer blockers. Agents can now observe, reason about, and act on the full OS desktop within the same session.

Combined with AgentCore Browser’s existing capabilities like visual understanding and framework integration with Playwright and Amazon Nova Act, OS Level Actions close the last gap in browser automation coverage.

To start building:

Follow the Amazon Bedrock AgentCore Developer Guide

Try the companion notebook for a hands-on walkthrough

For broader context on browser automation, see the Amazon Bedrock AgentCore Browser documentation

About the authors

Evandro Franco

Evandro Franco is a Sr. Data Scientist at Amazon Web Services. He is part of the Global GTM team that helps AWS customers overcome business challenges related to AI/ML on AWS, mainly with Amazon Bedrock AgentCore and Strands Agents. He has more than 18 years of experience in technology, spanning software development, infrastructure, serverless, and machine learning. In his free time, Evandro enjoys playing with his son, mainly building fun Lego creations.

Phelipe Fabres

Phelipe Fabres is a Sr. Solutions Architect for Generative AI at AWS for Startups.

この記事をシェア

Vercel Blog2026年6月26日 17:00

Vercel CLI から Web アナリティクスデータを照会可能に

TechCrunch AI重要度42026年6月25日 21:00

Amazon、インドでの AI インフラへの新たな 130 億ドル投資で賭けを強化

TLDR AI重要度42026年6月25日 09:00

Gemini 3.5 Flash にコンピュータ操作機能を導入

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

AWS Machine Learning Blog·2026年5月6日 01:54·約15分

Amazon Bedrock AgentCore Browser に OS レベルの操作機能を追加

#AgentCore #Vision AI #Automation #AWS #OS Integration

TL;DR

AI深層分析2026年5月6日 02:04

重要/ 5段階

深度40%

キーポイント

DOM の限界と OS レイヤーの課題

OS レベルアクションの導入

ビジョンベースエージェントの強化

実装と動作メカニズム

必要なIAM権限とリソース設定

セッション開始時のパラメータ制御

viewPortで画面解像度を指定してマウスイベントの座標空間を定義し、sessionTimeoutSecondsでセッションの自動終了時間を設定します。

座標範囲と検証

画面の解像度に基づき有効な座標範囲（例：1920×1080ならxは0-1919）が定義されており、範囲外の値はValidationExceptionを返します。

影響分析・編集コメントを表示

影響分析

編集コメント

従来の Web 自動化ツールの限界を打破し、AI エージェントが OS の壁を越えて自律的に行動できる道を開いた画期的なアップデートです。

OS レベルアクションの仕組み

エージェントがアクションを送信します。これは InvokeBrowser を使用したマウスクリック、キー入力、またはショートカットです。
AgentCore がフル OS デスクトップ上でアクションを実行し、SUCCESS または FAILED を返します。
エージェントは現在の画面状態を観察するためにスクリーンショットを要求します。
AgentCore はネイティブダイアログ、OS モーダル、ブラウザウィンドウ外の UI を含むフルデスクトップをキャプチャし、base64 符号化された PNG を返します。
エージェントはスクリーンショットについて推論を行い、それをビジョンモデル（vision model）に送信して何が起こったか、そして次に何をすべきかを判断します。
エージェントは観察した内容に基づいて次のアクションを送信し、ループを続けます。

サポートされているアクション

Action	Required fields	Optional fields	Notes

| mouseClick | — | x, y, button, clickCount | デフォルトは現在の位置、LEFT、シングルクリック。clickCount: 1–10。

| mouseMove | x, y | — | カーソルを指定座標へ移動します。

mouseDrag

endX, endY

startX, startY, button

開始点から終了点へドラッグします。button はデフォルトで LEFT です。

mouseScroll

—

x, y, deltaX, deltaY

deltaY が負の値 = 下方向スクロール。範囲：-1000 から 1000。

keyType

text

—

文字列を入力します。最大 10,000 文字。

keypress

key

presses

キーを N 回押します。presses は 1–100 で、デフォルトは 1 です。

keyShortcut

keys

—

キーの組み合わせ配列です。最大 5 キーまで。例：["ctrl", "a"]。

screenshot

—

format

OS デスクトップ全体をキャプチャします。base64 でエンコードされた PNG を返します。

マウス操作

キーボード操作

例えば、すべてのテキストを選択するには、keyShortcut を ["ctrl", "a"] で使用します。

{

"action": {

"keyShortcut": {

"keys": ["ctrl", "a"]

}

スクリーンショット

{

"action":{

"screenshot":{

"format":"PNG"

}

はじめに

クライアントの設定とブラウザの作成

import boto3

import time

browser_boto3 = boto3.client('bedrock-agentcore-control', region_name='us-west-2')

BROWSER_NAME = "browser_with_os_actions"

from helpers.utils import create_agentcore_execution_role, SAMPLE_ROLE_NAME

execution_role_arn = create_agentcore_execution_role(SAMPLE_ROLE_NAME)

ロールが作成されたら、カスタムブラウザを作成します:

created_browser = browser_boto3.create_browser(

name=BROWSER_NAME,

executionRoleArn=execution_role_arn,

networkConfiguration={

'networkMode': 'PUBLIC'

}

)

browser_id = created_browser['browserId']

print(f"Browser ID: {browser_id}")

ブラウザセッションを開始する

これらのヘルパー関数は、コンパニオンノートブックリポジトリに含まれています

from helpers.browser import get_credentials, invoke, start_session, stop_session

creds, default_region = get_credentials()

BEDROCK_AGENTCORE_DP_ENDPOINT = f"https://bedrock-agentcore.{default_region}.amazonaws.com/"

sid = start_session(BEDROCK_AGENTCORE_DP_ENDPOINT, browser_id, region=default_region, credentials=creds)

セッションの初期化を待機 — 環境に応じて必要に応じて調整してください

time.sleep(3)

OS レベルのアクションを呼び出す

r = invoke(

BEDROCK_AGENTCORE_DP_ENDPOINT, sid,

{"mouseClick": {"x": 600, "y": 370, "button": "LEFT"}},

region=default_region, credentials=creds, browser_id=browser_id

)

print(f"マウスクリックステータス：{r.status_code}, アクション：{r.json()['result']}")

スクリーンショットのキャプチャ

import base64

from IPython.display import Image, display

r = invoke(

BEDROCK_AGENTCORE_DP_ENDPOINT, sid,

{"screenshot": {"format": "PNG"}},

region=default_region, credentials=creds, browser_id=browser_id

)

img_bytes = base64.b64decode(r.json()['result']['screenshot']['data'])

display(Image(img_bytes))

実践：印刷ダイアログの閉じ方

r = invoke(

BEDROCK_AGENTCORE_DP_ENDPOINT, sid,

{"screenshot": {"format": "PNG"}},

region=default_region, credentials=creds, browser_id=browser_id

)

スクリーンショットをビジョンモデルに送信し、ダイアログを特定してキャンセルボタンを検出します。

ビジョンモデルとの統合はエージェントのアーキテクチャに依存します — 画像を Claude や他のモデルに送信する方法については、Bedrock InvokeModel API を参照してください。

モデルは座標を返します。例：{"x": 410, "y": 535}

ビジョンモデルが印刷ダイアログを特定し、キャンセルボタンの座標を返します。エージェントがこれを選択します:

r = invoke(

BEDROCK_AGENTCORE_DP_ENDPOINT, sid,

{"mouseClick": {"x": 410, "y": 535, "button": "LEFT"}},

region=default_region, credentials=creds, browser_id=browser_id

)

print(f"クリックステータス：{r.status_code}, アクション：{r.json()['result']}")

エージェントはダイアログが閉じられたことを確認するためにもう一度スクリーンショットを取得し、ワークフローは続行されます。

セッションの停止とリソースのクリーンアップ

ワークフローが完了したら、セッションを停止してリソースをクリーンアップします:

stop_session(BEDROCK_AGENTCORE_DP_ENDPOINT, sid, browser_id, region=default_region, credentials=creds)

ブラウザリソースと IAM ロールを削除するには:

browser_boto3.delete_browser(browserId=browser_id)

print(f"ブラウザ {browser_id} を削除しました")

from helpers.utils import delete_agentcore_execution_role, SAMPLE_ROLE_NAME

delete_agentcore_execution_role(SAMPLE_ROLE_NAME)

結論

構築を開始するには:

Amazon Bedrock AgentCore の開発者ガイドに従ってください
ハンドオンウォークスルーには、対応するノートブックをお試しください
ブラウザ自動化に関するより広い文脈については、Amazon Bedrock AgentCore Browser のドキュメントをご覧ください

著者について

image

Evandro Franco

image

Phelipe Fabres

Phelipe Fabres氏は、AWS for Startups における生成 AI のシニアソリューションアーキテクトです

原文を表示

How OS Level Actions work

Agent sends an action. This can be a mouse click, key press, or shortcut using InvokeBrowser.

AgentCore executes the action on the full OS desktop and returns SUCCESS or FAILED.

Agent requests a screenshot to observe the current screen state.

AgentCore captures the full desktop, including native dialogs, OS modals, and UI outside the browser window, and returns a base64-encoded PNG.

Agent reasons about the screenshot sending it to a vision model to determine what happened and what to do next.

Agent sends the next action based on what it observed, continuing the loop.

Supported actions

OS Level Actions are organized into three categories: mouse control, keyboard input, and visual capture. The following table summarizes eight actions with their fields and constraints.

Action

Required fields

Optional fields

Notes

mouseClick

—

x, y, button, clickCount

Defaults to current position, LEFT, single click. clickCount: 1–10.

mouseMove

x, y

—

Moves cursor to coordinates.

mouseDrag

endX, endY

startX, startY, button

Drags from start to end. button defaults to LEFT.

mouseScroll

—

x, y, deltaX, deltaY

deltaY negative = scroll down. Range: -1000 to 1000.

keyType

text

—

Types a string. Max 10,000 characters.

keyPress

key

presses

Presses a key N times. presses: 1–100, defaults to 1.

keyShortcut

keys

—

Key combination array. Up to five keys, for example, [“ctrl”, “a”].

screenshot

—

format

Captures full OS desktop. Returns base64-encoded PNG.

Mouse actions

Keyboard actions

To select the entire text, for example, you would use keyShortcut with ["ctrl", "a"].

code

{
  "action": {
    "keyShortcut": {
      "keys": ["ctrl", "a"]
    }
  }
}

Screenshot

code

{
   "action":{
      "screenshot":{
         "format":"PNG"
      }
   }
}

Getting started

The following examples walk through the action-screenshot-reaction loop, matching the companion notebook. For the full working notebook with eight actions demonstrated end to end, start there.

Set up clients and create a browser

You need two clients: a control plane client (bedrock-agentcore-control) for managing browser resources, and a data plane client (bedrock-agentcore) for dispatching actions during a session.

code

import boto3
import time

browser_boto3 = boto3.client('bedrock-agentcore-control', region_name='us-west-2')

BROWSER_NAME = "browser_with_os_actions"

code

from helpers.utils import create_agentcore_execution_role, SAMPLE_ROLE_NAME

execution_role_arn = create_agentcore_execution_role(SAMPLE_ROLE_NAME)

With the role created, create a custom browser:

code

created_browser = browser_boto3.create_browser(
    name=BROWSER_NAME,
    executionRoleArn=execution_role_arn,
    networkConfiguration={
        'networkMode': 'PUBLIC'
    }
)

browser_id = created_browser['browserId']
print(f"Browser ID: {browser_id}")

Start a browser session

code

# These helpers are included in the companion notebook repository
from helpers.browser import get_credentials, invoke, start_session, stop_session

creds, default_region = get_credentials()
BEDROCK_AGENTCORE_DP_ENDPOINT = f"https://bedrock-agentcore.{default_region}.amazonaws.com/"

sid = start_session(BEDROCK_AGENTCORE_DP_ENDPOINT, browser_id, region=default_region, credentials=creds)

# Wait for session to initialize — adjust if needed for your environment
time.sleep(3)

The start_session helper sends a SigV4-signed PUT request to create the session and returns the sessionId. The invoke helper handles signing and dispatching individual actions.
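The fixed three-second sleep in the session start code is the simplest way to wait, but the right delay varies by environment. One alternative is to poll until the session responds. The sketch below is generic: wait_until is a hypothetical helper, and the probe you pass in (for example, a function that requests a screenshot and checks for a successful response) is your own code, not part of the notebook helpers.

```python
import time
from typing import Callable


def wait_until(probe: Callable[[], bool], timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Call probe() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Using time.monotonic() rather than time.time() keeps the deadline correct even if the system clock is adjusted during the wait.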

Invoke an OS-level action

With the session running, you can dispatch OS-level actions through the invoke helper. Each call takes a single action — in this case, a left click at coordinates (600, 370) on the screen:

code

r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"mouseClick": {"x": 600, "y": 370, "button": "LEFT"}},
    region=default_region, credentials=creds, browser_id=browser_id
)

print(f"Mouse click status: {r.status_code}, action: {r.json()['result']}")

Capture a screenshot

After each action, the agent must observe what happened. The screenshot action captures the full desktop and returns the image as a base64-encoded PNG:

code

import base64
from IPython.display import Image, display

r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"screenshot": {"format": "PNG"}},
    region=default_region, credentials=creds, browser_id=browser_id
)

img_bytes = base64.b64decode(r.json()['result']['screenshot']['data'])
display(Image(img_bytes))
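Because mouse coordinates are interpreted in the screen's coordinate space, it can help to confirm the screenshot's dimensions before passing coordinates derived from it back to the service. The sketch below reads the width and height directly from the PNG header using only the standard library. png_size is a hypothetical helper; the byte offsets follow the PNG format, in which an 8-byte signature is followed by the IHDR chunk.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(b64_data: str) -> tuple:
    """Return (width, height) of a base64-encoded PNG without decoding the pixels."""
    raw = base64.b64decode(b64_data)
    # Bytes 0-7: signature, 8-11: chunk length, 12-15: chunk type, 16-23: width and height.
    if raw[:8] != PNG_SIGNATURE or raw[12:16] != b"IHDR":
        raise ValueError("data is not a PNG image")
    width, height = struct.unpack(">II", raw[16:24])
    return width, height
```

If the reported size differs from the viewport you expect, coordinates returned by a vision model may need rescaling before use.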

Putting it together: dismissing a print dialog

code

r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"screenshot": {"format": "PNG"}},
    region=default_region, credentials=creds, browser_id=browser_id
)

# Send the screenshot to a vision model to identify the dialog and locate the Cancel button.
# The vision model integration depends on your agent architecture — see the Bedrock
# InvokeModel API for how to send images to Claude or other models.
# The model returns coordinates, e.g.: {"x": 410, "y": 535}

The vision model identifies the print dialog and returns the coordinates of the Cancel button. The agent selects it:

code

r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"mouseClick": {"x": 410, "y": 535, "button": "LEFT"}},
    region=default_region, credentials=creds, browser_id=browser_id
)

print(f"Click status: {r.status_code}, action: {r.json()['result']}")

The agent takes another screenshot to confirm that the dialog was dismissed, and the workflow continues.
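Model output should be treated as untrusted input before it becomes a click. The sketch below parses a reply in the {"x": ..., "y": ...} shape used in the example and rejects coordinates outside the screen, since out-of-range values are rejected by the service with a ValidationException. click_from_model_reply is a hypothetical helper, and it assumes the model replies with bare JSON; real replies may need extra parsing.

```python
import json


def click_from_model_reply(reply: str, width: int, height: int, button: str = "LEFT") -> dict:
    """Turn a model reply such as '{"x": 410, "y": 535}' into a mouseClick action."""
    point = json.loads(reply)
    x, y = int(point["x"]), int(point["y"])
    # Valid coordinates run from 0 to width - 1 and 0 to height - 1.
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"({x}, {y}) is outside the {width}x{height} screen")
    return {"mouseClick": {"x": x, "y": y, "button": button}}
```

The returned dictionary has the same shape as the action passed to the invoke helper in the examples above.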

Stop the session and clean up

When the workflow is done, stop the session and clean up resources:

code

stop_session(BEDROCK_AGENTCORE_DP_ENDPOINT, sid, browser_id, region=default_region, credentials=creds)
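If an action raises an exception partway through a workflow, the stop call above is never reached and the session stays open until it times out. One way to guard against that is a context manager. This is a sketch under assumptions: browser_session is a hypothetical wrapper, and the start and stop arguments are callables you supply, for example lambdas around the notebook's start_session and stop_session helpers.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(start, stop):
    """Start a session and guarantee stop(session_id) runs, even on errors."""
    session_id = start()
    try:
        yield session_id
    finally:
        stop(session_id)
```

Inside a with browser_session(...) as sid: block, the rest of the workflow can use sid as before, and cleanup no longer depends on every step succeeding.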

To delete the browser resource and IAM role:

code

browser_boto3.delete_browser(browserId=browser_id)
print(f"Browser {browser_id} deleted")

from helpers.utils import delete_agentcore_execution_role, SAMPLE_ROLE_NAME
delete_agentcore_execution_role(SAMPLE_ROLE_NAME)

Conclusion

To start building:

Follow the Amazon Bedrock AgentCore Developer Guide

Try the companion notebook for a hands-on walkthrough

For broader context on browser automation, see the Amazon Bedrock AgentCore Browser documentation

About the authors

Evandro Franco

Phelipe Fabres

Phelipe Fabres is a Sr. Solutions Architect for Generative AI at AWS for Start
