AWS Machine Learning Blog·2026年3月13日 06:20·約2分

Amazon Bedrockの推論ワークロード向け新CloudWatchメトリクスでTTFTと推定クォータ消費量の可視性を向上

#生成AI #クラウドインフラ #運用監視 #AWS #モデル推論 #レイテンシ最適化

TL;DR

AWSはAmazon Bedrockの推論ワークロード向けに、ストリーミング応答の初期待ち時間と推定クォータ消費量を可視化する2つの新CloudWatchメトリクス「TimeToFirstToken」と「EstimatedTPMQuotaUsage」を発表した。

AI深層分析2026年3月13日 07:42

注目/ 5段階

深度40%

キーポイント

新メトリクスの導入目的

生成AIワークロードの運用可視性向上のため、ストリーミング応答の初期待ち時間（TTFT）とクォータ消費量のサーバーサイド監視を実現する。

メトリクスの特徴

追加費用なしで自動的に発行され、API変更やオプトイン不要。既存のメトリクスではカバーできなかったTTFTとトークン消費係数を考慮した実効クォータ消費量を補完する。

実用的な活用方法

アラーム設定、ベースライン確立、キャパシティのプロアクティブ管理に活用でき、レイテンシ敏感なアプリケーションや高スループットワークロードの運用改善に寄与する。

既存メトリクスとの関係

AWS/Bedrockネームスペースの既存メトリクス（呼び出し数、レイテンシ、エラー率、トークン使用量など）を補完し、より包括的な運用監視基盤を提供する。

ConverseStream APIによるストリーミング応答の監視

ConverseStream APIを使用すると、ストリーミング応答中にEstimatedTPMQuotaUsage（クォータ消費量）とTimeToFirstToken（最初のトークンまでの遅延）のメトリクスをリアルタイムで取得できます。

二つの主要なCloudWatchメトリクス

EstimatedTPMQuotaUsageはクォータ消費量を倍精度浮動小数点数で示し、TimeToFirstTokenはリクエストから最初のストリーミングトークンまでの遅延をミリ秒単位で測定します。

API別のメトリクス出力の違い

ConverseStreamとInvokeModelWithResponseStream（ストリーミングAPI）ではEstimatedTPMQuotaUsageとTimeToFirstTokenの両方が出力されるが、ConverseとInvokeModel（非ストリーミングAPI）ではEstimatedTPMQuotaUsageのみが出力される。

影響分析・編集コメントを表示

影響分析

この発表は、生成AIの本格的な運用段階において、パフォーマンス監視とリソース管理の重要性が高まっていることを反映している。特にストリーミング応答のユーザー体験向上と予期せぬスロットリング回避に直接寄与する実用的な機能強化であり、AWSの生成AIプラットフォーム競争力向上に貢献する。

編集コメント

AWSの生成AIプラットフォーム競争力強化の一環として、運用監視面での実用的な機能追加。特にストリーミング応答のTTFT監視は、チャットボットなどのリアルタイムアプリケーションにとって重要な改善点。

python

import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.converse(
    modelId='us.anthropic.claude-sonnet-4-6-v1',
    messages=[
        {
            'role': 'user',
            'content': [{'text': 'What is the capital of France?'}]
        }
    ]
)

print(response['output']['message']['content'][0]['text'])
print(f"Input tokens: {response['usage']['inputTokens']}")
print(f"Output tokens: {response['usage']['outputTokens']}")

ConverseStream (ストリーミング)

以下の例ではConverseStream APIを使用しています。このストリーミング呼び出しでは、クォータ消費量を示すEstimatedTPMQuotaUsage（倍精度浮動小数点数値）と、リクエストから最初のストリームトークンまでのレイテンシを測定するTimeToFirstToken（ミリ秒単位の値）の両方が出力されます。

python

import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.converse_stream(
    modelId='us.anthropic.claude-sonnet-4-6-v1',
    messages=[
        {
            'role': 'user',
            'content': [{'text': 'What is the capital of France?'}]
        }
    ]
)

for event in response['stream']:
    if 'contentBlockDelta' in event:
        print(event['contentBlockDelta']['delta']['text'], end='')
print()

同じメトリクスはInvokeModel（非ストリーミング）およびInvokeModelWithResponseStream（ストリーミング）APIでも出力されます。以下の表は各APIが出力するメトリクスをまとめたものです：

API	出力メトリクス
Converse	EstimatedTPMQuotaUsage
ConverseStream	EstimatedTPMQuotaUsage, TimeToFirstToken
InvokeModel	EstimatedTPMQuotaUsage
InvokeModelWithResponseStream	EstimatedTPMQuotaUsage, TimeToFirstToken

これらのリクエストを実行した後、メトリクスが表示されるまで約1〜2分待機し、CloudWatchコンソールでメトリクス > すべてのメトリクス > AWS/Bedrockに移動して、モデルのデータポイントが存在することを確認してください。

AWS CLIを使用したメトリクスのクエリ

AWS CLIを使用して、新しいメトリクスが利用可能であることを確認し、その値を取得できます。まず、モデルに対してメトリクスが公開されていることを確認します：

bash

# 利用可能なTimeToFirstTokenメトリクスを一覧表示
aws cloudwatch list-metrics --namespace AWS/Bedrock --metric-name TimeToFirstToken

# 利用可能なEstimatedTPMQuotaUsageメトリクスを一覧表示
aws cloudwatch list-metrics --namespace AWS/Bedrock --metric-name EstimatedTPMQuotaUsage

原文を表示

As organizations scale their generative AI workloads on Amazon Bedrock, operational visibility into inference performance and resource consumption becomes critical. Teams running latency-sensitive applications must understand how quickly models begin generating responses. Teams managing high-throughput workloads must understand how their requests consume quota so they can avoid unexpected throttling. Until now, gaining this visibility required custom client-side instrumentation or reactive troubleshooting after issues occurred.

Today, we’re announcing two new Amazon CloudWatch metrics for Amazon Bedrock, TimeToFirstToken and EstimatedTPMQuotaUsage. These metrics give you server-side visibility into streaming latency and quota consumption. These metrics are automatically emitted for every successful inference request at no additional cost, with no API changes or opt-in required. They are available now in the AWS/Bedrock CloudWatch namespace.

In this post, we cover the following:

Why visibility into time-to-first-token latency and quota consumption matters for production AI workloads

How the new TimeToFirstToken and EstimatedTPMQuotaUsage metrics work

How to get started using these metrics to set alarms, establish baselines, and proactively manage capacity.

Amazon Bedrock already provides a set of CloudWatch metrics to help you monitor your inference workloads. The AWS/Bedrock namespace includes metrics such as Invocations, InvocationLatency, InvocationClientErrors, InvocationThrottles, InputTokenCount, and OutputTokenCount. These metrics provide visibility into request volume, end-to-end latency, error rates, and token usage. These metrics are available across the Converse, ConverseStream, InvokeModel, and InvokeModelWithResponseStream APIs and can be filtered by the ModelId dimension. While these metrics provide a strong operational foundation, they leave two important gaps: they don’t capture how quickly a streaming response begins (time-to-first-token), and they don’t reflect the effective quota consumed by a request after accounting for token burndown multipliers. The two new metrics announced today address exactly these gaps.

Observability needs of production AI Inference workloads

In streaming inference applications, like chatbots, coding assistants, or real-time content generation, the time it takes for the model to return its first token directly affects perceived responsiveness. A delay in the first token directly impacts the perceived responsiveness of your application, even when overall throughput remains within acceptable ranges. However, measuring this server-side metric previously required you to instrument your application code to capture timestamps around API calls. This added complexity and potentially introduced measurement inaccuracies that don’t reflect the actual service-side behavior.

Quota management presents a different but equally important challenge. Amazon Bedrock applies token burndown multipliers for certain models. This means that the effective quota consumed by a request can differ from the raw token counts that you see in billing metrics. For example, Anthropic Claude models, including Claude Sonnet 4.6, Claude Opus 4.6, Claude Sonnet 4.5, and Claude Opus 4.5, apply a 5x burndown multiplier on output tokens for quota purposes. This means a request that produces 100 output tokens effectively consumes 500 tokens of your Tokens Per Minute (TPM) quota. You are only billed for your actual token usage. Without visibility into this calculation, throttling can appear unpredictable, making it difficult to set appropriate alarms or plan capacity increases ahead of time. For customers using cross-Region inference profiles, these challenges are compounded because you need per-inference-profile visibility to understand performance and consumption across geographic and global configurations.

Understanding newly introduced metrics

The following diagram shows where each metric is captured during the lifecycle of a streaming inference request.

TimeToFirstToken:

The TimeToFirstToken metric measures the latency, in milliseconds, from when Amazon Bedrock receives your streaming request to when the service generates the first response token. This metric is emitted for the streaming APIs: ConverseStream and InvokeModelWithResponseStream. Because this metric is measured server-side, it reflects actual service-side latency without noise from network conditions or client-side processing.

With this metric, you can:

Set latency alarms – Create CloudWatch alarms that notify you when time-to-first-token exceeds acceptable thresholds, so you can detect performance degradation before it impacts your users.

Establish SLA baselines – Analyze historical TimeToFirstToken data across models to set informed performance baselines for your applications.

Diagnose performance issues – Correlate TimeToFirstToken with other Amazon Bedrock metrics such as InvocationLatency (Time to Last Token) to isolate whether latency issues stem from initial model response time or overall request processing.

The metric is published with the ModelId dimension, and optionally ServiceTier and ResolvedServiceTier dimensions. For cross-Region inference profiles, the ModelId corresponds to your inference profile identifier (for example, us.anthropic.claude-sonnet-4-5-v1), so you can monitor TimeToFirstToken separately for each profile. This metric is emitted only for successfully completed streaming requests.

EstimatedTPMQuotaUsage:

The EstimatedTPMQuotaUsage metric tracks the estimated Tokens Per Minute (TPM) quota consumed by your requests. Unlike raw token counts, this metric accounts for the factors that Amazon Bedrock uses when evaluating quota consumption, including cache write tokens and output token burndown multipliers. The metric name includes “Estimated” to reflect that it provides a close approximation of your quota consumption for monitoring and capacity planning purposes. The internal throttling decisions of Amazon Bedrock are based on real-time calculations that might differ slightly from this metric, but EstimatedTPMQuotaUsage is designed to give you a reliable, actionable signal. It should be accurate enough to set alarms, track consumption trends, and plan quota increases with confidence. This metric is emitted across all inference APIs, including Converse, InvokeModel, ConverseStream, and InvokeModelWithResponseStream.

Understanding the quota consumption formula:

The formula for calculating estimated quota consumption depends on your throughput type:

On-demand inference:

EstimatedTPMQuotaUsage = InputTokenCount + CacheWriteInputTokens + (OutputTokenCount × burndown_rate)

The burndown rate varies by model — for the full list of models and their applicable rates, refer to Token burndown multipliers for quota management. For on-demand inference, cache read tokens don’t count toward the quota.

As an example, Claude Sonnet 4.5 has a 5x burndown rate on output tokens. An on-demand request with 1,000 input tokens, 200 cache write tokens, and 100 output tokens consumes 1,000 + 200 + (100 × 5) = 1,700 tokens of quota. This is 400 more than you might estimate from raw token counts alone.

Provisioned Throughput (reserved tier):

EstimatedTPMQuotaUsage = InputTokenCount + (CacheWriteInputTokens × 1.25) + (CacheReadInputTokens × 0.1) + OutputTokenCount

For Provisioned Throughput, the burndown multiplier on output tokens does not apply. However, cache read tokens contribute at a 0.1 rate and cache write tokens are weighted at 1.25.

Note that billing differs from quota usage — you are billed for actual token usage, not the burndown-adjusted or weighted amounts. For more details, refer to Token burndown multipliers for quota management.

With this metric, you can:

Set proactive quota alarms – Create CloudWatch alarms that trigger when your estimated quota usage approaches your TPM limit, so you can act before requests are throttled.

Track consumption across models – Compare quota usage across different models to understand which workloads are consuming the most capacity and optimize accordingly.

Plan quota increases – Use historical consumption trends to request quota increases through AWS service quotas before your usage growth leads to throttling.

Metric dimensions and filtering

Both metrics share the following characteristics:

The metrics include dimensions such as ModelId, allowing you to filter and aggregate data per model. When you use a cross-Region inference profile, whether geographic (for example, us.anthropic.claude-sonnet-4-5-v1) or global (for example, global.anthropic.claude-sonnet-4-5-v1), the ModelId dimension corresponds to your inference profile identifier. This means that you can view separate metrics for each cross-Region inference profile and model combination. This gives you granular visibility into performance and consumption across your inference configurations.

This is consistent with existing Amazon Bedrock CloudWatch metrics like *Invocations, *InvocationLatency*,* and token count metrics.

Attribute

TimeToFirstToken

EstimatedTPMQuotaUsage

CloudWatch namespace

AWS/Bedrock

Unit

Milliseconds

Count

Supported APIs

ConverseStream, InvokeModelWithResponseStream

Converse, InvokeModel, ConverseStream, InvokeModelWithResponseStream

Update frequency

1-minute aggregation

Scope

Successfully completed requests

Primary dimension

ModelId

Optional dimensions

ServiceTier, ResolvedServiceTier

ServiceTier, ResolvedServiceTier, ContextWindow (for input contexts exceeding 200K tokens)

Supported inference types

Cross-Region inference (geographic and global), in-Region inference

Cross-Region inference (geographic and global), in-region inference

Getting started

These metrics are already available in your CloudWatch dashboard. When your application calls an Amazon Bedrock inference API, the service processes the request, invokes the model, and publishes all applicable metrics — including TimeToFirstToken and EstimatedTPMQuotaUsage — to the AWS/Bedrock namespace in your CloudWatch account. You can then use CloudWatch dashboards, alarms, and metric math to monitor, alert on, and analyze these metrics. Complete the following steps to start using them:

Open the Amazon CloudWatch console and navigate to Metrics > All metrics.

Select the AWS/Bedrock namespace.

Find the TimeToFirstToken or EstimatedTPMQuotaUsage metrics and filter by ModelId to view data for specific models.

Create alarms to get notified of latency degradation or quota consumption approaching your limits.

Make inference requests and observe the new metrics

To generate metric data points, make inference requests against Amazon Bedrock. The following examples use the AWS SDK for Python (Boto3) to demonstrate a non-streaming request (which emits EstimatedTPMQuotaUsage) and a streaming request (which emits both EstimatedTPMQuotaUsage and TimeToFirstToken).

In these examples, we use us-east-1 as the AWS Region and us.anthropic.claude-sonnet-4-6-v1 as a cross-Region inference profile. Replace these with your own Region and model or inference profile ID.

Converse (non-streaming)

The following example uses the Converse API. This non-streaming call emits the EstimatedTPMQuotaUsage metric in CloudWatch under the AWS/Bedrock namespace.

code

import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.converse(
    modelId='us.anthropic.claude-sonnet-4-6-v1',
    messages=[
        {
            'role': 'user',
            'content': [{'text': 'What is the capital of France?'}]
        }
    ]
)

print(response['output']['message']['content'][0]['text'])
print(f"Input tokens: {response['usage']['inputTokens']}")
print(f"Output tokens: {response['usage']['outputTokens']}")

ConverseStream (streaming)

The following example uses the ConverseStream API. This streaming call emits both EstimatedTPMQuotaUsage (value in double) for quota consumption and TimeToFirstToken (value in milliseconds) measuring latency from request to the first streamed token.

code

import Boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.converse_stream(
    modelId='us.anthropic.claude-sonnet-4-6-v1',
    messages=[
        {
            'role': 'user',
            'content': [{'text': 'What is the capital of France?'}]
        }
    ]
)

for event in response['stream']:
    if 'contentBlockDelta' in event:
        print(event['contentBlockDelta']['delta']['text'], end='')
print()

The same metrics are emitted for the InvokeModel (non-streaming) and InvokeModelWithResponseStream (streaming) APIs. The following table summarizes which metrics each API emits:

API

Emitted metrics

Converse

EstimatedTPMQuotaUsage

ConverseStream

EstimatedTPMQuotaUsage, TimeToFirstToken

InvokeModel

EstimatedTPMQuotaUsage

InvokeModelWithResponseStream

EstimatedTPMQuotaUsage, TimeToFirstToken

After making these requests, allow approximately 1–2 minutes for the metrics to appear, then navigate to the CloudWatch console under Metrics > All metrics > AWS/Bedrock to verify that the data points are present for your model.

Query metrics using the AWS CLI

You can use the AWS CLI to verify that the new metrics are available and retrieve their values. First, confirm that the metrics are being published for your model:

List available TimeToFirstToken metrics

aws cloudwatch list-metrics --namespace AWS/Bedrock --metric-name TimeToFirstToken

List avai

この記事をシェア

The Decoder重要度42026年3月11日 01:10

アマゾン、AI生成コードのヒューマンフィルターとして上級エンジニアを配置

The Decoder重要度42026年3月11日 04:11

アマゾン、PerplexityのAIショッピングエージェントを差し止める仮処分命令を取得

TechCrunch AI重要度42026年3月11日 05:10

Amazonが自社ウェブサイトとアプリで医療AIアシスタントを開始

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

import boto3 bedrock = boto3.client('bedrock-runtime', region_name='us-east-1') response = bedrock.converse( modelId='us.anthropic.claude-sonnet-4-6-v1', messages=[ { 'role': 'user', 'content': [{'text': 'What is the capital of France?'}] } ] ) print(response['output']['message']['content'][0]['text']) print(f"Input tokens: {response['usage']['inputTokens']}") print(f"Output tokens: {response['usage']['outputTokens']}")

import boto3 bedrock = boto3.client('bedrock-runtime', region_name='us-east-1') response = bedrock.converse_stream( modelId='us.anthropic.claude-sonnet-4-6-v1', messages=[ { 'role': 'user', 'content': [{'text': 'What is the capital of France?'}] } ] ) for event in response['stream']: if 'contentBlockDelta' in event: print(event['contentBlockDelta']['delta']['text'], end='') print()

API

出力メトリクス

Converse

EstimatedTPMQuotaUsage

ConverseStream

EstimatedTPMQuotaUsage, TimeToFirstToken

InvokeModel

EstimatedTPMQuotaUsage

InvokeModelWithResponseStream

EstimatedTPMQuotaUsage, TimeToFirstToken

# 利用可能なTimeToFirstTokenメトリクスを一覧表示 aws cloudwatch list-metrics --namespace AWS/Bedrock --metric-name TimeToFirstToken # 利用可能なEstimatedTPMQuotaUsageメトリクスを一覧表示 aws cloudwatch list-metrics --namespace AWS/Bedrock --metric-name EstimatedTPMQuotaUsage

import Boto3 bedrock = boto3.client('bedrock-runtime', region_name='us-east-1') response = bedrock.converse_stream( modelId='us.anthropic.claude-sonnet-4-6-v1', messages=[ { 'role': 'user', 'content': [{'text': 'What is the capital of France?'}] } ] ) for event in response['stream']: if 'contentBlockDelta' in event: print(event['contentBlockDelta']['delta']['text'], end='') print()

Amazon Bedrockの推論ワークロード向け新CloudWatchメトリクスでTTFTと推定クォータ消費量の可視性を向上

キーポイント

影響分析

編集コメント

AWS CLIを使用したメトリクスのクエリ

Observability needs of production AI Inference workloads

Understanding newly introduced metrics

TimeToFirstToken:

EstimatedTPMQuotaUsage:

Understanding the quota consumption formula:

On-demand inference:

Provisioned Throughput (reserved tier):

Metric dimensions and filtering

Getting started

Make inference requests and observe the new metrics

Query metrics using the AWS CLI

List available TimeToFirstToken metrics

List avai

関連記事

Amazon Bedrockの推論ワークロード向け新CloudWatchメトリクスでTTFTと推定クォータ消費量の可視性を向上

キーポイント

影響分析

編集コメント

AWS CLIを使用したメトリクスのクエリ

Observability needs of production AI Inference workloads

Understanding newly introduced metrics

TimeToFirstToken:

EstimatedTPMQuotaUsage:

Understanding the quota consumption formula:

On-demand inference:

Provisioned Throughput (reserved tier):

Metric dimensions and filtering

Getting started

Make inference requests and observe the new metrics

Query metrics using the AWS CLI

List available TimeToFirstToken metrics

List avai

関連記事