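AWS Machine Learning Blog · March 26, 2026, 03:57 · approx. 15 min read

Enabling large-scale video analysis with multimodal models on Amazon Bedrock

#MultimodalAI #FoundationModels #VideoAnalysis #Amazon Bedrock #OpenSource #CloudAI

TL;DR

AWS provides an open source solution that uses the multimodal foundation models on Amazon Bedrock to analyze video at scale through three approaches (frame-based, shot-based, and multimodal embedding), addressing the limitations of traditional methods.

AI deep analysis · March 26, 2026, 04:41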

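Key points

Limits of traditional video analysis

Traditional approaches based on manual review or rule-based systems suffer from limited scalability, limited flexibility, a lack of contextual understanding, and integration complexity.

A paradigm shift driven by multimodal foundation models

The multimodal foundation models on Amazon Bedrock process visual and textual information together, which lets them understand scenes, generate natural language descriptions, answer questions about video content, and detect nuanced events that are difficult to define programmatically.

Three approaches to video understanding

The solution offers three distinct workflows, chosen according to the use case and its cost-performance trade-offs: frame-based (precision at scale), shot-based (narrative flow), and multimodal embedding (semantic video search).

Open source implementation

The complete solution is published on GitHub as an AWS sample, which makes it easier to deploy and customize.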

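Impact analysis

By offering a practical solution for large-scale video analysis, this post could accelerate the adoption of AI in fields such as security surveillance, media production, social platforms, and enterprise communications. The concrete architecture and open source implementation from AWS lower the barrier to implementation and are an important step toward putting multimodal AI into practical use.

Editorial comment

The concrete implementation examples and the clear comparison of the three approaches provide useful guidance for putting multimodal AI into practice. The open source release also invites community contributions and further evolution.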

Video content is now everywhere, from security surveillance and media production to social platforms and enterprise communications. However, extracting meaningful insights from large volumes of video remains a major challenge. Organizations need solutions that can understand not only what appears in a video, but also the context, narrative, and underlying meaning of the content.

In this post, we explore how the multimodal foundation models (FMs) of Amazon Bedrock enable scalable video understanding through three distinct architectural approaches. Each approach is designed for different use cases and cost-performance trade-offs. The complete solution is available as an open source AWS sample on GitHub.

The evolution of video analysis

Traditional video analysis approaches rely on manual review or basic computer vision techniques that detect predefined patterns. While functional, these methods face significant limitations:

Scale constraints: Manual review is time-consuming and expensive

Limited flexibility: Rule-based systems can’t adapt to new scenarios

Context blindness: Traditional CV lacks semantic understanding

Integration complexity: Difficult to incorporate into modern applications

The emergence of multimodal foundation models on Amazon Bedrock changes this paradigm. These models can process both visual and textual information together. This enables them to understand scenes, generate natural language descriptions, answer questions about video content, and detect nuanced events that would be difficult to define programmatically.

Three approaches to video understanding

Understanding video content is inherently complex, combining visual, auditory, and temporal information that must be analyzed together for meaningful insights. Different use cases, such as media scene analysis, ad break detection, IP camera tracking, or social media moderation, require distinct workflows with varying cost, accuracy, and latency trade-offs. This solution provides three distinct workflows, each using different video extraction methods optimized for specific scenarios.

Frame-based workflow: precision at scale

The frame-based approach samples image frames at fixed intervals, removes similar or redundant frames, and applies image understanding foundation models to extract visual information at the frame level. Audio transcription is performed separately using Amazon Transcribe.
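The fixed-interval sampling step can be illustrated with a minimal sketch. The function names and the rounding choice below are assumptions made for illustration and are not taken from the solution's source code; the sketch only computes which timestamps and frame indices a sampler would visit.

```python
def sample_timestamps(duration_s: float, interval_s: float) -> list[float]:
    """Return the timestamps (in seconds) at which frames would be sampled.

    Hypothetical helper: the real solution may sample differently.
    """
    if duration_s < 0 or interval_s <= 0:
        raise ValueError("duration must be >= 0 and interval must be > 0")
    timestamps = []
    index = 0
    # Multiply instead of accumulating so floating-point error does not drift.
    while index * interval_s < duration_s:
        timestamps.append(round(index * interval_s, 3))
        index += 1
    return timestamps


def frame_index(timestamp_s: float, fps: float) -> int:
    """Map a timestamp to the nearest frame index for a given frame rate."""
    return int(round(timestamp_s * fps))
```

For a 10-second clip sampled every 2.5 seconds, this yields four timestamps starting at 0 seconds. A shorter interval captures more detail but sends more frames to the foundation model, which is the cost the deduplication step is meant to offset.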

This workflow is ideal for:

Security and surveillance: Detect specific conditions or events across time

Quality assurance: Monitor manufacturing or operational processes

Compliance monitoring: Verify adherence to safety protocols

The architecture uses AWS Step Functions to orchestrate the entire pipeline.
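To make the orchestration idea concrete, here is a rough sketch of how such a pipeline might be expressed in Amazon States Language, built as a Python dictionary. Everything specific in it is an assumption: the state names, the choice to run the visual and audio branches in parallel, the `$.frames` path, the concurrency value, and the placeholder ARNs are all hypothetical and do not come from the published repository.

```python
import json


def build_frame_pipeline(sample_arn, dedupe_arn, analyze_arn, transcribe_arn, merge_arn):
    """Build a hypothetical state machine definition for a frame-based pipeline."""
    visual_branch = {
        "StartAt": "SampleFrames",
        "States": {
            "SampleFrames": {"Type": "Task", "Resource": sample_arn, "Next": "DeduplicateFrames"},
            "DeduplicateFrames": {"Type": "Task", "Resource": dedupe_arn, "Next": "AnalyzeFrames"},
            # Map state: run the analysis task once per remaining frame.
            "AnalyzeFrames": {
                "Type": "Map",
                "ItemsPath": "$.frames",
                "MaxConcurrency": 10,
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "INLINE"},
                    "StartAt": "AnalyzeFrame",
                    "States": {
                        "AnalyzeFrame": {"Type": "Task", "Resource": analyze_arn, "End": True}
                    },
                },
                "End": True,
            },
        },
    }
    # Audio is transcribed separately from the visual analysis.
    audio_branch = {
        "StartAt": "TranscribeAudio",
        "States": {
            "TranscribeAudio": {"Type": "Task", "Resource": transcribe_arn, "End": True}
        },
    }
    return {
        "Comment": "Illustrative frame-based video analysis pipeline (not the real definition)",
        "StartAt": "ExtractInParallel",
        "States": {
            "ExtractInParallel": {
                "Type": "Parallel",
                "Branches": [visual_branch, audio_branch],
                "Next": "MergeResults",
            },
            "MergeResults": {"Type": "Task", "Resource": merge_arn, "End": True},
        },
    }


def to_asl_json(definition: dict) -> str:
    """Serialize the definition to the JSON text that Step Functions accepts."""
    return json.dumps(definition, indent=2)
```

The sketch shows the general shape of the orchestration (tasks chained with `Next`, fan-out with a `Map` state, independent work in a `Parallel` state) rather than the layout the authors actually chose.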

Smart sampling: optimizing cost and quality

A key feature of the frame-based workflow is intelligent frame deduplication, which significantly reduces processing costs by removing redundant frames while preserving visual information. The solution provides two distinct similarity comparison methods.

Nova Multimodal Embeddings (MME) Comparison uses the multimodal embeddings model of Amazon Nova to generate 256-dimensional vector representations of each frame. Each frame is encoded into a vector embedding using the Nova MME model, and the cosine distance between consecutive frames is computed. Frames with distance below the threshold (default 0.2, where lower values indicate higher similarity) are removed. This approach excels at semantic understanding of image content, remaining robust to minor variations in lighting and perspective while capturing high-level visual concepts. However, it incurs additional Amazon Bedrock API costs for embedding generation and adds slightly higher latency per frame. This method is recommended for content where semantic similarity matters more than pixel-level differences, such as detecting scene changes or identifying unique moments.
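The embedding comparison can be sketched in a few lines. This is one literal reading of the description, assuming the embeddings have already been obtained and that each frame is compared with the frame immediately before it; the function names are hypothetical, and the toy vectors used to exercise it are far shorter than the 256 dimensions mentioned above.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity; 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def deduplicate(embeddings: list[list[float]], threshold: float = 0.2) -> list[int]:
    """Return the indices of frames to keep.

    A frame is dropped when its distance to the previous frame is below
    the threshold (that is, when the two frames are very similar).
    """
    kept = []
    for i, embedding in enumerate(embeddings):
        if i == 0 or cosine_distance(embeddings[i - 1], embedding) >= threshold:
            kept.append(i)
    return kept
```

A variant would compare each frame with the most recently kept frame instead of its immediate predecessor, which avoids discarding a slow drift frame by frame; which of the two the solution uses is not stated here.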

OpenCV ORB (Oriented FAST and Rotated BRIEF) takes a computer vision approach, using feature detection to identify and match key points between consecutive frames without requiring external API calls. ORB detects key points and computes binary descriptors for each frame, calculating the similarity score as the ratio of matched features to total key points. With a default threshold of 0.325 (where higher values indicate higher similarity), this method offers fast processing with minimal latency and no additional API costs. The rotation-invariant feature matching makes it excellent for detecting camera movement and frame transitions. However, it can be sensitive to significant lighting changes and may not capture semantic similarity as effectively as embedding-based approaches. This method is recommended for static camera scenarios like surveillance footage, or cost-sensitive applications where pixel-level similarity is sufficient.
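The similarity score described above (matched features divided by total key points) can be illustrated without OpenCV. The following is a pure-Python stand-in, not the solution's code: binary descriptors are represented as integers, matching is a greedy nearest-neighbour search on Hamming distance, and the `max_distance` cutoff, the choice of denominator, and the rule that a score at or above the threshold marks a frame as redundant are all assumptions. In practice the descriptors would come from OpenCV's ORB detector.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def match_ratio(desc_a: list[int], desc_b: list[int], max_distance: int = 64) -> float:
    """Ratio of matched descriptors to total key points.

    'Total key points' is taken to be the larger of the two frame counts,
    which is an assumption made for this sketch.
    """
    if not desc_a or not desc_b:
        return 0.0
    matched = 0
    unused = list(desc_b)
    for descriptor in desc_a:
        best = min(unused, key=lambda c: hamming(descriptor, c), default=None)
        if best is not None and hamming(descriptor, best) <= max_distance:
            matched += 1
            unused.remove(best)  # each descriptor may be matched only once
    return matched / max(len(desc_a), len(desc_b))


def is_redundant(desc_prev: list[int], desc_curr: list[int], threshold: float = 0.325) -> bool:
    """Treat the current frame as redundant when similarity reaches the threshold."""
    return match_ratio(desc_prev, desc_curr) >= threshold
```

Because everything runs locally on descriptor bits, this style of comparison incurs no API calls, which is the cost advantage the text attributes to the ORB method.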

Shot-based workflow: understanding narrative flow

Instead of sampling individual frames, the shot-based workflow segments video into short clips (shots) or fixed-duration segments and applies video understanding foundation models to each segment. This approach captures temporal context within each shot while maintaining the flexibility to process longer videos.

By generating both semantic labels and embeddings for each shot, this method enables efficient video search and retrieval while balancing accuracy and flexibility. The architecture groups shots into batches of 10 for parallel processing in subsequent steps, improving throughput while managing AWS Lambda concurrency limits.
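The batching step is simple enough to sketch directly. The function name is hypothetical; the batch size of 10 is the value stated above.

```python
def batch_shots(shots: list, batch_size: int = 10) -> list[list]:
    """Group shots into consecutive batches; the last batch may be smaller."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [shots[i:i + batch_size] for i in range(0, len(shots), batch_size)]
```

With 23 shots this produces batches of 10, 10, and 3. Each batch can then be handed to one parallel worker, so the number of concurrent invocations grows with the number of batches rather than the number of shots.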

This workflow excels at:

Media production: Analyze footage for chapter markers and scene descriptions

Content cataloging: Automatically tag and organize video libraries

Highlight generation: Identify key moments in long-form content

Video segmentation: two approaches

The shot-based workflow provides flexible segmentation options to match different video characteristics and use cases. The system downloads the video file from Amazon Simple Storage Service (Amazon S3) to temporary storage in AWS Lambda, then applies the selected segmentation algorithm based on the configuration parameters.

OpenCV Scene Detection automatically divides a video into segments based on visual changes in the content. This approach uses the PySceneDetect library to detect transitions such as cuts, camera changes, or significant shifts in visual content.

By identifying natural scene boundaries, the system keeps related moments grouped together. This makes the method particularly effective for edited or narrative-driven videos such as movies, TV shows, presentations, and vlogs, where scenes represent meaningful units of content. Because segmentation follows the structure of the video itself, segment lengths can vary depending on the pacing and editing style.
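The idea behind content-based scene detection can be sketched as thresholding a per-frame change score. This is a simplified illustration, not PySceneDetect's implementation: it assumes the frame-to-frame change scores have already been computed, and the threshold and minimum scene length are hypothetical values chosen for the example.

```python
def find_scenes(
    change_scores: list[float],
    fps: float,
    threshold: float = 30.0,
    min_scene_frames: int = 15,
) -> list[tuple[float, float]]:
    """Split a video into scenes, returned as (start_s, end_s) pairs.

    change_scores[i] is the visual difference between frame i and frame i + 1.
    """
    total_frames = len(change_scores) + 1
    cuts = [0]
    for i, score in enumerate(change_scores):
        cut_frame = i + 1  # a new scene would begin at the later frame
        # Require a minimum scene length so brief flashes do not create cuts.
        if score >= threshold and cut_frame - cuts[-1] >= min_scene_frames:
            cuts.append(cut_frame)
    boundaries = cuts + [total_frames]
    return [(start / fps, end / fps) for start, end in zip(boundaries, boundaries[1:])]
```

Segment lengths fall out of the content itself: a fast-cut trailer yields many short scenes, while a single-take recording yields one long one.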

Fixed-Duration Segmentation divides a video into equal-length time intervals, regardless of what is happening in the video.

Each segment covers a consistent duration (for example, 10 seconds), creating predictable and uniform clips. This approach streamlines processing and improves processing time and cost estimations. Although it might split scenes mid-action, fixed-duration segmentation works well for continuous recordings such as surveillance footage, sports events, or live streams, where regular time sampling is more important than preserving narrative boundaries.
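Fixed-duration segmentation reduces to simple arithmetic, as the following sketch shows. The function name is an assumption, and the handling of the final partial segment (kept, but shortened to the end of the video) is one reasonable choice rather than the documented behavior.

```python
def fixed_segments(duration_s: float, segment_s: float = 10.0) -> list[tuple[float, float]]:
    """Split [0, duration_s) into equal-length (start, end) intervals."""
    if duration_s < 0 or segment_s <= 0:
        raise ValueError("duration must be >= 0 and segment length must be > 0")
    segments = []
    index = 0
    while index * segment_s < duration_s:
        start = index * segment_s
        end = min(start + segment_s, duration_s)  # last segment may be shorter
        segments.append((start, end))
        index += 1
    return segments
```

Because the segment count depends only on the video length, processing time and cost can be estimated before any analysis runs, which is the predictability benefit described above.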

Multimodal embedding: semantic video search

Multimodal embedding represents an emerging approach to video understanding, particularly powerful for video semantic search applications. The solution offers workflows using Amazon Nova Multimodal Embedding and TwelveLabs Marengo models available on Amazon Bedrock.

These workflows enable:

Natural language search: Find video segments using text queries

Visual similarity search: Locate content using reference images

Cross-modal retrieval: Bridge the gap between text and visual content
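Once every segment has an embedding, all three kinds of search come down to ranking segments by similarity to a query vector. The sketch below assumes the query (text or image) has already been embedded into the same vector space as the segments; the function names and the in-memory list standing in for a vector store are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity in [-1, 1]; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def search(
    query_embedding: list[float],
    segment_index: list[tuple[str, list[float]]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Return the top_k (segment_id, score) pairs, best match first."""
    scored = [
        (segment_id, cosine_similarity(query_embedding, embedding))
        for segment_id, embedding in segment_index
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

A production system would replace the linear scan with a vector search service, but the ranking principle is the same.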

The architecture supports both embedding models with a unified interface.

Understanding cost and performance trade-offs

One of the key challenges in production video analysis is managing costs while maintaining quality. The solution provides built-in token usage tracking and cost estimation to help you make informed decisions about model selection and workflow configuration.

The screenshot in the original post (not reproduced here) shows a sample cost estimate generated by the solution to illustrate the format. It should not be used as a pricing source. For each processed video, you receive a detailed cost breakdown by model type, covering Amazon Bedrock foundation models and Amazon Transcribe for audio transcription. With this visibility, you can tune your configuration to your specific requirements and budget constraints.
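A per-model cost breakdown of this kind can be approximated by aggregating token usage records. The record fields, the function name, and the price table below are hypothetical, and the prices used to exercise the function are placeholders, not real Amazon Bedrock rates.

```python
from collections import defaultdict


def estimate_costs(
    usage_records: list[dict],
    price_per_1k_tokens: dict[str, dict[str, float]],
) -> dict[str, float]:
    """Sum estimated cost per model.

    usage_records: dicts with 'model', 'input_tokens', 'output_tokens'.
    price_per_1k_tokens: {model: {'input': price, 'output': price}}.
    """
    totals: dict[str, float] = defaultdict(float)
    for record in usage_records:
        prices = price_per_1k_tokens[record["model"]]
        totals[record["model"]] += (
            record["input_tokens"] / 1000 * prices["input"]
            + record["output_tokens"] / 1000 * prices["output"]
        )
    return dict(totals)
```

Transcription is billed by audio duration rather than by tokens, so a full estimate would add a separate duration-based term for Amazon Transcribe.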

System architecture

The complete solution is built on AWS serverless services, providing scalability and cost-efficiency.

The architecture includes:

Extraction Service: Orchestrates frame-based and shot-based workflows using Step Functions

Nova Service: Backend for Nova Multimodal Embedding with vector search

TwelveLabs Service: Backend for Marengo embedding models with vector search

Agent Service: AI assistant powered by Amazon Bedrock Agents for workflow recommendations

Frontend: React application served using Amazon CloudFront for user interaction

Analytics Service: Sample notebooks demonstrating downstream analysis patterns

Accessing your video metadata

The solution stores extracted metadata in multiple formats for flexible access:

Amazon S3: Raw foundation model outputs, complete task metadata, and processed assets organized by task ID and data type.

Amazon DynamoDB: Structured, queryable data optimized for retrieval by video, timestamp, or analysis type across multiple tables for different services.

Programmatic API: Direct invocation for automation, bulk processing, and integration into existing pipelines.

You can use this flexible access model to integrate the tool into your workflows—whether conducting exploratory analysis in notebooks, building automated pipelines, or developing production applications.

Real-world use cases

The solution includes sample notebooks demonstrating three common scenarios:

IP Camera Event Detection: Automatically monitor surveillance footage for specific events or conditions without constant human oversight.

Media Chapter Analysis: Segment long-form video content into logical chapters with automatic descriptions and metadata.

Social Media Content Moderation: Review user-generated video content at scale to ensure that platform guidelines are met.

These examples provide starting points that you can extend and customize for your specific use cases.

Getting started

Deploy the solution

The solution is available as a CDK package on GitHub and can be deployed to your AWS account with only a few commands. The deployment creates all necessary resources including:

Step Functions state machines for orchestration

Lambda functions for processing logic

DynamoDB tables for metadata storage

S3 buckets for asset storage

CloudFront distribution for the web interface

Amazon Cognito user pool for authentication

After deployment, you can immediately start uploading videos, experimenting with different analysis pipelines and foundation models, and comparing performance across configurations.

Conclusion

Video understanding is no longer limited to organizations with specialized computer vision teams and infrastructure. The multimodal foundation models of Amazon Bedrock, combined with AWS serverless services, make sophisticated video analysis accessible and cost-effective.

Whether you’re building security monitoring systems, media production tools, or content moderation platforms, the three architectural approaches demonstrated in this solution provide flexible starting points designed for different requirements. The key is choosing the right approach for your use case: frame-based for precision monitoring, shot-based for narrative content, and embedding-based for semantic search.

As multimodal models continue to evolve, we will see even more sophisticated video understanding capabilities emerge. The future is about AI that doesn’t only see video frames, but truly understands the story they tell.

Ready to get started?

Try the hands-on workshop for guided learning

Explore the GitHub repository for deployment instructions and source code

Learn more:

Amazon Bedrock

Amazon Bedrock Multimodal Models

AWS Step Functions

Amazon Transcribe

About the authors

Lana Zhang

Lana Zhang is a Senior Specialist Solutions Architect for Generative AI at AWS within the Worldwide Specialist Organization. She specializes in AI/ML, with a focus on use cases such as AI voice assistants and multimodal understanding. She works closely with customers across diverse industries, including media and entertainment, gaming, sports, advertising, financial services, and healthcare, to help them transform their business solutions through AI.

Sharon Li

Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.
