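TLDR AI · June 18, 2026 09:00 · About a 13-minute read

Building AI Agents for AR Glasses and XR Devices with NVIDIA XR AI

#XR Agents #NVIDIA Cosmos #Nemotron #NeMo Agent Toolkit #Multimodal AI #Edge Computing

TL;DR

NVIDIA has released XR AI, an open source foundation for building AI agents for XR devices, in public beta, and has laid out a concrete architecture and examples for hands-on task assistance that combines vision, voice, and tool integration.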

AI Deep Analysis · June 18, 2026 17:07


Depth: 40%

Key points

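Closing the development infrastructure gap with XR AI

NVIDIA XR AI has been open sourced as a reusable foundation for AR glasses and wearable devices that integrates camera and microphone streams, multimodal models, and enterprise data connectivity.

Capabilities and architecture of intelligent XR agents

Developers can build agents that understand the user's field of view, recognize intent from speech, and call enterprise tools to respond. These agents combine visual grounding with Cosmos, voice interaction with Nemotron models, and orchestration with NeMo Agent Toolkit.

Concrete use cases in healthcare and manufacturing

Research teams at Stanford University and Princeton University are exploring XR AI to support stem cell therapy research, and Siemens is exploring it for maintenance assistance and troubleshooting for factory engineers.

Modular XR AI architecture

Separating media transport, model services, tool access, and related concerns reduces unnecessary inference and data movement and lets developers swap components flexibly.

Multi-user and multi-agent support

Participant identity serves as the routing boundary, so multiple clients can connect to the same hub, multiple agents can observe the same streams, and each response is routed to the correct participant.

Broad feature support and beta release

XR AI is available in public beta with a broad feature set, including visual grounding with Cosmos, language reasoning with Nemotron, and enterprise tool integration through MCP.

Flexible model architecture

Because workers reference logical service names through configuration, developers can swap endpoints or adopt cloud-hosted models without changing agent logic.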


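Impact analysis

Against the backdrop of maturing AR/XR hardware, this release could substantially lower the barrier to building practical AI agents. In particular, if hands-free workflows for collaborating with AI become standard in complex frontline work such as healthcare and manufacturing, the adoption rate and ROI of industrial XR could improve considerably.

Editorial comment

This is a notable milestone: mature hardware and AI agent technology are converging, and practical use in industrial settings is becoming realistic. The open source foundation in particular lowers the barrier to entry for developers, which is a positive for the industry as a whole.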

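Developers building for AR glasses and wearable devices face an infrastructure gap. The hardware is ready, but creating AI experiences requires integrating live camera and microphone streams, multimodal AI models, enterprise data, tool use, deployment infrastructure, and device-specific runtimes.

NVIDIA XR AI is designed to address this challenge by providing a reusable foundation for connecting extended reality (XR) devices to GPU-accelerated AI services running in the cloud, data center, workstation, or edge.

XR AI is now publicly available in beta, giving developers access to an open source library for building intelligent agents for AI glasses, AR glasses, and XR headsets. These intelligent XR agents can see what users see, understand spoken or typed intent, call enterprise tools, and respond within the same XR session. They can help frontline team members find the right information, guide workers through procedures, verify outcomes, and capture evidence.

XR AI brings intelligence to people where they work, whether in field service, remote assistance, industrial operations, healthcare, training, or other hands-busy environments.

NVIDIA partners in healthcare and manufacturing provide useful examples of how this pattern can be applied. Researchers in the Cong Lab at the Stanford School of Medicine and the Wang Lab at Princeton University have explored XR and AI workflows for stem cell therapy research, helping researchers access contextual information and interact with laboratory systems while remaining focused on complex procedures.

In manufacturing, Siemens is exploring in a research context how NVIDIA XR AI and NVIDIA DGX Spark can help factory engineers find maintenance information, troubleshoot issues, verify work, and capture what happened on the shop floor.

This post walks through the process of building an intelligent XR agent for your use case. It also explores how XR AI combines visual grounding using NVIDIA Cosmos, voice-first interaction with NVIDIA Nemotron models, enterprise connectivity using Model Context Protocol (MCP), and flexible agent orchestration with frameworks such as NVIDIA NeMo Agent Toolkit.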

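Components and architecture of an intelligent XR agent

An intelligent XR agent starts with live context from the user's XR device. Camera frames, microphone audio, and data messages flow into the XR Media Hub, where they can be routed to models, tools, and agents that understand the user's environment and intent. NVIDIA Cosmos models provide visual grounding; NVIDIA Nemotron models provide language understanding, reasoning, and tool calling; and MCP servers expose enterprise tools and data sources. Agent frameworks such as NVIDIA NeMo Agent Toolkit can orchestrate workflows across models and tools, while NVIDIA CloudXR can add rendered spatial content when an application needs rich 3D interaction.

XR AI keeps this architecture modular by separating media transport, model services, tool access, agent orchestration, and client delivery. Video pixels can remain in shared memory while lightweight metadata moves through the system, so agents retrieve image data only when a task requires it. This reduces unnecessary model inference and data movement while letting developers swap clients, models, MCP servers, orchestration frameworks, and deployment environments without rebuilding the entire agent.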
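The "metadata moves, pixels stay put" idea can be illustrated with a small Python sketch. This is not XR AI's API: `FrameMeta`, `FrameStore`, and `handle_request` are hypothetical names, an in-process dictionary stands in for shared memory, and the keyword check is only a placeholder for whatever logic decides that a task needs vision.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FrameMeta:
    """Lightweight descriptor that travels through the system instead of pixels."""
    participant_id: str
    frame_id: int
    width: int
    height: int


class FrameStore:
    """Hypothetical stand-in for a shared-memory region holding raw pixels."""

    def __init__(self) -> None:
        self._pixels: Dict[int, bytes] = {}
        self.fetch_count = 0  # how many times pixel data was actually read

    def put(self, participant_id: str, frame_id: int,
            width: int, height: int, pixels: bytes) -> FrameMeta:
        # Store the pixels once and hand back only the small descriptor.
        self._pixels[frame_id] = pixels
        return FrameMeta(participant_id, frame_id, width, height)

    def fetch(self, meta: FrameMeta) -> bytes:
        self.fetch_count += 1
        return self._pixels[meta.frame_id]


def handle_request(text: str, meta: FrameMeta, store: FrameStore) -> str:
    """Fetch pixels only when the request needs vision (toy keyword check)."""
    needs_vision = any(word in text.lower() for word in ("see", "look"))
    if not needs_vision:
        return "text-only reply"
    pixels = store.fetch(meta)
    return f"vision reply over {len(pixels)} bytes"


store = FrameStore()
meta = store.put("alice", 1, 2, 2, bytes(2 * 2 * 3))  # 2x2 RGB frame
print(handle_request("ping", meta, store))
print(handle_request("What do you see?", meta, store))
print(store.fetch_count)
```

Only the second request touches the pixel buffer, which is the property the paragraph above describes: text-only turns cost no image transfer or vision inference.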

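The same design also supports multi-user and multi-agent scenarios. Participant identity acts as the routing boundary: multiple clients can connect to the same hub, multiple agents can observe the same streams, and each response is routed back to the correct participant. This pattern enables one foundation to support visual understanding, voice interaction, enterprise tool use, real-time reasoning, context-aware XR responses, and flexible deployment across AI glasses, AR glasses, XR headsets, mobile devices, web clients, and CloudXR-powered experiences.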
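A minimal sketch of participant identity as the routing boundary might look like the following. `MediaHub`, `echo_agent`, and `length_agent` are hypothetical names used only for illustration; the real XR Media Hub handles live media and networking that this toy omits.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# An agent takes (participant_id, message) and returns a reply string.
Agent = Callable[[str, str], str]


class MediaHub:
    """Toy hub: fans messages out to agents, routes replies by participant ID."""

    def __init__(self) -> None:
        self._agents: List[Agent] = []
        self.outbox: Dict[str, List[str]] = defaultdict(list)

    def add_agent(self, agent: Agent) -> None:
        self._agents.append(agent)

    def publish(self, participant_id: str, message: str) -> None:
        # Every agent observes the same message; each reply goes back only
        # to the participant whose input produced it.
        for agent in self._agents:
            self.outbox[participant_id].append(agent(participant_id, message))


def echo_agent(participant_id: str, message: str) -> str:
    return f"echo:{message}"


def length_agent(participant_id: str, message: str) -> str:
    return f"len:{len(message)}"


hub = MediaHub()
hub.add_agent(echo_agent)
hub.add_agent(length_agent)
hub.publish("alice", "hello")
hub.publish("bob", "hi")
print(hub.outbox["alice"])
print(hub.outbox["bob"])
```

Keying the outbox by participant ID keeps responses separated even though both agents see every stream.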

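Get started

XR AI is now available in public beta. The following sections walk through how you can use XR AI to quickly get to a working intelligent XR agent, including:

Live camera, microphone, and device data streams
Real-time multimodal interaction
Visual grounding through Cosmos-powered vision language models (VLMs)
Voice interaction through speech recognition and Nemotron models
Enterprise connectivity through Model Context Protocol (MCP)
Searchable visual knowledge capture and retrieval workflows
Optional agent orchestration through NeMo Agent Toolkit or other frameworks
Optional CloudXR-rendered spatial content

While implementation details vary across industries, the underlying architecture remains largely the same.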

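Build your first intelligent XR agent with the public beta

Step 1. Clone the XR AI repository

The GitHub repository includes sample agents, model-server launchers, MCP servers, web clients, XR workflows, and the core media infrastructure. The quickest way to understand the system is to start with a simple multimodal agent and then add capabilities one layer at a time.

bash

git clone https://github.com/NVIDIA/xr-ai.git
cd xr-ai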

Step 2. Start the AI services

The larger examples use shared AI services that can be started independently:

bash

cd agent-samples/model-servers
uv sync
uv run model_servers

This starts the model processes used by the heavier demos and leaves the weights loaded in the background.

In the current repository, the model server stack includes:

nvidia/parakeet-tdt-0.6b-v3 for speech-to-text
nvidia/Cosmos-Reason1-7B for vision-language reasoning
nvidia/Llama-3.1-Nemotron-Nano-8B-v1 for fast, latency-sensitive language responses
NVIDIA-Nemotron-3-Nano-30B-A3B for deeper tool-calling workflows

The agent-sdk/xr-ai-models package keeps the model layer flexible. Workers reference logical services such as llm, agent_llm, vlm, stt, and tts through configuration, letting developers swap endpoints, use cloud-hosted models, or bring OpenAI-compatible APIs without changing agent logic.
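The logical-service indirection can be sketched in a few lines of Python. This is an illustration of the pattern, not the xr-ai-models API: `ServiceEndpoint`, `load_services`, and `describe_llm` are hypothetical names, and the ports, the example.com URL, and the "hosted-llm" model name are placeholders rather than values from the repository.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ServiceEndpoint:
    base_url: str
    model: str


def load_services(config: Dict[str, Dict[str, str]]) -> Dict[str, ServiceEndpoint]:
    """Turn a config mapping into endpoints keyed by logical service name."""
    return {name: ServiceEndpoint(**entry) for name, entry in config.items()}


# Hypothetical local deployment (ports are placeholders).
LOCAL = {
    "llm": {"base_url": "http://localhost:8001/v1",
            "model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"},
    "vlm": {"base_url": "http://localhost:8002/v1",
            "model": "nvidia/Cosmos-Reason1-7B"},
}

# Same logical names, but "llm" now points at a hypothetical hosted endpoint.
HOSTED = {**LOCAL, "llm": {"base_url": "https://example.com/v1",
                           "model": "hosted-llm"}}


def describe_llm(services: Dict[str, ServiceEndpoint]) -> str:
    """Agent-side code refers only to the logical name 'llm'."""
    endpoint = services["llm"]
    return f"{endpoint.model} @ {endpoint.base_url}"


print(describe_llm(load_services(LOCAL)))
print(describe_llm(load_services(HOSTED)))
```

`describe_llm` is unchanged between the two runs; only the configuration differs, which is the property that lets endpoints be swapped without touching agent logic.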

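The core AI services to power visual understanding, speech recognition, language reasoning, and voice responses are now in place.

Step 3. Run a sensor-first XR agent

Start the simplest working agent:

bash

cd agent-samples/simple-vlm-example
uv sync
uv run simple_vlm_example

When the service starts, it prints a web client URL and authentication token.

Open the web client, connect, and send a prompt such as "ping" or ask a question through the microphone.

The workflow is straightforward:

The client streams camera, microphone, and data messages.
XR AI routes media through the XR Media Hub.
Speech is converted to text.
The latest camera frame is analyzed using the Cosmos-powered VLM path.
The agent generates a response.
The response returns as both text and synthesized audio.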
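The turn described by the steps above can be sketched as a single function with the three model calls injected. Everything here is an assumption for illustration: `run_turn`, `AgentReply`, and the `fake_*` stubs are hypothetical and stand in for the real speech-to-text, VLM, and text-to-speech services, and the sample's actual code may be organized differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentReply:
    text: str
    audio: bytes


def run_turn(audio_in: bytes, latest_frame: bytes,
             stt: Callable[[bytes], str],
             vlm: Callable[[bytes, str], str],
             tts: Callable[[str], bytes]) -> AgentReply:
    """One interaction turn: speech -> text -> grounded answer -> text + audio."""
    question = stt(audio_in)              # speech is converted to text
    answer = vlm(latest_frame, question)  # latest frame analyzed with the question
    return AgentReply(text=answer, audio=tts(answer))  # reply as text and audio


# Stub services so the sketch runs without any model servers.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_vlm(frame: bytes, question: str) -> str:
    return f"Answer to '{question}' using a {len(frame)}-byte frame"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply = run_turn(b"what is this part?", b"\x00" * 16, fake_stt, fake_vlm, fake_tts)
print(reply.text)
```

Passing the services in as callables mirrors the modular design: any of the three can be replaced without changing the turn logic.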

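This is now a working intelligent XR agent. It can listen, understand what the user sees, reason over visual context, and respond through the same session using both text and speech.

Before adding enterprise systems, RAG pipelines, or spatial rendering, this validates the most important capability: real-time multimodal interaction grounded in the user's environment.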

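Step 4. Connect enterprise data through MCP

Most enterprise agents need more than live perception. A researcher may need protocol steps, experiment metadata, or dataset access. A field technician may need maintenance records. A manufacturing engineer may need work instructions, controller state, or digital-twin information. XR AI uses Model Context Protocol (MCP) as the integration layer for these workflows.

The repository includes MCP servers for XR-specific capabilities:

vlm-mcp: visual question answering
video-mcp: video analysis and queries
render-mcp: scene manipulation
oxr-mcp: OpenXR spatial information
vec-mcp: vector and spatial utilities
transcript-mcp: transcript ingestion and retrieval

Developers can also build custom MCP servers for enterprise systems, retrieval-augmented generation (RAG), databases, digital twins, asset-management systems, and domain-specific workflows.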
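To give a feel for what a custom MCP server exposes, here is a simplified, in-process sketch of the two tool-related JSON-RPC methods, `tools/list` and `tools/call`. It is not a complete MCP server: the initialization handshake, transport, and input schemas are omitted, and a real server would normally be built with an MCP SDK. The tool name `get_maintenance_record` and its record data are made up for illustration.

```python
import json
from typing import Any, Callable, Dict

TOOLS: Dict[str, Dict[str, Any]] = {}


def tool(name: str, description: str) -> Callable:
    """Register a function as a callable tool under the given name."""
    def register(fn: Callable) -> Callable:
        TOOLS[name] = {"fn": fn, "description": description}
        return fn
    return register


@tool("get_maintenance_record", "Look up the latest maintenance note for an asset.")
def get_maintenance_record(asset_id: str) -> str:
    # Made-up example data standing in for an enterprise system.
    records = {"pump-7": "Seal replaced; next inspection due in 30 days."}
    return records.get(asset_id, "no record found")


def handle(request_json: str) -> str:
    """Answer a JSON-RPC 2.0 request for tools/list or tools/call."""
    req = json.loads(request_json)
    if req["method"] == "tools/list":
        result = {"tools": [{"name": n, "description": t["description"]}
                            for n, t in TOOLS.items()]}
    elif req["method"] == "tools/call":
        params = req["params"]
        text = TOOLS[params["name"]]["fn"](**params["arguments"])
        result = {"content": [{"type": "text", "text": text}]}
    else:
        error = {"code": -32601, "message": "Method not found"}
        return json.dumps({"jsonrpc": "2.0", "id": req["id"], "error": error})
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})


call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {"name": "get_maintenance_record",
                   "arguments": {"asset_id": "pump-7"}}}
print(handle(json.dumps(call)))
```

An agent first lists the tools to learn what is available, then calls one by name with structured arguments; the same shape applies whether the tool wraps a database, a digital twin, or a RAG pipeline.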

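Many organizations are also interested in capturing and understanding visual information from the physical world. An XR agent can observe procedures, inspections, maintenance activities, or research workflows, then use technologies such as NVIDIA Video Search and Summarization (VSS) to index, summarize, and retrieve that information later. Over time, this creates a searchable visual knowledge base that can support reporting, training, compliance, operational reviews, and retrieval-augmented generation workflows.

This is where the agent begins to move beyond perception and into enterprise action and organizational memory.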

Step 5. Add agent orchestration

The following example is adapted from the NeMo Agent Toolkit MCP client workflow pattern. In practice, this configuration would live inside a NeMo Agent Toolkit workflow definition and enable the agent to discover tools exposed by XR AI MCP servers.

function_groups:
  xr_tools:
    _type: mcp_client
    server:
      transport: streamable-http

workflow:
  _type: react_agent
  tool_names:
    - xr_tools
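The configuration links its two halves by name: the entry under tool_names must match a key under function_groups. The following Python sketch checks that link on a dictionary mirroring the YAML. The helper `unresolved_tool_names` is hypothetical and is not part of NeMo Agent Toolkit, which performs its own validation when it loads a workflow.

```python
from typing import Any, Dict, List

# Dictionary form of the configuration shown above.
config: Dict[str, Any] = {
    "function_groups": {
        "xr_tools": {
            "_type": "mcp_client",
            "server": {"transport": "streamable-http"},
        },
    },
    "workflow": {"_type": "react_agent", "tool_names": ["xr_tools"]},
}


def unresolved_tool_names(cfg: Dict[str, Any]) -> List[str]:
    """Return tool names in the workflow that no function group defines."""
    known = set(cfg.get("function_groups", {}))
    return [name for name in cfg["workflow"].get("tool_names", [])
            if name not in known]


print(unresolved_tool_names(config))
```

An empty list means every tool name resolves; a typo such as "xr_tool" would show up in the returned list before the agent is ever started.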

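The important point is not the framework itself, but that XR AI provides a consistent foundation for real-time media, multimodal perception, and enterprise connectivity while letting developers choose the orchestration approach that best fits their environment.

Developers interested in more advanced orchestration workflows should review the NeMo Agent Toolkit documentation, which includes detailed examples for MCP integration, tool calling, multi-agent systems, and RAG-based workflows.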

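Step 6. Add CloudXR-rendered spatial experiences

Not every XR workflow requires rendered 3D content. Some agents only need a camera, microphone, language, and enterprise tools. When a workflow benefits from spatial visualization, XR AI can pair the agent layer with NVIDIA CloudXR.

bash

cd agent-samples/xr-render-demo
uv sync
uv run xr_render_demo

This workflow launches the XR Media Hub, CloudXR runtime, model services, MCP servers, and an agent worker.

The agent can call rendering tools through MCP to create, update, and manipulate objects in a user's spatial environment. CloudXR streams the resulting experience from GPU infrastructure to the client device.

The demo also shows a useful production pattern. A smaller model handles rapid acknowledgments and status updates while a larger model performs deeper reasoning and tool use. Users receive immediate feedback while more complex operations continue in the background. At this stage, the XR agent can interact with both the physical environment and rendered spatial content.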
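The fast-acknowledgment pattern can be sketched with asyncio. This is an assumption-laden illustration rather than the demo's code: `fast_model` and `deep_model` are stubs for the small and large models, and the short sleep merely simulates slower reasoning and tool calls.

```python
import asyncio
from typing import Callable, List


async def fast_model(prompt: str) -> str:
    """Stub for a small model that acknowledges immediately."""
    return f"Working on: {prompt}"


async def deep_model(prompt: str) -> str:
    """Stub for a larger model; the sleep simulates reasoning and tool use."""
    await asyncio.sleep(0.05)
    return f"Done: {prompt}"


async def handle(prompt: str, send: Callable[[str], None]) -> None:
    deep_task = asyncio.create_task(deep_model(prompt))  # start in the background
    send(await fast_model(prompt))                       # acknowledge right away
    send(await deep_task)                                # deliver the full result


def run_demo(prompt: str) -> List[str]:
    sent: List[str] = []
    asyncio.run(handle(prompt, sent.append))
    return sent


print(run_demo("place a marker on the valve"))
```

The user-facing channel receives the acknowledgment first and the detailed result second, while the slower work proceeds concurrently rather than blocking the session.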

You now have a working intelligent XR agent, ready to customize for your use case. To learn more, or to discuss a deeper partnership, reach out to NVIDIA.

Get the code.

Read the NVIDIA XR AI documentation.

Learn more about building agents with NeMo Agent Toolkit.

Learn more about spatial streaming using CloudXR.

