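NVIDIA Developer Blog · April 2, 2026, 00:00 · about 1 min read

Accelerating Token Production in AI Factories with Unified Services and Real-Time AI

#AI factory management #Modular architecture #NVIDIA Mission Control 3.0 #Multi-tenant isolation #Predictive AIOps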

TL;DR

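To improve the operational efficiency of AI factories, NVIDIA announced Mission Control 3.0, an integrated management platform that aims to maximize token production through a modular architecture and predictive AIOps.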

AI deep analysis · April 2, 2026, 01:42

Importance: / 5

Depth: 40%

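Key points

Shift to a modular, API-driven architecture

The previously tightly coupled stack is replaced with a layered, API-driven design, which reduces hardware dependencies and enables rapid support for the latest GPUs as well as customization.

Organizational isolation and virtualization in multi-tenant environments

Management services are decoupled from physical nodes and deployed on a KVM-based virtual platform, enabling secure multi-tenant operation and isolation between organizations in large-scale AI factories.

Predictive AIOps and intelligent power orchestration

Predictive AIOps detects operational anomalies in advance, and the domain power service optimizes around rack-level power constraints. Together they maximize token production efficiency and resource utilization.

Easier ecosystem integration for OEMs and ISVs

Open components and a modular design make it easier for OEMs and ISVs to integrate Mission Control capabilities directly into their own management ecosystems.

Software-defined, virtualized control plane

The Mission Control management plane moves to a KVM-based virtualization platform and is decoupled from physical nodes, yielding a flexible architecture that combines shared networking with isolation between tenants.

TCO reduction through multi-tenancy

Multiple organizations can safely use shared infrastructure, which holds down the cost of buying and operating separate clusters while still providing strong isolation and self-service.

Power management as a first-class scheduling primitive

Mission Control 3.0 integrates the domain power service into the scheduler, anticipating power constraints and optimizing workload placement across Slurm, Kubernetes, and Run:ai environments.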


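Impact analysis

Companies running AI at large scale need to move past the physical limits of hardware and optimize resources through software-defined orchestration. This release redefines the industry standard for AI factory operations management and shifts the focus of next-generation infrastructure competition from hardware performance to software control.

Editorial comment

As competition on hardware performance reaches its peak, software-defined operations management platforms become the next differentiator. NVIDIA is encouraging integration into its own ecosystem while trying to establish an industry-standard management interface.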


In today’s AI factory environment, performance is not theoretical. It is economic, competitive, and existential. A 1% drop in usable GPU time can mean millions of tokens lost per hour. Minutes of congestion can cascade into hours of recovery. Rack-level power oversubscription can lead to stranded power and reduced tokens per watt, silently eroding factory output at scale. As AI factories scale to thousands of GPUs running diverse mission-critical workloads, the cost of unpredictable congestion, power constraints, long-tail latency, and limited visibility grows exponentially.

Operations teams and administrators need more than dashboards. They need flexibility and foresight.

NVIDIA launched NVIDIA Mission Control as an integrated software stack for AI factories built on NVIDIA reference architectures, codifying NVIDIA best practices with a unified control plane. Mission Control version 3.0 expands further, introducing architectural flexibility, multi-org isolation, intelligent power orchestration and predictive AIOps to detect anomalies in operations and maximize token production.

Figure 1. NVIDIA Mission Control provides a validated software stack with services for operational agility, monitoring, and resiliency.

Flexible software that unlocks velocity

NVIDIA Mission Control 3.0 introduces a layered, API-driven architecture built on modular services. It replaces the previously tightly coupled stacks, which required synchronized releases and complex validation across hardware platforms. New components such as automated network management and domain power service, which provides a new management plane for power optimizations, extend the Mission Control stack by bringing additional modular services into a single control plane.

Combining open components with a modular design enables rapid support for the latest NVIDIA hardware while allowing OEM system providers and independent software vendors (ISVs) to integrate Mission Control capabilities directly into their own ecosystems. As a result, enterprises have more flexibility and choice in their software stacks, making it easier to customize solutions to meet their unique business and technology challenges.

Isolation in a multi-tenant world

One technological challenge many organizations face is supporting multi-org isolation within a centralized AI factory. As AI factories evolve from research and experimentation into production-grade, mission-critical environments, shared infrastructure across multiple teams requires strong organizational isolation and secure multi-tenancy.

The enhanced Mission Control control plane transforms the AI factory management stack into a software-defined, virtualized architecture. Mission Control services are decoupled from physical management nodes and deployed on Kernel-based Virtual Machine (KVM) platforms using NVIDIA-provided automation. While compute racks and management nodes are dedicated per org, network switches are shared and require additional isolation for multi-tenancy. The shared fabric of NVIDIA Spectrum-X Ethernet is logically segmented using VXLAN, and NVIDIA Quantum InfiniBand is segmented using partition keys (PKeys).

Figure 2. A multi-org deployment with NVIDIA Mission Control uses virtualization and a dedicated compute and control plane for each organization requiring network isolation.

This architecture reduces the physical management infrastructure footprint, establishes hard tenant isolation, and creates a secure foundation for multi-organization AI factories. It also lowers the total cost of ownership: operators can onboard multiple orgs onto shared infrastructure, which reduces the need to buy and operate multiple clusters and shrinks the physical footprint, while still providing each org with strong isolation and self-service.
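The per-org segmentation described above can be pictured as simple bookkeeping: each organization gets its own VXLAN network identifier (VNI) on the Ethernet fabric and its own partition key on the InfiniBand fabric. The sketch below is illustrative only. `SegmentAllocator`, `OrgSegment`, and the starting values are hypothetical names and numbers, not part of Mission Control or any NVIDIA API. It relies only on two general facts: a VNI is 24 bits wide, and a PKey is 16 bits wide with the high bit marking full membership.

```python
from dataclasses import dataclass

VNI_MAX = 2**24 - 1          # VXLAN network identifiers are 24 bits wide
PKEY_FULL_MEMBER = 0x8000    # high bit of a 16-bit PKey marks full membership
PKEY_DEFAULT_BASE = 0x7FFF   # base value of the default partition, never handed out


@dataclass(frozen=True)
class OrgSegment:
    org: str
    vni: int    # Ethernet (VXLAN) segment for this org
    pkey: int   # InfiniBand partition key for this org, full membership


class SegmentAllocator:
    """Hypothetical allocator: one VNI and one PKey per organization."""

    def __init__(self, first_vni: int = 10_000, first_pkey: int = 0x0001):
        self._next_vni = first_vni
        self._next_pkey = first_pkey
        self._segments = {}

    def allocate(self, org: str) -> OrgSegment:
        if org in self._segments:            # repeated requests return the same segment
            return self._segments[org]
        if self._next_vni > VNI_MAX:
            raise RuntimeError("VNI space exhausted")
        if self._next_pkey >= PKEY_DEFAULT_BASE:
            raise RuntimeError("PKey space exhausted")
        segment = OrgSegment(org, self._next_vni, self._next_pkey | PKEY_FULL_MEMBER)
        self._next_vni += 1
        self._next_pkey += 1
        self._segments[org] = segment
        return segment
```

Keeping both identifiers in one record per org makes it easy to audit that no two tenants share a segment on either fabric.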

Power: The invisible constraint

Another growing concern for AI factory token production is fixed power envelopes due to economic constraints such as fixed utilities and regulatory compliance. Each GPU generation delivers more performance, but facility power is naturally limited by a combination of the existing data center infrastructure and available power grid. The challenge is clear: How do you increase token output and rack density without exceeding power limits?

The power management in previous iterations of Mission Control helped organizations responsibly manage complex power considerations, but it was reactive. Jobs were scheduled first; power policies were enforced afterward. While this was a huge step for balancing power and performance, more dynamic solutions were needed to manage this at scale, especially across mixed Slurm and Kubernetes environments. This is where Mission Control evolves with version 3.0.

With domain power service incorporated directly into Mission Control, power becomes a first-class scheduling primitive that helps organizations optimize token production in line with their power policies. The service enables power-aware workload placement for both traditional Slurm workloads and Kubernetes-native workloads orchestrated by NVIDIA Run:ai, which is included in the Mission Control stack. Domain power service also supports MAX-P and MAX-Q profiles for training and inference, and provides rack- and topology-aware reservation steering by leveraging Mission Control integration with facility building management systems.

Figure 3. NVIDIA Mission Control uses domain power service for comprehensive power management that continuously monitors and optimizes power utilization in the AI factory.
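To make the difference between reactive enforcement and power-aware placement concrete, here is a minimal sketch in which power headroom is checked before a job is placed rather than after. It is not the domain power service's algorithm, which the article does not describe. `Rack`, `Job`, and `place_power_aware` are hypothetical names, and best-fit selection is an assumed policy chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rack:
    name: str
    power_cap_kw: float        # facility limit for this rack
    reserved_kw: float = 0.0   # power already promised to placed jobs

    @property
    def headroom_kw(self) -> float:
        return self.power_cap_kw - self.reserved_kw


@dataclass(frozen=True)
class Job:
    name: str
    est_power_kw: float        # estimated draw under the chosen power profile


def place_power_aware(job: Job, racks: list[Rack]) -> Optional[Rack]:
    """Pick the rack with the least headroom that still fits the job (best fit).

    Returns None when no rack can host the job without exceeding its cap,
    so the caller can queue the job instead of oversubscribing power.
    """
    candidates = [r for r in racks if r.headroom_kw >= job.est_power_kw]
    if not candidates:
        return None
    best = min(candidates, key=lambda r: r.headroom_kw)
    best.reserved_kw += job.est_power_kw
    return best
```

A reactive scheduler would place the job first and throttle afterward. Here a job that does not fit is never placed, so a rack cannot exceed its cap as a result of scheduling.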

In one example where NVIDIA ran the MAX-Q profile, domain power service allowed the data center to run at 85% power with only a 7% throughput loss by dynamically applying the power profiles integrated into Mission Control.

The integration lets data center operators define facility constraints and lets AI practitioners confidently select performance or efficiency modes aligned to their workload priorities. Governance remains centralized, while this flexibility lets AI factories be tuned for the best performance per watt and performance per dollar.
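The two figures in this example imply a gain in throughput per watt, which the article does not state directly. The short calculation below derives it, assuming that "85% power" means the data center draws 85% of its baseline power and that the baseline runs at full power with no throughput loss. The function name is hypothetical.

```python
def relative_perf_per_watt(power_fraction: float, throughput_loss: float) -> float:
    """Throughput per watt relative to a baseline at 100% power and 0% loss."""
    if not 0 < power_fraction <= 1:
        raise ValueError("power_fraction must be in (0, 1]")
    if not 0 <= throughput_loss < 1:
        raise ValueError("throughput_loss must be in [0, 1)")
    return (1.0 - throughput_loss) / power_fraction


gain = relative_perf_per_watt(power_fraction=0.85, throughput_loss=0.07)
print(f"{gain:.3f}x")  # 0.93 / 0.85, prints 1.094x
```

Under those assumptions, giving up 7% of throughput to save 15% of power yields roughly 9% more throughput per watt.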

From dashboards to real-time decisions

In addition to providing new services for dynamic power management, Mission Control version 3.0 enhances existing anomaly detection capabilities by integrating with NVIDIA AIOps Collector and Platform Stacks (NACPS) for AI-powered predictive anomaly detection. At the core of NACPS is the AI cluster model, a graph-based representation of infrastructure and workloads that creates a topology-aware view across GPUs, NVIDIA NVLink scale-up, NVIDIA Spectrum-X Ethernet or NVIDIA Quantum InfiniBand East-West scale-out and NVIDIA BlueField DPU North-South networking. This view is combined with job topology in the cluster model.

Figure 4. NVIDIA AIOps Collector and Platform Stacks (NACPS) provides AI-powered predictive anomaly detection as a part of NVIDIA Mission Control 3.0. It collects data from the AI factory agent and combines machine learning and correlation to send back predictive workflows and remediations to the AI factory.

NACPS combines unsupervised online machine learning on metrics, natural language processing (NLP)-based analysis of logs to detect unknown issues, supervised learning trained on labeled incidents, and deterministic rule-based guardrails.
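Two of the techniques listed above, unsupervised online learning on metrics and deterministic rule-based guardrails, can be illustrated with a small sketch. The models NACPS actually uses are not described in the article, so this substitutes a simple stand-in: a streaming z-score test built on Welford's algorithm for running mean and variance. The class name, the thresholds, and the 90 °C limit are all hypothetical.

```python
import math


class OnlineZScoreDetector:
    """Streaming mean/variance (Welford's algorithm) with a z-score test."""

    def __init__(self, z_threshold: float = 4.0, warmup: int = 30):
        self.z_threshold = z_threshold
        self.warmup = warmup
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x: float) -> bool:
        """Score x against the history seen so far, then fold it into the state."""
        anomalous = False
        if self.n >= max(self.warmup, 2):   # need at least 2 samples for a variance
            std = math.sqrt(self.m2 / (self.n - 1))
            anomalous = std > 0 and abs(x - self.mean) / std > self.z_threshold
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous


def gpu_temp_guardrail(temp_c: float, limit_c: float = 90.0) -> bool:
    """Deterministic rule: fires regardless of what the learned model says."""
    return temp_c >= limit_c
```

The detector needs no labeled incidents and adapts as metrics arrive, while the guardrail covers hard limits that should never depend on learned statistics.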

Telemetry streams continuously from GPUs, switches, hosts, network interface cards (NICs) and schedulers into NACPS. Events and anomalies are automatically correlated across layers, enabling context-driven root cause analysis while reducing alert noise. Instead of isolated metrics, the system understands relationships.
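A toy version of cross-layer correlation shows why understanding relationships reduces alert noise. The sketch below assumes a hypothetical dependency map (`DEPENDS_ON`) from GPUs to hosts to leaf switches and groups raw alerts under the furthest upstream component that is also alerting. It is a deterministic walk over an assumed, acyclic topology, not the machine-learning correlation NACPS performs.

```python
from collections import defaultdict

# Hypothetical topology: each component points at the component it depends on.
DEPENDS_ON = {
    "gpu0": "host1", "gpu1": "host1",
    "gpu2": "host2",
    "host1": "leaf-sw1", "host2": "leaf-sw1",
    "host3": "leaf-sw2",
}


def root_cause(component: str, alerting: set[str]) -> str:
    """Return the furthest upstream alerting component reachable from `component`."""
    cause = component
    node = component
    while node in DEPENDS_ON:           # walk up the dependency chain
        node = DEPENDS_ON[node]
        if node in alerting:
            cause = node
    return cause


def correlate(alerting: set[str]) -> dict[str, list[str]]:
    """Group raw alerts by their likely root cause to cut alert noise."""
    groups = defaultdict(list)
    for component in sorted(alerting):
        groups[root_cause(component, alerting)].append(component)
    return dict(groups)
```

With alerts on gpu0, gpu2, leaf-sw1, and host3, the two GPU alerts collapse under leaf-sw1, leaving two incidents to investigate instead of four.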

When anomalies are detected, Mission Control can trigger automated remediation workflows through automated hardware recovery, which works in concert with the Slurm integration in NVIDIA Base Command Manager or with NVIDIA Run:ai for Kubernetes workloads.

The system doesn’t just monitor infrastructure. It understands it and acts on it.

Operators no longer need to chase symptoms. They gain foresight.

A different kind of KPI: Utilization vs. token production

As AI factory operations continue to evolve, operations teams need to consider a different kind of KPI. Traditional data centers were optimized for utilization, but AI factories need to be optimized for token production.

To optimize for token production, enterprises need to track metrics such as token production per GPU and per rack, and token production per watt and per megawatt. Every inefficiency directly reduces overall token output. If congestion in the network fabric isn’t detected and mitigated, a single rack unexpectedly exceeds its power constraint, or a compute node experiences an anomaly mid-job, the AI factory loses token generation and potential revenue.

However, when the AI factory is operating intelligently, it is able to convert every megawatt into tokens with precision, maximizing output.
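The token-centric KPIs named in this section can be computed from per-rack samples over a fixed measurement window. The sketch below is illustrative. `RackSample` and `token_kpis` are hypothetical names, and tokens per kilowatt-hour is used as the energy-normalized metric because, over a fixed window, it is proportional to the token rate per watt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackSample:
    rack: str
    gpus: int
    tokens: int         # tokens produced during the window
    energy_kwh: float   # energy consumed during the window


def token_kpis(samples: list[RackSample]) -> dict[str, float]:
    """Aggregate token-production KPIs across all racks in the window."""
    if not samples:
        raise ValueError("at least one rack sample is required")
    total_tokens = sum(s.tokens for s in samples)
    total_gpus = sum(s.gpus for s in samples)
    total_kwh = sum(s.energy_kwh for s in samples)
    return {
        "tokens_per_gpu": total_tokens / total_gpus,
        "tokens_per_rack": total_tokens / len(samples),
        "tokens_per_kwh": total_tokens / total_kwh,
    }
```

Tracking these values per window makes a congestion event or a power cap breach visible as a drop in output rather than as an isolated infrastructure alert.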

Get started with Mission Control

Mission Control 3.0 is designed around minimizing inefficiencies and increasing token output for AI factory operators. By correlating telemetry across domains, orchestrating power intelligently, modularizing the architecture for agility, and enhancing autonomous remediation with AI, it transforms infrastructure from a passive platform into an active participant in performance optimization.

Resources:

Solution overview

Latest release notes

Stay tuned for our latest release notes and implementation guides for NVIDIA Mission Control 3.0. You can also check out the on-demand replay of the NVIDIA GTC 2026 session with Eli Lilly & Company to hear firsthand insights into architecting and deploying high-performance AI infrastructure with powerful, intelligent software.

