ラックスケールスーパーコンピュータでのAIワークロード実行:ハードウェアからトポロジ対応スケジューリングまで
NVIDIAは、Blackwellアーキテクチャを採用したラックスケールスーパーコンピュータGB200 NVL72およびGB300 NVL72システムを発表し、大規模AIワークロードを効率的に実行するためのハードウェアからトポロジを考慮したスケジューリングまでの包括的なソリューションを提供する。
キーポイント
新世代ラックスケールスーパーコンピュータの発表
NVIDIAはBlackwellアーキテクチャを採用したGB200 NVL72およびGB300 NVL72システムを発表し、これらは大規模AIワークロードを実行するために設計されたラックスケールスーパーコンピュータである。
ハードウェアからスケジューリングまでの包括的ソリューション
記事は、これらのシステムが単なるハードウェアの提供ではなく、トポロジを考慮したスケジューリングを含む、AIワークロードを効率的に実行するための包括的なソリューションであることを強調している。
大規模AIワークロードの効率化
これらのシステムは、大規模なAIモデルのトレーニングや推論など、膨大な計算リソースを必要とするワークロードの効率的な実行を可能にすることを目的としている。
影響分析・編集コメントを表示
影響分析
この発表は、大規模AIモデルの開発と運用に必要な計算インフラの新たな基準を示すものであり、AI研究と産業応用の加速に寄与する可能性が高い。NVIDIAのハードウェア優位性を、システム全体の効率性という観点でさらに強化する戦略的な動きと見ることができる。
編集コメント
NVIDIAの開発者ブログからの情報であり、自社製品の紹介が主ではあるが、大規模AIワークロードを支えるインフラの進化を示す重要なニュースとして捉えるべき。
imageNVIDIA Blackwellアーキテクチャを採用したNVIDIA GB200 NVL72およびNVIDIA GB300 NVL72システムは、ラックスケールスーパーコンピュータです。これらは18...
原文を表示
The NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72 systems, featuring NVIDIA Blackwell architecture, are rack-scale supercomputers. They’re designed with 18 tightly coupled compute trays, massive GPU fabrics, and high-bandwidth networking packaged as a unit.
For AI architects and HPC platform operators, the challenge isn’t just racking and stacking hardware—it’s turning infrastructure into safe, performant, and easy-to-use resources for end users. The mismatch between rack-scale hardware topology and scheduler abstractions is where most of the operational complexity lives. Left unaddressed, schedulers operate on a flat pool of GPUs and nodes, overlooking the system’s hierarchical and topology-sensitive design.
This is the gap that a validated software stack, such as NVIDIA Mission Control, is designed to bridge. Mission Control provides rack-scale control planes for NVIDIA Grace Blackwell NVL72 systems. With a native understanding of NVIDIA NVLink and NVIDIA IMEX domains, it integrates with workload management platforms like Slurm and NVIDIA Run:ai. These capabilities will also be supported for the NVIDIA Vera Rubin platform, including for NVIDIA Rubin NVL8.
This post demonstrates how Mission Control, Slurm, and NVIDIA Run:ai turn advanced GPU architecture concepts—such as NVLink and IMEX domains—into an operational AI factory that is scalable, schedulable, and easy to manage.
The core challenge: Rack-scale topology meets AI workload scheduling
Figure 1. Diagram of NVLink spine that explains the relationships between the NVLink and IMEX domains and NVLink Partition
At a physical level, GB300 NVL72 and GB200 NVL72 systems are powerful, sophisticated systems. Each delivers a dense GPU fabric connected by NVLink switches, supports NVIDIA Multi-Node NVLink (MNNVL) within the rack, and includes IMEX-capable compute trays that enable shared GPU memory across nodes.
Schedulers, however, don’t operate at the level of switches and fabrics. They require:
Discrete GPU resource pools can allocate predictably.
Clear isolation boundaries to protect workloads from one another.
Consistent performance characteristics that match user expectations.
Figure 2. NVLink core concepts and highlights
Under the hood, the NVLink topology of a Grace Blackwell NVL72 rack is reflected up the software stack through a pair of system-level identifiers: cluster UUID and clique ID.
These identifiers encode a GPU’s position in the NVLink fabric—across domains or racks—in a way that system software, schedulers, and higher-level tooling can reason about.
The mapping is straightforward:
Cluster UUID corresponds to the NVLink domain.
Clique ID corresponds to the NVLink partition.
A shared cluster UUID means that systems—and their GPUs—belong to the same NVLink domain and are connected by a common NVLink fabric. On Grace Blackwell NVL72, this UUID is consistent across the entire rack: all GPUs in the same NVL72 rack report the same cluster UUID.
The clique ID provides a finer-grained distinction. GPUs that share a clique ID belong to the same NVLink Partition within that domain. When a rack is carved into multiple NVLink partitions, the cluster UUID remains the same—because the GPUs live in the same physical NVLink domain—but the clique IDs differ to reflect the logical partitioning of the fabric.
From an operational perspective, this distinction matters:
Cluster UUID answers: Which GPUs physically share a rack and are capable of NVLink communication?
Clique ID answers: Which GPUs share an NVLink Partition and are intended to communicate together for a given workload or service tier?
Figure 3. Example of how NVIDIA Mission Control centralizes the view of cluster UUID and clique ID across a managed environment
These identifiers form the connective tissue between hardware topology and scheduling logic. They enable platforms like Slurm, Kubernetes, and NVIDIA Run:ai to align job placement, isolation, and performance guarantees with the actual structure of the NVLink fabric, without exposing that complexity directly to end users.
Figure 4. Cluster UUID and cliqueID on a NVIDIA Grace Blackwell compute tray
Scheduling Multi-Node NVLink workloads with Slurm
Once you start running multi-node workloads on Blackwell-based NVL72 systems, placement becomes as important as GPU count. A 16-GPU job spread across the wrong nodes can behave very differently from the same job confined to a single NVLink fabric.
This is where Slurm’s topology/block plugin becomes essential, enabling Slurm to recognize that not all nodes are equal.
On Grace Blackwell NVL72 blocks of nodes with lower-latency connections map directly to NVLink partitions—groups of GPUs that share a high-bandwidth NVLink fabric.
Consider a simple example with two NVL72 racks. Both racks may belong to the same Slurm partition (queue), but that doesn’t mean jobs should freely span them. From a performance standpoint, each rack—or each NVLink partition carved from it—is a distinct high-bandwidth block.
Figure 5. NVL racks consumed as blocks in Slurm. Two racks are grouped under a common Slurm partition
By enabling the topology/block plugin and exposing NVLink partitions as blocks, Slurm gains the context it needs to make better decisions. Jobs are placed within a single NVLink partition (or block) by default, preserving MNNVL performance. Larger jobs can span blocks when necessary, but the tradeoff becomes explicit rather than accidental.
Figure 6. Individual racks that are separate blocks to Slurm
In practice, this means:
One block/node group per rack, with Slurm QoS applied at the user or group level to manage access to the shared partition.
Multiple blocks/node groups per rack when offering smaller, isolated high-bandwidth GPU pools. In this model, each block/node group maps to a Slurm partition, providing a service tier per partition. Users automatically land inside the intended NVLink partition through the Slurm partition—without needing to understand the underlying fabric.
Figure 7. Multiple NVLink partitions within a rack, each mapped to Slurm blocks and partitions
IMEX management with Slurm: From rack-level service to per-job isolation
For multi-node NVIDIA CUDA workloads that rely on MNNVL, IMEX enables GPUs on different compute trays to participate in a shared-memory programming model.
Figure 8. How IMEX service runs on compute trays at the OS level, and how NVSwitches enable IMEX to create MNNVL connections
From an application’s point of view, using MNNVL looks deceptively simple. Under the hood, however, Mission Control ensures a few things line up when running MNNVL jobs with Slurm:
IMEX runs on exactly the set of compute trays participating in the job.
Those trays belong to a common NVLink partition.
The IMEX lifecycle is reliable, secure, and scoped tightly enough to avoid cross-job interference.
Figure 9. Per-job IMEX in Slurm, where two jobs share an NVL72 rack and NVLink partition but use separate IMEX domains
IMEX is a system-level daemon running in the host OS of every compute tray. It provides memory-sharing and synchronization mechanisms that CUDA libraries build on. If IMEX is mis-scoped, left running too broadly, or fails noisily, multi-node workloads quickly become fragile.
Extending multi-node NVLink support to Kubernetes and NVIDIA Run:ai
Just as Slurm needs help understanding NVLink fabrics to schedule Grace Blackwell NVL72 jobs optimally, Kubernetes also lacks native awareness of rack-scale high-bandwidth interconnects like NVLink.
To address this, the NVIDIA ecosystem combines ComputeDomains (via the NVIDIA Dynamic Resource Allocation (DRA) GPU driver) and NVIDIA Run:ai integration—lifting NVLink and IMEX concepts into domain-aware, scheduler-ready primitives.
In Kubernetes, dynamic resource allocation (DRA) provides an API for flexible, fine-grained management of specialized hardware.
The NVIDIA k8s-dra-driver-gpu implements this API for GPUs and ComputeDomains. The driver includes separate kubelet plugins for GPU allocation, and for ComputeDomains, and when enabled, manages GPU multi-node NVLink domains—grouping GPUs across nodes into secure, high-bandwidth, memory-coherent domains for HPC and AI workloads.
Kubernetes: From flat schedulers to NVLink-aware placement
In Kubernetes, the core challenge is similar to Slurm. Kubernetes pods need to be placed on nodes that share high-bandwidth connectivity. Kubernetes by itself doesn’t understand NVLink domains, so workloads might be scattered across nodes that lack the required fabric connectivity for efficient multi-node execution.
The solution is the ComputeDomains concept provided by the NVIDIA DRA driver for GPUs. A ComputeDomain:
Represents a set of nodes that share an NVLink/MNNVL domain.
A workload submission must create the ComputeDomain object and link to that object through a ResourceClaim when a distributed workload (training or inference) is submitted.
Is tied to the exact set of pods participating in the workload.
Is torn down when the workload ends.
By doing this, ComputeDomains make the high-performance fabric first-class in scheduling. ComputeDomains also ensure that underlying resources like IMEX channels are properly instantiated and scoped for the workload.
---apiVersion: resource.nvidia.com/v1beta1kind: ComputeDomainmetadata: name: compute-domainspec: numNodes: 18 channel: resourceClaimTemplate: name: compute-domain-resource-claim---apiVersion: apps/v1kind: Deploymentmetadata: name: test-workload labels: app: test-workloadspec: replicas: 18 selector: matchLabels: app: test-workload template: metadata: labels: app: test-workload spec: containers: - name: ctr image: ubuntu:22.04 command: ["test-workload"] resources: limits: nvidia.com/gpu: 4 claims: - name: compute-domain resourceClaims: - name: compute-domain resourceClaimTemplateName: compute-domain-resource-claim
How NVIDIA Run:ai simplifies distributed workloads on NVLink domains
NVIDIA Run:ai builds on Kubernetes and ComputeDomains to make Grace Blackwell NVL72 systems usable without exposing users to NVLink topology, IMEX domains, or low-level scheduling mechanics. Users request distributed GPUs. NVIDIA Run:ai handles the rest.
Figure 10. Topology-aware scheduling keeps workload 2 within a single rack by considering resources and job requirements
Under the hood, NVIDIA Run:ai automates several critical pieces:
Automatic detection and labeling: GB200 NVL72 nodes are identified and labeled based on their NVLink/MNNVL domain membership. These labels form the foundation for NVLink-aware placement.
ComputeDomain-backed workload placement: When a job is submitted, it automatically creates and attaches a ComputeDomain. Pods are placed on nodes that share NVLink connectivity and a correctly scoped IMEX domain.
Topology-aware scheduling: When a network topology is defined for a node pool, NVIDIA Run:ai applies topology-aware scheduling preferences to keep all pods in a distributed workload as close as possible, only expanding outward when necessary. This reduces latency and avoids accidental cross-domain placement.
Automatic topology detection with Topograph
All of these approaches rely on accurate knowledge of the underlying hardware and network topology. For platform engineers, manually defining NVLink domains, rack boundaries, or network hierarchies does not scale in large or frequently changing environments. This is where Topograph, an open source NVIDIA tool, fits into the picture.
Topograph automatically discovers cluster topology by collecting node-level and infrastructure metadata and translating it into scheduler-consumable representations. It can detect how nodes are connected—across racks, switches, or fabric hierarchies—and reveals this structure through APIs that higher-level systems can consume. Instead of assuming a flat cluster, schedulers gain a concrete view of proximity and bandwidth relationships.
When combined with Slurm and Kubernetes—along with ComputeDomains and NVIDIA Run:ai—Topograph enables a fully automated flow. Topology is discovered rather than hand-modeled, nodes are labeled consistently, and topology-aware placement decisions can be made with minimal operator intervention. This closes the gap between physical infrastructure and logical scheduling abstractions.
Learn more about advanced AI operations
Check out the Mission Control Administrator Guide or User Guide to learn more about implementation.
Watch an on-demand NVIDIA GTC 2026 session with Eli Lilly & Company and learn how they went from rack-scale hardware to schedulable AI infrastructure with powerful, intelligent software.
関連記事
米国、中国企業向け Nvidia 最高級チップの海外販売ループホールを閉鎖へ
米商務省は、本社が中国にある企業の海外子会社による NVIDIA 製先端チップ購入にも輸出ライセンス要件を適用する指針を発出し、既存の回避策を封じる方針を示した。
サムスン経営陣、スマートフォン事業で初となる赤字の可能性を懸念
サムスン電子のモバイル部門責任者TM氏は、スマートフォンの成熟化とAI競争激化により、2026年に事業赤字となる可能性があると懸念を示した。
AIチップスタートアップCerebrasがIPO申請
AIチップスタートアップのCerebrasがIPOを申請した。同社は最近、Amazon Web Servicesとのデータセンター向けチップ使用契約や、OpenAIとの100億ドル超の契約を発表している。