NVIDIA Developer Blog · May 22, 2026, 02:32 · approx. 15 min read

Delivering Exascale Performance on NVIDIA GB200 NVL72 with Slurm Topology-Aware Job Scheduling

#LLM #Blackwell Architecture #NVIDIA GB200 NVL72 #HPC #Job Scheduling

TL;DR

NVIDIA explains why topology-aware job scheduling with the Slurm scheduler matters for getting the full exascale performance out of GB200 NVL72, and gives implementation recommendations.

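AI Deep Analysis · July 4, 2026, 23:13

Importance: / 5

Depth: 40%

Key points

GB200 NVL72 architecture characteristics

This system delivers exascale compute in a single rack: it carries 72 NVIDIA Blackwell GPUs and provides 130 TB/s of bandwidth over NVLink.

Why topology-aware scheduling is needed

To make full use of the hardware in a shared cluster, job placement must understand the system's network topology and make the best use of the NVLink fabric.

Optimization with Slurm

NVIDIA presents concrete recommendations for using the Slurm scheduler's topology-aware features to maximize GPU occupancy and improve the performance of large-scale AI training and inference.

New Slurm plugin

For rack-scale systems such as GB200 NVL72, NVIDIA and SchedMD introduced the topology/block plugin in Slurm 23.11. Unlike the earlier best-effort approach, it prevents resource fragmentation.

Optimization based on the physical network structure

Topology-aware scheduling takes the hierarchy of switches and racks into account and keeps workloads within an NVLink domain whenever possible, preserving locality and maximizing the performance of AI training jobs.

Resource efficiency in shared clusters

In a shared cluster running multiple training jobs, the scheduler has to handle differing network bandwidth requirements and bin-pack efficiently to avoid resource fragmentation.

Job scheduling aligned with NVL72 domains

The topology plugin configuration enables algorithms that align Slurm jobs with NVL72 domain boundaries, maximizing efficiency even when hardware faults occur.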


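Impact analysis

The article suggests that, when adopting GB200 NVL72 as next-generation AI infrastructure, success depends not only on buying the hardware but also on configuring the software layer (the scheduler) correctly. For companies and research institutions deploying this architecture, job placement that ignores network topology can become a performance bottleneck, so the article is an important prompt to review the configuration of management tools such as Slurm right away.

Editor's comment

Getting the most out of the latest hardware requires optimization at the OS and middleware level, and in large cluster operations in particular, topology-aware scheduling configuration is the key.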

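As AI models grow in scale and complexity, getting the full performance out of modern accelerated infrastructure depends as much on how workloads are placed as on the hardware itself. NVIDIA GB200 NVL72 delivers exascale compute in a single rack and makes it possible to run trillion-parameter models in real time. Capturing that performance in a shared cluster, however, requires a scheduler that understands the system architecture and aligns jobs with the network topology.

This post explains how Slurm topology-aware job scheduling works on NVIDIA GB200 NVL72 and provides scheduling recommendations for optimal GPU occupancy.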

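How does NVIDIA GB200 NVL72 deliver exascale compute?

NVIDIA GB200 NVL72 is an exascale computer that fits in a single rack. Its 72 NVIDIA Blackwell GPUs are interconnected by the largest production scale-up compute fabric, and NVIDIA NVLink provides 130 terabytes per second (TB/s) of low-latency GPU communication bandwidth for AI and high-performance computing (HPC) workloads. Combining multiple GB200 NVL72 systems in a cluster creates a hierarchical network topology containing large domains of very high network bandwidth.

An AI training job can benefit greatly from the abundant network bandwidth of GB200 NVL72 when it is scheduled to maximize use of the NVLink fabric. Recent results show that GB200 NVL72 delivers significant performance improvements across all AI workloads: training (more than 2.6x in recent MLPerf Training results), a range of inference use cases (real-time inference for trillion-parameter models, more than 1.5 million tokens/second for the OAI gpt-oss model, state-of-the-art disaggregated serving), and reasoning.

In a shared cluster running multiple training jobs, a resource-efficient scheduler must account for differing network bandwidth requirements.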

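What is topology-aware job scheduling?

Topology-aware job scheduling lets a job scheduler such as Slurm make resource allocation decisions based on the cluster's physical network layout, such as the hierarchy of switches and racks. The scheduler should preserve locality by keeping workloads within the same NVLink domain whenever possible. In addition, because multiple training or inference jobs can fit in a group of NVL72 racks, the scheduler must bin-pack efficiently to prevent resource fragmentation.

The long-standing Slurm topology/tree plugin provides topology-aware scheduling for large clusters, but its best-effort approach tends to fragment jobs across leaf switches in order to reduce queue time. This compromise between start time and performance was acceptable on traditional InfiniBand fabrics, but the arrival of rack-scale systems such as GB200 NVL72 and GB300 NVL72 made a change necessary. In response, NVIDIA and SchedMD jointly introduced the new topology/block plugin in Slurm 23.11, designed specifically for these modern architectures.

This topology plugin configuration provides information about which groups of nodes belong to the same NVL72 domain, which enables algorithms that align Slurm jobs with NVL72 domain boundaries. For details on the block topology plugin and how segment sizes are scheduled, see Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling.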

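How do cluster segmentation and job scheduling work on GB200 NVL72?

As clusters grow in scale and complexity, managing GPU resources becomes critical to achieving both high utilization and predictable performance. The GB200 NVL72 system introduces larger AI job segment sizes and fine-grained scheduling control, so operators can align segment configurations with workload needs. Together with the GB200 NVL72-aware scheduling extensions in the Slurm workload manager, this approach balances large and small jobs and maximizes efficiency even in the presence of hardware faults.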

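How does GB200 NVL72 enable larger segment sizes?

In multi-GPU workloads, the job segment size defines a subunit made up of nodes that can communicate with each other entirely over NVLink. Figure 1 illustrates how the segment number (Y) and segment size (S) are used to define the GPUs assigned to a specific job. The number of GPUs per node (G) is always 4 for GB200 and GB300.

Figure 1. GB200 NVL72 job size enables larger, scalable GPU groupings over NVLink

On earlier systems such as NVIDIA HGX H100, jobs were limited to a segment size of one node. The GB200 NVL72 system supports much larger segment sizes (up to 18 nodes) while still handling single-node segments efficiently.

The optimal segment size for a given application is determined by factors such as the model type and the combination of parallelism types used for training. In general, larger jobs (those using more GPUs) and jobs with high I/O bandwidth requirements, such as Mixture-of-Experts (MoE) training, benefit from larger segment sizes. Conversely, smaller jobs typically have lower I/O bandwidth requirements and should use a smaller segment size so as not to over-constrain the cluster scheduler. Because performance effects can be workload-specific, users who are unsure should validate this guidance against their own workloads.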

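What are best practices for GB200 NVL72 segment sizing?

In modeling, our team found a few general guidelines for maximizing GB200 NVL72 cluster utilization. One rule of thumb is to choose the critical job size, meaning the job size from which the "large" segment size of 16 nodes is used, so that those jobs account for no more than 90% of the GPU hours in the cluster. This gives the scheduler the flexibility to fully utilize the cluster with a good mix of segment sizes. Table 1 summarizes some of the recommended configurations.

Job size | Segment size | Example workload
128 | 16 | MoE model training
32 – 64 | 4 | Large dense model training
Less than 32 | 1 | Small model training

Table 1. Recommended GB200 NVL72 segment sizes by job size and workload type

Note that, for the purposes of this post, we assume user jobs prefer to run with power-of-two GPU segment sizes (for example, 4 nodes = 16 GPUs). Other segment sizes (for example, 12, 36, or 72 GPUs per segment) can also be chosen. To decide whether an alternative approach makes sense, study the efficiency of your jobs when they are mapped onto a non-power-of-two segment size, and the effect that jobs of different sizes have on overall cluster utilization.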

How to schedule jobs on GB200 NVL72 systems

NVIDIA and SchedMD have developed Slurm-based block scheduling extensions that enable GB200 NVL72-aware job placement for high utilization.

With power-of-two segment sizes, a GB200 NVL72 cluster can run large and small jobs side by side: for example, one 512-GPU job using 16-node segments alongside several 16-GPU jobs using single-node segments. These scheduling policies minimize fragmentation while maintaining high efficiency across the cluster.

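What is the GB200 NVL72 scheduling simulation framework?

To evaluate scheduling strategies at scale, we developed a standalone Slurm simulator that runs on a virtual machine and enables time-accelerated workload simulation. As shown in Figure 2, the simulator provides accurate and repeatable results by:

Running the Slurm code
Replaying production workloads or generating synthetic workloads
Simulating real-world conditions, including node failures and recoveries
Integrating with the metrics system for direct comparison of results

This setup provides significant leverage to test, compare, and confidently roll out new scheduling policies before deploying them in production.

Figure 2. Real and simulated metrics are compared across production and test environments in the Slurm simulator flow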

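Simulation parameters

The parameters of the simulation environment the team modeled are as follows.

Cluster capacity: 5,000 GB200 NVL72 nodes (20,000 GPUs in total)
Workload: 15,000 jobs over a seven-day period
Reliability: an average of 2.5% of nodes down at any given time

Figure 3. Job distribution across node-count buckets, comparing percentage of total jobs with percentage of total node hours. Each category has two bars: gray for the percentage of jobs and green for the percentage of node hours. In the "Large" category both values are the highest, with the green bar (node hours) slightly above the gray bar. "XLarge" also has a high share of node hours but a low job count. "Small" and "Medium" have high job counts but tend to have low node hours. The vertical axis runs from 0 to 50 percent.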

The team evaluated performance using a Large_Perf_Custom policy, designed to balance utilization and large-job performance:

Jobs with 32 nodes or more ran with a segment size of 16

Smaller jobs ran with a segment size of 2

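What do the simulation results show?

To evaluate the performance of the new scheduling strategies, we focused on two primary cluster metrics: block fragmentation and overall GPU occupancy.

Fragmentation analysis

A key metric for GB200 NVL72 scheduling is how small jobs affect the availability of NVLink domains for large jobs. The simulator tracked how small jobs (1 to 18 nodes) were placed within each NVLink domain.

The key finding is that the topology plugin effectively places small jobs on the last two nodes of each domain, minimizing fragmentation while preserving capacity for large jobs.

Figure 4. Heat map showing concentrated placement of small jobs on the last two nodes of each domain to minimize fragmentation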

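Occupancy metrics

Topology-aware scheduling introduces constraints, but our results showed that its impact on overall occupancy can be almost entirely eliminated through an optimal topology-aware scheduling implementation. Figure 5 shows a difference of only about 1% between Large_Perf_Custom and noTopo. The gap can be closed further by adding more small jobs.

Figure 5. Simulation results show that occupancy increases with flexible segment sizes

We compared occupancy under the Large_Perf_Custom algorithm we developed with occupancy under a noTopo policy. The noTopo configuration represents the best occupancy theoretically possible given the job size distribution, and it ignores the large runtime penalties that poor placement would cause under the noTopo algorithm. The practical goal is to get as close as possible to noTopo occupancy while avoiding the performance penalties of topology-naive scheduling.

In our simulation, occupancy came within roughly 1% of noTopo, showing that topology-aware scheduling can deliver high utilization without sacrificing performance.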

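What is the best job scheduling approach for GB200 NVL72?

Based on our simulation results and performance testing, we recommend a scheduling approach for NVIDIA GB200 NVL72 clusters that prioritizes large-job performance while maintaining high utilization. Large jobs of 64 GPUs or more should be given access to the maximum number of NVLink domains, with segment sizing used to ensure that GPUs are allocated proportionally across domains. Segment-based scheduling is essential for aligning resources with workload patterns. For jobs of 32 nodes or more, a segment size of 16 is recommended if the application can benefit from it, while smaller jobs are better suited to segment sizes of 2 to 8, depending on workload characteristics.

To maintain efficiency over time, continuous monitoring and optimization are important. Tracking fragmentation metrics, adjusting segment sizes as workload patterns evolve, and validating changes with simulation tools before production deployment all help sustain high utilization without sacrificing performance. Block topology can introduce constraints that reduce occupancy, but strategic scheduling policies can mitigate this effect and preserve the performance benefits.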

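Get started with NVIDIA GB200 NVL72

The NVIDIA GB200 NVL72 system represents a major advance in AI and high-performance computing (HPC), and unlocking its full potential requires topology-aware scheduling. Our modeling shows that, with simple configuration and segment-based scheduling, it is possible to achieve optimal performance while maintaining high cluster utilization. The ability to simulate different scheduling scenarios also makes it possible to deploy new policies with confidence, without putting production workloads at risk. Learn more about NVIDIA GB200 NVL72.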

Original text (English)

As AI models grow in scale and complexity, realizing the full performance of modern accelerated infrastructure depends as much on how workloads are placed as on the hardware itself. NVIDIA GB200 NVL72 delivers exascale compute in a single rack, unlocking real-time trillion-parameter models. Yet capturing that performance in a shared cluster requires schedulers that understand the system architecture and align jobs with its network topology.

This post explains how Slurm topology-aware job scheduling works on NVIDIA GB200 NVL72, and provides scheduling recommendations for optimal GPU occupancy.

How does NVIDIA GB200 NVL72 deliver exascale compute?

NVIDIA GB200 NVL72 is an exascale computer in a single rack. With 72 NVIDIA Blackwell GPUs interconnected by the largest production scale-up compute fabric, NVIDIA NVLink provides 130 terabytes per second (TB/s) of low-latency GPU communication bandwidth for AI and high-performance computing (HPC) workloads. Multiple GB200 NVL72 systems combined in a cluster create a hierarchical network topology with large domains of very high networking bandwidth.

An AI training job can greatly benefit from the abundant networking bandwidth offered by GB200 NVL72 when it is scheduled to maximize use of the NVLink fabric. Recent results show that GB200 NVL72 delivers significant performance improvements across all AI workloads: training (>2.6x in recent MLPerf Training results), a range of inference use cases (real-time inference for trillion-parameter models, >1.5 million tokens/second for the OAI gpt-oss model, state-of-the-art disaggregated serving), and reasoning.

In a shared cluster running multiple training jobs, a resource-efficient scheduler must account for varying network bandwidth requirements.

What is topology-aware job scheduling?

Topology-aware job scheduling allows a job scheduler such as Slurm to make resource allocation decisions based on the cluster’s physical network layout, such as the hierarchy of switches and racks. The scheduler should preserve locality, keeping workloads within the same NVLink domain whenever possible. In addition, because multiple training or inference jobs can fit in a group of NVL72 racks, the scheduler must provide efficient bin-packing to avoid resource fragmentation.

The longstanding Slurm topology/tree plugin provides topology-aware scheduling for large clusters, but its best-effort approach often fragments jobs across leaf switches to reduce queue time. While this compromise between start time and performance was acceptable for traditional InfiniBand fabrics, the advent of rack-scale systems like GB200 NVL72 and GB300 NVL72 necessitated a change. In response, NVIDIA and SchedMD collaborated to launch the new topology/block plugin in Slurm 23.11, specifically designed for these modern architectures.

This topology plugin configuration provides information about groups of nodes belonging to the same NVL72 domain, which enables algorithms that can align Slurm jobs with NVL72 domain boundaries. To learn more about the block topology plugin and how segment sizes are scheduled, see Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling.
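As a rough illustration of what "groups of nodes belonging to the same NVL72 domain" could look like in configuration, the sketch below generates one block entry per rack. The node names (node001, node002, ...) and rack names are hypothetical, and the 18-nodes-per-rack figure comes from 72 GPUs at 4 GPUs per node. The `BlockName=... Nodes=...` and `BlockSizes=...` keywords reflect the block-topology format of Slurm's topology.conf as commonly documented; verify them against the documentation for your Slurm version before use.

```python
def block_topology_lines(num_racks, nodes_per_rack=18, prefix="node"):
    """Return topology.conf-style lines, one block per NVL72 rack.

    Node and rack names are hypothetical placeholders.
    """
    lines = []
    for rack in range(num_racks):
        first = rack * nodes_per_rack + 1
        last = first + nodes_per_rack - 1
        lines.append(
            f"BlockName=rack{rack + 1:02d} "
            f"Nodes={prefix}[{first:03d}-{last:03d}]"
        )
    # Base block size: one NVL72 domain (assumed to be 18 nodes).
    lines.append(f"BlockSizes={nodes_per_rack}")
    return lines


if __name__ == "__main__":
    for line in block_topology_lines(2):
        print(line)
```

Each generated block corresponds to one NVLink domain, which is the boundary the scheduler is expected to respect when placing segments.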

How do cluster segmentation and job scheduling work on GB200 NVL72?

As clusters grow in scale and complexity, managing GPU resources becomes critical for achieving both high utilization and predictable performance. The GB200 NVL72 system introduces larger AI job segment sizes and fine-grained scheduling control, enabling operators to align segment configurations with workload needs. Together with GB200 NVL72-aware scheduling extensions in the Slurm workload manager, this approach balances large and small jobs to maximize efficiency even in the presence of hardware faults.

How does GB200 NVL72 enable larger segment sizes?

In multi-GPU workloads, the job segment size defines the subunit made of nodes that can communicate with each other entirely over NVLink. Figure 1 illustrates how segment number (Y) and segment size (S) are used to define the GPUs assigned to a specific job. GPUs per node (G) is always four for GB200 and GB300.

Figure 1. GB200 NVL72 job size enables larger, scalable GPU groupings over NVLink

In prior systems, such as NVIDIA HGX H100, jobs were limited to a segment size of one node. The GB200 NVL72 system supports much larger segment sizes (up to 18 nodes) while still efficiently supporting single-node segments.
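The relationship between segment number (Y), segment size (S), and GPUs per node (G) can be sketched as a simple product, Y × S × G. This formula is inferred from the description of Figure 1 rather than quoted from it, and the function names below are illustrative only.

```python
GPUS_PER_NODE = 4        # G: fixed at 4 for GB200 and GB300 (from the post)
MAX_SEGMENT_NODES = 18   # largest segment size mentioned in the post


def job_gpus(num_segments, segment_size, gpus_per_node=GPUS_PER_NODE):
    """Total GPUs for a job: Y segments x S nodes x G GPUs per node."""
    if num_segments < 1:
        raise ValueError("a job needs at least one segment")
    if not 1 <= segment_size <= MAX_SEGMENT_NODES:
        raise ValueError("segment size must be between 1 and 18 nodes")
    return num_segments * segment_size * gpus_per_node


def segments_needed(total_gpus, segment_size, gpus_per_node=GPUS_PER_NODE):
    """Number of segments Y for a job that must divide evenly into segments."""
    per_segment = segment_size * gpus_per_node
    if total_gpus % per_segment != 0:
        raise ValueError("job size is not a multiple of the segment size")
    return total_gpus // per_segment


if __name__ == "__main__":
    print(job_gpus(8, 16))          # 8 segments of 16 nodes
    print(segments_needed(16, 1))   # a 16-GPU job with single-node segments
```

Under this reading, a 512-GPU job with 16-node segments uses 8 segments, and a 16-GPU job with single-node segments uses 4.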

The optimal segment size for a given application is determined by factors such as model type and the combination of parallelism types used for training. Generally, larger jobs (those utilizing more GPUs) and those with high I/O bandwidth requirements—mixture-of-experts (MoE) training, for example—benefit from larger segment sizes. Conversely, smaller jobs typically have lower I/O bandwidth needs and should use a smaller segment size to prevent over-constraining the cluster scheduler. Users should validate this guidance for their specific workloads if unsure, as performance effects can be workload-specific.

What are best practices for GB200 NVL72 segment sizing?

In modeling, our team found a few general guidelines for maximizing GB200 NVL72 cluster utilization. A rule of thumb is to choose the critical job size that uses a “large” segment size of 16 nodes such that the percentage of GPU hours in the cluster for those jobs is <= 90%. This will give the scheduler flexibility to fully utilize the cluster with a good mix of segment sizes. Table 1 summarizes some of the recommended optimal configurations.
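One way to read the 90% rule of thumb is as a search over candidate thresholds: pick the smallest job size from which jobs use the large segment size, such that those jobs stay at or below 90% of total GPU hours. The sketch below follows that interpretation, which is an assumption; the job list in the example is invented purely for illustration and is not data from the post.

```python
def critical_job_size(jobs, max_share=0.90):
    """Return the smallest job size whose at-or-above jobs use <= max_share
    of all GPU hours, or None if no threshold satisfies the rule.

    jobs: iterable of (gpus, hours) tuples.
    """
    jobs = list(jobs)
    total = sum(gpus * hours for gpus, hours in jobs)
    if total <= 0:
        raise ValueError("workload has no GPU hours")
    for threshold in sorted({gpus for gpus, _ in jobs}):
        large = sum(gpus * hours for gpus, hours in jobs if gpus >= threshold)
        if large / total <= max_share:
            return threshold
    return None


if __name__ == "__main__":
    # Hypothetical workload: (GPUs per job, hours of runtime).
    example = [(8, 100), (64, 20), (128, 20), (512, 10)]
    print(critical_job_size(example))
```

In this made-up workload, jobs of 64 GPUs and up would account for about 92% of GPU hours, so the threshold moves up to 128 GPUs, where the share drops to about 79%.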

Note that, for the purposes of this post, we assume user jobs prefer to run with power-of-two GPU segment sizes (for example, 4 nodes = 16 GPUs). It is also possible to choose other segment sizes (12, 36, or 72 GPUs per segment, for example). To decide whether an alternate approach makes sense, study the efficiency of your jobs when mapped across a non-power-of-two segment size, and the effect on overall utilization of the cluster for different sized jobs.

How to schedule jobs on GB200 NVL72 systems

NVIDIA and SchedMD have developed block scheduling extensions built on Slurm that enable GB200 NVL72-aware job placement for high utilization.

With power-of-two segment sizes, a GB200 NVL72 cluster can run large and small jobs side by side: for example, one 512-GPU job using 16-node segments alongside several 16-GPU jobs using single-node segments. These scheduling policies minimize fragmentation while maintaining high efficiency across the cluster.

What is the GB200 NVL72 scheduling simulation framework?

To evaluate scheduling strategies at scale, we developed a standalone Slurm simulator that runs on a virtual machine and enables time-accelerated workload simulation. As shown in Figure 2, this simulator provides accurate and repeatable results by:

Running the Slurm code

Replaying production workloads or generating synthetic workloads

Simulating real-world conditions, including node failures and recoveries

Integrating with the metrics system for direct comparison of results

This setup provides significant leverage to test, compare, and confidently roll out new scheduling policies before deploying them in production.

Figure 2. Real and simulated metrics are compared across production and test environments in the Slurm simulator flow

Simulation parameters

Parameters of the simulation environment the team modeled include:

Cluster capacity: 5,000 GB200 NVL72 nodes (20,000 GPUs)

Workload: 15,000 jobs over a seven-day period

Reliability: Average of 2.5% of nodes down at any given time
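A few derived quantities make these parameters easier to picture. The sketch below assumes 18 nodes per NVLink domain (the maximum segment size quoted earlier) and 4 GPUs per node; how the simulated 5,000 nodes were actually grouped into domains is not stated in the post, so the domain count is only indicative.

```python
NODES = 5000
GPUS_PER_NODE = 4
NODES_PER_DOMAIN = 18      # assumption: one NVL72 rack per domain
JOBS = 15000
DAYS = 7
DOWN_FRACTION = 0.025      # average share of nodes down


def cluster_summary():
    """Derive headline numbers from the simulation parameters."""
    full_domains, leftover_nodes = divmod(NODES, NODES_PER_DOMAIN)
    return {
        "total_gpus": NODES * GPUS_PER_NODE,
        "full_domains": full_domains,
        "leftover_nodes": leftover_nodes,
        "avg_nodes_down": round(NODES * DOWN_FRACTION),
        "avg_gpus_down": round(NODES * DOWN_FRACTION) * GPUS_PER_NODE,
        "jobs_per_day": round(JOBS / DAYS),
    }


if __name__ == "__main__":
    for key, value in cluster_summary().items():
        print(f"{key}: {value}")
```

On these assumptions, roughly 125 nodes (500 GPUs) are unavailable at any moment, which is why fault-tolerant placement matters for occupancy.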

Figure 3. Job distribution across node count buckets showing percentage of total jobs versus percentage of total node hours

The team evaluated performance using a Large_Perf_Custom policy, designed to balance utilization and large job performance:

Jobs with 32 nodes or more ran with a segment size of 16

Smaller jobs ran with a segment size of two
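The two rules above can be written as a small selection function. This is a minimal sketch, not the simulator's code: the function names are illustrative, capping the segment size at the job's node count for one-node jobs is an added assumption, and the post does not describe how job sizes that are not a multiple of the segment size are handled. The `--segment` option shown in the generated command is the sbatch option used with block topology; check that your Slurm version supports it.

```python
def large_perf_custom_segment(job_nodes):
    """Segment size under the Large_Perf_Custom rules described above."""
    if job_nodes < 1:
        raise ValueError("a job needs at least one node")
    if job_nodes >= 32:
        return 16
    # Assumption: a one-node job cannot use a two-node segment.
    return min(2, job_nodes)


def sbatch_args(job_nodes, script="train.sh"):
    """Build an illustrative sbatch command line (script name is hypothetical)."""
    segment = large_perf_custom_segment(job_nodes)
    return ["sbatch", f"--nodes={job_nodes}", f"--segment={segment}", script]


if __name__ == "__main__":
    print(" ".join(sbatch_args(64)))
    print(" ".join(sbatch_args(4)))
```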

What do the simulation results show?

To evaluate the performance of the new scheduling strategies, we focused on two primary cluster metrics: fragmentation of blocks and overall GPU occupancy.

Fragmentation analysis

A key metric for GB200 NVL72 scheduling is how small jobs impact NVLink domain availability for large jobs. The simulator tracked how small jobs (1-18 nodes) were placed within each NVLink domain.

The key finding was that the topology plugin effectively placed small jobs on the last two nodes of each domain, minimizing fragmentation and preserving capacity for larger jobs.

Figure 4. Heat map showing concentrated placement of small jobs on the last two nodes of each domain to minimize fragmentation
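One plausible reading of this result is simple arithmetic: an 18-node domain that hosts a 16-node segment has exactly 2 nodes left, which matches the small-job segment size of 2. The toy packer below illustrates that intuition with a best-fit rule. It is not Slurm's placement algorithm, and the 18-node domain size and the job labels are assumptions made for the example.

```python
NODES_PER_DOMAIN = 18  # assumption: one NVL72 rack per NVLink domain


def place_segments(num_domains, large_count, small_count, large=16, small=2):
    """Toy best-fit placement of large and small segments into domains.

    Returns a list of domains; each domain is a list of node owners
    (None means the node is free).
    """
    domains = [[None] * NODES_PER_DOMAIN for _ in range(num_domains)]
    for i in range(large_count):
        # A large segment takes the first nodes of the next empty domain.
        target = next((d for d in domains if all(s is None for s in d)), None)
        if target is None:
            raise RuntimeError("no empty domain left for a large segment")
        target[:large] = [f"L{i}"] * large
    for j in range(small_count):
        # Best fit: the fullest domain that can still hold the segment.
        candidates = [d for d in domains if d.count(None) >= small]
        if not candidates:
            raise RuntimeError("no room left for a small segment")
        target = min(candidates, key=lambda d: d.count(None))
        free = [k for k, s in enumerate(target) if s is None][-small:]
        for k in free:
            target[k] = f"S{j}"
    return domains


if __name__ == "__main__":
    for domain in place_segments(num_domains=3, large_count=2, small_count=3):
        print(domain)
```

In the example, every small segment lands on the last two nodes of a domain, and the third domain keeps its first 16 nodes free for a future large segment.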

Occupancy metrics

While topology-aware scheduling introduces constraints, our results showed that its impact on overall occupancy can be almost entirely eliminated through an optimal topology-aware scheduling implementation. Figure 5 shows only ~1% difference between Large_Perf_Custom and noTopo. The gap can be narrowed further with more small jobs.

Figure 5. Simulation results show that occupancy increases with flexible segment sizes

We compared occupancy under the Large_Perf_Custom algorithm we developed with occupancy under a noTopo policy. The noTopo configuration represents the best theoretical occupancy possible given the job size distribution, ignoring the large runtime penalties that would result from poor placement in the noTopo algorithm. The practical goal is to get as close as possible to noTopo occupancy while avoiding the performance penalties of topology-naive scheduling.

Results show that our simulation achieved occupancy within roughly 1% of noTopo, demonstrating that topology-aware scheduling can deliver high utilization without sacrificing performance.
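The post does not define occupancy precisely. A common definition, assumed here, is allocated GPU hours divided by the GPU hours available over the measurement window. The sketch below applies that definition to the simulated cluster size and duration; the allocated GPU-hour figures are hypothetical inputs chosen only to show a one-point gap, not results from the simulation, and whether the reported ~1% is in percentage points or relative terms is not stated.

```python
def occupancy(allocated_gpu_hours, total_gpus, wall_hours):
    """Fraction of available GPU hours that were allocated to jobs."""
    capacity = total_gpus * wall_hours
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return allocated_gpu_hours / capacity


if __name__ == "__main__":
    total_gpus = 20_000        # from the simulation parameters
    wall_hours = 7 * 24        # seven-day window
    # Hypothetical allocated GPU hours for each policy (illustration only).
    no_topo = occupancy(3_192_000, total_gpus, wall_hours)
    topo_aware = occupancy(3_158_400, total_gpus, wall_hours)
    print(f"noTopo: {no_topo:.2%}")
    print(f"Large_Perf_Custom: {topo_aware:.2%}")
    print(f"gap: {no_topo - topo_aware:.2%}")
```

This definition counts all GPUs as available capacity; excluding nodes that are down would raise both occupancy figures.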

What is the best job scheduling approach for GB200 NVL72?

Based on our simulation results and performance testing, we recommend a scheduling approach for NVIDIA GB200 NVL72 clusters that prioritizes large job performance while maintaining high utilization. Large jobs of 64 GPUs or more should be given access to the maximum number of NVLink domains, using segment sizing to ensure proportional GPU allocation across domains. Segment-based scheduling is essential for aligning resources with workload patterns. For jobs of 32 nodes or more, a segment size of 16 is recommended if the application can benefit from it, while smaller jobs are better suited to segment sizes of two to eight, depending on workload characteristics.

To maintain efficiency over time, it is important to monitor and optimize continuously. Tracking fragmentation metrics, adjusting segment sizes as workload patterns evolve, and validating changes with simulation tools before production deployment can help sustain high utilization without sacrificing performance. While block topology can introduce constraints that reduce occupancy, applying strategic scheduling policies can mitigate this effect and preserve performance benefits.

Get started with NVIDIA GB200 NVL72

The NVIDIA GB200 NVL72 system represents a major advancement in AI and HPC computing, and unlocking its full potential requires topology-aware scheduling. Our modeling demonstrates that, with simple configuration and segment-based scheduling, it is possible to achieve optimal performance while maintaining high cluster utilization. The ability to simulate different scheduling scenarios further enables confident deployment of new policies without risking production workloads. Learn more about NVIDIA GB200 NVL72.
