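NVIDIA Developer Blog · May 8, 2026, 01:02 · about 9 min read

Real-Time Performance Monitoring and Faster Debugging with NCCL Inspector and Prometheus

#Distributed training #GPU communication #NVIDIA NCCL #Prometheus #Infrastructure optimization

TL;DR

NVIDIA has added a real-time Prometheus mode to NCCL Inspector, its tool for observing GPU-to-GPU communication in distributed deep learning, enabling real-time performance monitoring and faster debugging.

AI deep analysis · May 8, 2026, 02:02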


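Key points

What NCCL Inspector does

A dedicated monitoring tool that makes communication bottlenecks in the NVIDIA Collective Communication Library (NCCL) visible and helps identify latency and inefficient communication patterns during training.

Real-time monitoring through Prometheus integration

Works with existing Prometheus infrastructure and automates metrics collection and visualization, which reduces operational load.

Benefits for large-scale distributed training

In large model training runs that use hundreds to thousands of GPUs, it helps locate communication-layer bottlenecks quickly, improving resource efficiency and training speed.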


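Impact analysis

This tool addresses the problem of the communication layer being a black box in large-scale distributed training environments, letting developers quickly identify the causes of resource shortfalls and communication delays. As a result, it contributes to shorter AI model training cycles and better utilization of expensive GPU clusters.

Editorial comment

As visibility into the communication layer becomes as essential as compute resources for efficient large-model training, this officially supported NVIDIA monitoring tool is an important step toward reducing the operational burden on practitioners.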


Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). When training slows down, it becomes challenging to determine why and what to do next. A problem can span computation, communication, a specific rank, or underlying hardware.

NVIDIA NCCL Inspector accelerates triaging by providing a lightweight and continuous report of NCCL communication performance. It tracks operation type, size, and bandwidth across every rank, and with this latest enhancement, can facilitate real-time analysis with minimal overhead.

It also helps determine the optimal training recipe. A previous post introduced NCCL Inspector offline mode. While fine-grained analysis remains the standard for deep-dive data, this post introduces real-time monitoring, a new feature. Live, time-series visualizations can now be powered directly within a user’s infrastructure dashboard by integrating NCCL Inspector with Prometheus Exporter.

NCCL Inspector deployment architecture

NCCL 2.30 introduces Prometheus Mode, a major enhancement for real-time performance monitoring of NCCL in AI workloads. The NCCL Inspector works in two modes, shown in Figures 1 and 2.

Figure 1. NCCL Inspector in JSON mode (default/offline mode)

JSON mode operates in two phases: data collection and data analysis. First, the data collection phase generates performance metrics from each rank and stores them individually in a JSON file, typically on shared storage. Then, the data analysis phase processes the data. This method is considered offline since the processing isn't completed in real time.

Figure 2. NCCL Inspector in real-time Prometheus mode

This new feature integrates NCCL Inspector metrics with Prometheus, converting them into time-series data suitable for visualization in Grafana dashboards. Prometheus mode eliminates the large storage requirements previously necessary for JSON mode. The node exporter moves this metric data to Prometheus, a scalable, cloud-native platform. The NCCL job output file is designed to be overwritten continuously: once the node exporter collects the metrics, they are no longer needed on disk.
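The overwrite-in-place pattern described here can be illustrated with a short Python sketch. It is not NCCL Inspector's implementation, which the post does not show; it only demonstrates one common way to replace a metrics file so that a reader (for example a node exporter picking up .prom files, assuming that is the mechanism in use) never sees a half-written file. The function name overwrite_metrics_file is hypothetical.

```python
import os
import tempfile


def overwrite_metrics_file(path: str, text: str) -> None:
    """Replace the file at `path` with `text` in a single rename step.

    Illustration only: this is NOT how NCCL Inspector is known to write
    its output; it sketches the general overwrite-in-place pattern.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        # mkstemp creates the file as 0600; widen it so a collector running
        # as a different user could still read it.
        os.chmod(tmp_path, 0o644)
        # os.replace swaps the new content in; on POSIX this is atomic.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because each write replaces the previous snapshot, disk usage stays bounded by one file per GPU regardless of how long the job runs.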

Experimental setup for Prometheus Mode

Setting up the NCCL Inspector Profiler plugin requires building the plugin and setting the following required environment variables:

NCCL_PROFILER_PLUGIN=/path/to/nccl/plugins/profiler/inspector/libnccl-profiler-inspector.so

NCCL_INSPECTOR_ENABLE=1

NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=3000000

NCCL_INSPECTOR_PROM_DUMP=1

NCCL_INSPECTOR_DUMP_DIR=/path/to/node/exporter/log/location

The dump thread interval and dump directory should be set and tuned according to the node exporter used. Once configured, NCCL Inspector starts collecting and dumps collective performance into NCCL_INSPECTOR_DUMP_DIR. The Prometheus Node Exporter then sends the metrics to the Prometheus time-series database. Finally, these time-series metrics are rendered as dashboard graphs with Grafana.
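As a minimal sketch of the configuration above, the following Python helper assembles the five environment variables for a job launcher. The variable names and example values come from the list above; the helper name inspector_env and the idea of passing the result to a subprocess are assumptions for illustration, and the two paths are the placeholders from the post, not real locations.

```python
import os


def inspector_env(plugin_path: str, dump_dir: str, interval_us: int = 3_000_000) -> dict:
    """Return a copy of the current environment with Prometheus mode enabled.

    `inspector_env` is a hypothetical helper; only the variable names are
    taken from the post.
    """
    env = dict(os.environ)
    env.update({
        "NCCL_PROFILER_PLUGIN": plugin_path,
        "NCCL_INSPECTOR_ENABLE": "1",
        # Interval between dumps, in microseconds (3000000 = 3 seconds).
        "NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS": str(interval_us),
        "NCCL_INSPECTOR_PROM_DUMP": "1",
        # Directory the node exporter is configured to read from.
        "NCCL_INSPECTOR_DUMP_DIR": dump_dir,
    })
    return env


env = inspector_env(
    "/path/to/nccl/plugins/profiler/inspector/libnccl-profiler-inspector.so",
    "/path/to/node/exporter/log/location",
)
# The result can be handed to a launcher, e.g. subprocess.run(train_cmd, env=env),
# where train_cmd is whatever command starts the training job.
```

Keeping the interval as a parameter makes it easier to tune the dump frequency against how often the node exporter is collected.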

When running the job, the metrics are saved to a file with the format: nccl_inspector_metrics_<uuid_of_the_gpu>.prom

The UUID of the GPU is included in the file name since CUDA device IDs can overlap in a multi-user environment.
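For tooling that needs to map a metrics file back to a GPU, the naming scheme above can be matched with a small regular expression. This is a sketch: the helper name gpu_uuid_from_filename is hypothetical, and the UUID in the example is made up.

```python
import re
from typing import Optional

# Matches the file name pattern given in the post:
# nccl_inspector_metrics_<uuid_of_the_gpu>.prom
_PROM_FILE = re.compile(r"^nccl_inspector_metrics_(?P<uuid>.+)\.prom$")


def gpu_uuid_from_filename(name: str) -> Optional[str]:
    """Return the GPU UUID embedded in a metrics file name, or None."""
    match = _PROM_FILE.match(name)
    return match.group("uuid") if match else None


# Made-up UUID, used only to exercise the pattern.
example = "nccl_inspector_metrics_GPU-00000000-0000-0000-0000-000000000000.prom"
print(gpu_uuid_from_filename(example))
```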

The NCCL job output file is in the Prometheus exposition format. Each metric is labeled with context, including NCCL version, Slurm job ID, node, GPU, communicator name, number of nodes, number of ranks, and message size. The following is an example:

nccl_p2p_bus_bandwidth_gbs{version="v5.1",slurm_job_id="1670760",node="nvl72033-T01",gpu="GPU0",comm_name="unknown",n_nodes="1",nranks="64",p2p_operation="Send",message_size="1-2MB"} 19.1634

nccl_p2p_exec_time_microseconds{version="v5.1",slurm_job_id="1670760",node="nvl72033-T01",gpu="GPU0",comm_name="unknown",n_nodes="1",nranks="64",p2p_operation="Send",message_size="1-2MB"} 92.8984

nccl_p2p_bus_bandwidth_gbs{version="v5.1",slurm_job_id="1670760",node="nvl72033-T01",gpu="GPU0",comm_name="unknown",n_nodes="1",nranks="64",p2p_operation="Recv",message_size="1-2MB"} 19.2396

nccl_p2p_exec_time_microseconds{version="v5.1",slurm_job_id="1670760",node="nvl72033-T01",gpu="GPU0",comm_name="unknown",n_nodes="1",nranks="64",p2p_operation="Recv",message_size="1-2MB"} 92.5781

nccl_bus_bandwidth_gbs{version="v5.1",slurm_job_id="1670760",node="nvl72033-T01",gpu="GPU0",comm_name="unknown",n_nodes="4",nranks="32",collective="ReduceScatter",message_size="134-135MB",algo_proto="RING_SIMPLE"} 44.1181

nccl_collective_exec_time_microseconds{version="v5.1",slurm_job_id="1670760",node="nvl72033-T01",gpu="GPU0",comm_name="unknown",n_nodes="4",nranks="32",collective="ReduceScatter",message_size="134-135MB",algo_proto="RING_SIMPLE"} 104164
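To make the structure of these samples concrete, here is a minimal Python sketch that splits one line into metric name, labels, and value. It handles only the simple shape shown above (no escaped quotes, timestamps, or comment lines); a complete exposition-format parser, such as the one shipped with a Prometheus client library, would be the safer choice in practice. The function name parse_sample is hypothetical.

```python
import re

# metric_name{label="value",...} number
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line: str):
    """Split one exposition-format sample into (name, labels, value)."""
    match = _SAMPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labeled sample line: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


# One of the sample lines from the post.
line = (
    'nccl_bus_bandwidth_gbs{version="v5.1",slurm_job_id="1670760",'
    'node="nvl72033-T01",gpu="GPU0",comm_name="unknown",n_nodes="4",'
    'nranks="32",collective="ReduceScatter",message_size="134-135MB",'
    'algo_proto="RING_SIMPLE"} 44.1181'
)
name, labels, value = parse_sample(line)
print(name, labels["collective"], labels["n_nodes"], value)
```

The labels are what later allow dashboards to slice by collective type, message size, or node count without any change to the job itself.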

Once these metrics land in a Prometheus DB, the next step is rendering them in Grafana.

Time series-based Grafana dashboards

Figures 3 and 4 show example time-series dashboards that use the Prometheus labels to separate NVLink-only collectives from mixed (network + NVLink) collectives:

Figure 3. Grafana time-series dashboard showing NCCL AllGather bus bandwidth (GB/s) for NVIDIA NVLink-only communicators on a single node (n_nodes==1), observed over a 6-minute window

Figure 4. Grafana time-series dashboard showing NCCL AllGather bus bandwidth (GB/s) for combined network (IB/RoCE/EFA) and NVLink communicators in a multi-node setting (n_nodes==4), observed over a six-minute window
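The split between the two dashboards can be sketched as two PromQL queries that differ only in the n_nodes label matcher. The metric and label names come from the sample output earlier in the post; the label value "AllGather", the use of n_nodes="1" as a stand-in for NVLink-only communicators, and the avg by (node, gpu) aggregation are assumptions for illustration, not the queries behind the published figures.

```python
def allgather_busbw_query(n_nodes_matcher: str) -> str:
    """Build a PromQL query for AllGather bus bandwidth.

    `n_nodes_matcher` is a PromQL label matcher suffix such as '="1"'.
    The collective value and the aggregation are illustrative assumptions.
    """
    selector = (
        f'nccl_bus_bandwidth_gbs{{collective="AllGather",n_nodes{n_nodes_matcher}}}'
    )
    return f"avg by (node, gpu) ({selector})"


# Single-node communicators (treated here as NVLink-only, as in Figure 3).
nvlink_only = allgather_busbw_query('="1"')
# Multi-node communicators (network + NVLink, as in Figure 4).
mixed = allgather_busbw_query('!="1"')
print(nvlink_only)
print(mixed)
```

Each string could be pasted into a Grafana panel backed by the Prometheus data source.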

Use cases for NCCL Inspector

To demonstrate the triage workflow, these two use cases highlight how the dashboards accelerate root cause identification.

Live observability

Use live dashboards for finding the root cause of performance slowdowns in a long-running AI workload. Observing changes on dashboards and correlating job-level degradations with underlying NCCL or network-layer metrics enables targeted triage based on where the anomaly originates. The team ran a large LLM pre-training job to show this strategy.

Timeline A: Normal workflow

Figure 5 shows the AllGather bus bandwidth for the mixed network + NVLink collectives in one of the experiments. The compute performance for this AI pretraining workload was *~310 TFLOPs/GPU*.

Figure 5. Grafana time-series dashboard showing NCCL AllGather bus bandwidth (GB/s) for mixed network + NVLink communicators during a normal AI pretraining workflow on four nodes, corresponding to an observed compute performance of ~310 TFLOPs/GPU

Timeline B: Network-induced slowdown

After introducing artificial network constraints, AllGather bus bandwidth (BusBw) for the mixed network + NVLink collectives reflects the slowdown: compute performance decreased to ~268 TFLOPs per GPU (~13% degradation vs. baseline).

This example shows that a real-time dashboard improves observability of collective performance across mixed transport communicators (network + NVLink), enabling faster root cause identification and reducing mean time to resolution.
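The degradation figure quoted for Timeline B follows directly from the two throughput numbers. A short Python check, using the rounded values from the text (so the result is itself approximate):

```python
def relative_degradation(baseline: float, observed: float) -> float:
    """Fractional drop of `observed` relative to `baseline`."""
    return (baseline - observed) / baseline


# ~310 TFLOPs/GPU in Timeline A vs ~268 TFLOPs/GPU in Timeline B.
drop = relative_degradation(310.0, 268.0)
print(f"{drop:.1%}")  # prints "13.5%"
```

That is consistent with the roughly 13% degradation stated above.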

Figure 6. Grafana time-series dashboard showing NCCL AllGather bus bandwidth (GB/s) for mixed network + NVLink communicators during Timeline B, a network-induced slowdown scenario

Performance attribution

Another use case for NCCL Inspector is analyzing performance degradation over a specific time period. For example, in one experiment, the performance degrades temporarily as shown:

[2026-03-19 14:39:47.098640] -> throughput per GPU: ~314 TFLOP/s/GPU

[2026-03-19 14:40:48.696103] -> throughput per GPU: ~295 TFLOP/s/GPU

[2026-03-19 14:42:00.816450] -> throughput per GPU: ~289 TFLOP/s/GPU

[2026-03-19 14:44:02.304347] -> throughput per GPU: ~311 TFLOP/s/GPU

Next, the observed degradation is examined to determine whether it correlates with a network anomaly during this period.

Figure 7. Grafana time-series dashboard showing NCCL ReduceScatter bus bandwidth (GB/s) for NVLink-only communicators during a performance attribution investigation on 2026-03-19

Figure 8. Grafana time-series dashboard showing NCCL ReduceScatter bus bandwidth (GB/s) for mixed Network + NVLink communicators during the same performance attribution window on 2026-03-19

The dashboard shows performance degradation in mixed transport communication (network + NVLink-based collectives). This correlation indicates that the root cause is a disruption/congestion in the network. This enables drilling down into per-host and network counters to isolate where the slowdown occurred.
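The time window to inspect on the dashboards can be derived from the job log shown earlier in this section. The sketch below parses those four log lines and reports the span during which throughput fell more than 5% below the highest value seen. The 5% tolerance, the use of the peak as the baseline, and the function name degraded_window are assumptions chosen for illustration.

```python
import re
from datetime import datetime

_LOG = re.compile(
    r"\[(?P<ts>[^\]]+)\] -> throughput per GPU: ~(?P<tflops>\d+(?:\.\d+)?) TFLOP/s/GPU"
)


def degraded_window(lines, tolerance=0.05):
    """Return (start, end) of samples below (1 - tolerance) * peak, or None."""
    samples = []
    for line in lines:
        match = _LOG.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            samples.append((ts, float(match.group("tflops"))))
    if not samples:
        return None
    # Assumption: treat the best observed throughput as the baseline.
    floor = max(value for _, value in samples) * (1.0 - tolerance)
    slow = [ts for ts, value in samples if value < floor]
    return (min(slow), max(slow)) if slow else None


log = [
    "[2026-03-19 14:39:47.098640] -> throughput per GPU: ~314 TFLOP/s/GPU",
    "[2026-03-19 14:40:48.696103] -> throughput per GPU: ~295 TFLOP/s/GPU",
    "[2026-03-19 14:42:00.816450] -> throughput per GPU: ~289 TFLOP/s/GPU",
    "[2026-03-19 14:44:02.304347] -> throughput per GPU: ~311 TFLOP/s/GPU",
]
window = degraded_window(log)
print(window)
```

With these inputs the window covers the 295 and 289 TFLOP/s samples, which is the range to line up against the ReduceScatter bus bandwidth panels.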

Next steps for real-time observability

The introduction of NCCL Inspector with Prometheus integration is designed to enhance network observability for AI workload performance analysis. This powerful combination enables a more scientific approach to performance analysis. Users can debug and understand the real-time performance characteristics of a running workload, triage slowdowns, fine-tune parameters, and measure the resulting performance changes using detailed metrics.

Get started

Refer to the GitHub repository: https://github.com/NVIDIA/nccl/blob/master/plugins/
