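NVIDIA Developer Blog · June 26, 2026, 01:43 · About a 12-minute read

Scaling AI Inference Across Multiple GPUs with NVIDIA TensorRT: Introducing Multi-Device Inference Support

#LLM #TensorRT #GPU #NVIDIA #LargeModels

TL;DR

NVIDIA announced multi-device inference support, a new TensorRT feature for scaling AI inference efficiently across multiple GPUs.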

AI in-depth analysis · June 26, 2026, 02:04

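Key points

TensorRT multi-device inference

NVIDIA added multi-device inference support to TensorRT, so a deployment can use multiple GPUs together instead of being limited to one.

Better performance for large models

Even very large AI models that are bottlenecked by the memory capacity or compute of a single GPU can run fast, stable inference through distributed processing.

Infrastructure efficiency and cost reduction

Pooling hardware efficiently is expected to raise resource utilization for large-model operations and to reduce deployment and operating costs.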

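Impact analysis

This announcement gives large language models (LLMs) and advanced multimodal models a practical way around single-device hardware limits. For enterprises, it reduces dependence on expensive single-GPU servers and opens a path to running more complex, higher-performance AI applications at low latency, which could accelerate a shift in how AI inference infrastructure is built.

Editorial comment

Technology that moves past single-GPU limits and links multiple cards seamlessly is an essential infrastructure requirement in today's fast-growing large-model market. Developers who adopt this feature can handle considerably more complex workloads.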

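Generative AI workloads are rapidly outgrowing the memory and compute budget of a single GPU. For inference developers building media generation pipelines, the challenge is to scale across multiple devices without sacrificing the critical optimizations (kernel fusion, memory planning, quantization, and others) that NVIDIA TensorRT delivers for production deployments.

Multi-device inference support, a new feature introduced in TensorRT 11.0, brings native high-performance multi-GPU inference to the TensorRT runtime, enabling multi-device production deployments targeting edge devices.

By combining the multi-device inference support in TensorRT with Torch-TensorRT, developers can convert massive PyTorch models and deploy them outside the framework, beyond the memory and compute limits of a single device.

Download TensorRT 11.0 with multi-device inference support from the NVIDIA Developer Portal to get native, high-performance multi-device acceleration for your models.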

NVIDIA NCCL: The transport layer for distributed inference

The NVIDIA Collective Communications Library (NCCL) provides the high-performance multi-GPU and multi-node collective operations that power large-scale model training across thousands of GPUs. NCCL automatically selects the optimal transport for a given topology, abstracting NVIDIA NVLink, NVIDIA NVSwitch, PCIe, and InfiniBand behind a uniform interface. Because TensorRT integrates directly with NCCL, multi-device inference workloads inherit the same transport optimization. For more information on NCCL, see https://developer.nvidia.com/nccl.

The new multi-device feature covers the full set of NVIDIA NCCL distributed collectives: AllReduce, Broadcast, Reduce, AllGather, ReduceScatter, AlltoAll, Gather, and Scatter.
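
To make the semantics of a few of these collectives concrete, the following is a minimal in-process sketch. It does not use NCCL or TensorRT: ranks are simulated as entries of a vector, and the names Buffer, World, allReduceSum, allGather, and reduceScatterSum are hypothetical helpers introduced only for this illustration.

```cpp
#include <cstddef>
#include <vector>

using Buffer = std::vector<float>;   // data held by one simulated rank
using World = std::vector<Buffer>;   // world[r] is the buffer of rank r

// AllReduce (sum): every rank ends up with the element-wise sum of all inputs.
World allReduceSum(const World& in) {
    Buffer sum(in.at(0).size(), 0.0f);
    for (const Buffer& b : in)
        for (std::size_t i = 0; i < sum.size(); ++i) sum[i] += b[i];
    return World(in.size(), sum);
}

// AllGather: every rank ends up with all inputs concatenated in rank order.
World allGather(const World& in) {
    Buffer cat;
    for (const Buffer& b : in) cat.insert(cat.end(), b.begin(), b.end());
    return World(in.size(), cat);
}

// ReduceScatter (sum): the inputs are summed, then rank r keeps only chunk r.
// Assumes the buffer length is divisible by the number of ranks.
World reduceScatterSum(const World& in) {
    const std::size_t ranks = in.size();
    const std::size_t chunk = in.at(0).size() / ranks;
    const Buffer sum = allReduceSum(in)[0];
    World out(ranks);
    for (std::size_t r = 0; r < ranks; ++r) {
        const auto first = sum.begin() + static_cast<std::ptrdiff_t>(r * chunk);
        out[r].assign(first, first + static_cast<std::ptrdiff_t>(chunk));
    }
    return out;
}
```

With two ranks holding {1, 2} and {10, 20}, AllReduce leaves {11, 22} on both ranks, AllGather leaves {1, 2, 10, 20} on both, and ReduceScatter leaves {11} on rank 0 and {22} on rank 1.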

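Parallelism strategies for distributed inference

Distributed inference can be expressed using several parallelism strategies, each with different trade-offs between memory savings, compute scaling, and communication overhead. The most common strategies are tensor parallelism and context parallelism.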

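Tensor parallelism

In tensor parallelism, the weights of a single layer are partitioned across GPUs. Each GPU computes a shard of the layer's matrix multiplication, and a collective then combines the partial results into the full output. This reduces the weight memory needed per device, which makes tensor parallelism the natural (and often the only) choice when an individual layer's weights exceed the memory of a single GPU, regardless of input sequence length or batch size.

In a transformer block, column-parallel projections (for example, QKV and the MLP up-projection) are paired with row-parallel projections (the attention output and the MLP down-projection) so that each block requires only a single AllReduce, which keeps communication overhead bounded.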
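
The column-parallel and row-parallel pairing can be checked numerically with a small in-process sketch. It assumes a two-layer MLP with a ReLU between the projections, simulates ranks with a loop, and replaces the AllReduce with a plain sum. The names Matrix, matmul, relu, sliceCols, sliceRows, mlpReference, and mlpTensorParallel are hypothetical and are not TensorRT APIs.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major [rows][cols]

// Dense matrix product a (m x k) * b (k x n).
Matrix matmul(const Matrix& a, const Matrix& b) {
    Matrix c(a.size(), std::vector<double>(b[0].size(), 0.0));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t k = 0; k < b.size(); ++k)
            for (std::size_t j = 0; j < b[0].size(); ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Element-wise ReLU; being element-wise, it can be applied per column shard.
Matrix relu(Matrix m) {
    for (auto& row : m)
        for (double& x : row) x = std::max(x, 0.0);
    return m;
}

// Columns [first, last) of m: one rank's shard of a column-parallel weight.
Matrix sliceCols(const Matrix& m, std::size_t first, std::size_t last) {
    Matrix out;
    for (const auto& row : m) out.emplace_back(row.begin() + first, row.begin() + last);
    return out;
}

// Rows [first, last) of m: one rank's shard of a row-parallel weight.
Matrix sliceRows(const Matrix& m, std::size_t first, std::size_t last) {
    return Matrix(m.begin() + first, m.begin() + last);
}

// Unsharded MLP: down-projection of ReLU(up-projection of x).
Matrix mlpReference(const Matrix& x, const Matrix& up, const Matrix& down) {
    return matmul(relu(matmul(x, up)), down);
}

// Tensor-parallel MLP: each simulated rank holds a column shard of `up` and the
// matching row shard of `down`; one sum over ranks stands in for the AllReduce.
// Assumes the hidden size is divisible by `ranks`.
Matrix mlpTensorParallel(const Matrix& x, const Matrix& up, const Matrix& down,
                         std::size_t ranks) {
    const std::size_t shard = up[0].size() / ranks;
    Matrix out(x.size(), std::vector<double>(down[0].size(), 0.0));
    for (std::size_t r = 0; r < ranks; ++r) {
        const Matrix upShard = sliceCols(up, r * shard, (r + 1) * shard);
        const Matrix downShard = sliceRows(down, r * shard, (r + 1) * shard);
        const Matrix partial = matmul(relu(matmul(x, upShard)), downShard);
        for (std::size_t i = 0; i < out.size(); ++i)
            for (std::size_t j = 0; j < out[i].size(); ++j) out[i][j] += partial[i][j];
    }
    return out;
}
```

No communication is needed between the two projections because each rank's slice of the hidden activation is exactly the slice its row shard consumes. The only cross-rank step is the final sum.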

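Figure 1. Column-wise and row-wise parallel projections

Context parallelism

In context parallelism, the input sequence is partitioned across GPUs along the sequence dimension. Each GPU processes only a slice of the sequence, while collective operations make the global sequence available where it is needed, such as during attention. Context parallelism is particularly effective for long-sequence workloads, where attention's quadratic scaling with sequence length makes it the dominant consumer of compute and memory.

It is also an especially natural fit for diffusion and DiT models, whose bidirectional attention sidesteps the load-imbalance issues that arise with causal masks.

For more detail on context parallelism, see the article "Context Parallelism for Scalable Million-Token Inference".

NVIDIA TensorRT 11.0 introduces support for the IDistCollectiveLayer primitives required by the various parallelization strategies. The remainder of this post focuses on context parallelism, which directly addresses the dominant cost in modern generative media pipelines: long-sequence attention.

Context parallelism for generative media

Diffusion-based image and video generation pipelines spend a large fraction of their compute and memory budget inside attention blocks operating over long token sequences. A high-resolution image latent or a multi-frame video clip can produce sequences of tens of thousands of tokens per block, and attention cost grows with the square of the sequence length.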
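
As a rough illustration of the quadratic growth, the sketch below counts entries of the Q × Kᵀ score matrix for one attention head. The sequence length of 32,768 tokens and the 8-way split are assumed example values, not figures from the benchmarks, and the function names are hypothetical.

```cpp
#include <cstdint>

// Entries in the Q x K^T score matrix of one head for seqLen tokens.
constexpr std::uint64_t scoreEntries(std::uint64_t seqLen) {
    return seqLen * seqLen;
}

// Entries one rank computes when the queries are split evenly across `ranks`
// while the keys still cover the full sequence.
// Assumes seqLen is divisible by ranks.
constexpr std::uint64_t scoreEntriesPerRank(std::uint64_t seqLen, std::uint64_t ranks) {
    return (seqLen / ranks) * seqLen;
}

// 32,768 tokens give about 1.07 billion score entries per head.
static_assert(scoreEntries(32768) == 1073741824ULL, "2^15 squared is 2^30");
// Split 8 ways, each rank computes about 134 million of them.
static_assert(scoreEntriesPerRank(32768, 8) == 134217728ULL, "2^30 / 8 is 2^27");
```

Doubling the sequence length quadruples the first number, while the per-rank share shrinks linearly with the number of ranks.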

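AllGather KV

Context parallelism partitions the sequence across GPUs. Each rank processes the slice of the queries (Q) that corresponds to its sequence partition. A straightforward way to implement context parallelism is the AllGather KV approach: before computing local attention, the ranks exchange their key (K) and value (V) shards through an AllGather collective, so that each rank can attend over the full sequence. Each rank then produces the attention output for its own query slice, computed against the full sequence, at the cost of one additional collective per attention block, while the local Q × Kᵀ matrix multiplication shrinks in proportion to the number of ranks.

For video and high-resolution image diffusion, this trade-off compounds favorably across denoising steps. Communication overhead per step remains bounded by the sequence-dimension AllGather, while the compute and memory savings apply to every attention layer in every step.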
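
The AllGather KV data flow can be sketched in-process for a single attention head without masking. Ranks are simulated with a loop and the AllGather is simulated by concatenating shards in rank order. The names Matrix, attention, sliceRows, and allGatherKvAttention are hypothetical helpers for this illustration and are not TensorRT or NCCL APIs.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // [tokens][headDim]

// Softmax attention of every query row over all rows of k and v
// (single head, no mask, scores scaled by 1/sqrt(headDim)).
Matrix attention(const Matrix& q, const Matrix& k, const Matrix& v) {
    const double scale = 1.0 / std::sqrt(static_cast<double>(q[0].size()));
    Matrix out(q.size(), std::vector<double>(v[0].size(), 0.0));
    for (std::size_t i = 0; i < q.size(); ++i) {
        std::vector<double> w(k.size());
        for (std::size_t j = 0; j < k.size(); ++j)
            w[j] = scale * std::inner_product(q[i].begin(), q[i].end(), k[j].begin(), 0.0);
        const double maxScore = *std::max_element(w.begin(), w.end());
        double denom = 0.0;
        for (double& x : w) { x = std::exp(x - maxScore); denom += x; }
        for (std::size_t j = 0; j < k.size(); ++j)
            for (std::size_t d = 0; d < out[i].size(); ++d)
                out[i][d] += w[j] / denom * v[j][d];
    }
    return out;
}

// Rows [first, last) of m: the shard owned by one rank.
Matrix sliceRows(const Matrix& m, std::size_t first, std::size_t last) {
    return Matrix(m.begin() + first, m.begin() + last);
}

// Context-parallel attention with a simulated AllGather of K and V.
// Assumes the sequence length is divisible by `ranks`.
Matrix allGatherKvAttention(const Matrix& q, const Matrix& k, const Matrix& v,
                            std::size_t ranks) {
    const std::size_t shard = q.size() / ranks;
    // AllGather: concatenate every rank's K and V shard in rank order.
    Matrix gatheredK, gatheredV;
    for (std::size_t r = 0; r < ranks; ++r) {
        const Matrix kShard = sliceRows(k, r * shard, (r + 1) * shard);
        const Matrix vShard = sliceRows(v, r * shard, (r + 1) * shard);
        gatheredK.insert(gatheredK.end(), kShard.begin(), kShard.end());
        gatheredV.insert(gatheredV.end(), vShard.begin(), vShard.end());
    }
    // Each rank attends with its local Q slice only; outputs stay sequence-sharded.
    Matrix out;
    for (std::size_t r = 0; r < ranks; ++r) {
        const Matrix localQ = sliceRows(q, r * shard, (r + 1) * shard);
        const Matrix localOut = attention(localQ, gatheredK, gatheredV);
        out.insert(out.end(), localOut.begin(), localOut.end());
    }
    return out;
}
```

Stacking the per-rank outputs in rank order reproduces the single-device result, because every query row sees the same full K and V in both cases. The price is that each rank materializes the complete K and V tensors.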

Figure 2. AllGather KV strategy for context parallelism

Ring Attention

Context parallelism can be implemented in various ways, each with its own trade-offs.

One potential improvement over the AllGather KV method is Ring Attention, which overlaps communication with computation. Each GPU processes its local Q while blocks of K and V stream past it around a ring topology. Ring Attention also reduces the memory footprint: because it uses an online softmax, the full-size K and V tensors never need to be materialized on any GPU. To learn more, see the article "Ring Attention with Blockwise Transformers for Near-Infinite Context".
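
The online softmax that lets Ring Attention avoid materializing the full K and V can be sketched as follows. This is a sequential, single-head, unmasked simulation: ranks and ring steps are loops, there is no real communication or overlap, and the names scaledDot, referenceAttention, ringAttention, and maxAbsDiff are hypothetical helpers for this illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // [tokens][headDim]

// Dot product scaled by 1/sqrt(headDim).
double scaledDot(const std::vector<double>& a, const std::vector<double>& b) {
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0) /
           std::sqrt(static_cast<double>(a.size()));
}

// Reference attention that materializes all scores for each query.
Matrix referenceAttention(const Matrix& q, const Matrix& k, const Matrix& v) {
    Matrix out(q.size(), std::vector<double>(v[0].size(), 0.0));
    for (std::size_t i = 0; i < q.size(); ++i) {
        std::vector<double> w(k.size());
        for (std::size_t j = 0; j < k.size(); ++j) w[j] = scaledDot(q[i], k[j]);
        const double maxScore = *std::max_element(w.begin(), w.end());
        double denom = 0.0;
        for (double& x : w) { x = std::exp(x - maxScore); denom += x; }
        for (std::size_t j = 0; j < k.size(); ++j)
            for (std::size_t d = 0; d < out[i].size(); ++d)
                out[i][d] += w[j] / denom * v[j][d];
    }
    return out;
}

// Ring-style attention: each rank sees one K/V block per step and folds it
// into running (max, denominator, accumulator) state for its local queries.
// Assumes the sequence length is divisible by `ranks`.
Matrix ringAttention(const Matrix& q, const Matrix& k, const Matrix& v, std::size_t ranks) {
    const std::size_t shard = q.size() / ranks;
    const std::size_t dim = v[0].size();
    Matrix out(q.size(), std::vector<double>(dim, 0.0));
    for (std::size_t r = 0; r < ranks; ++r) {  // ranks run concurrently on real hardware
        std::vector<double> runMax(shard, -std::numeric_limits<double>::infinity());
        std::vector<double> denom(shard, 0.0);
        Matrix acc(shard, std::vector<double>(dim, 0.0));
        for (std::size_t step = 0; step < ranks; ++step) {
            const std::size_t src = (r + ranks - step) % ranks;  // owner of the block held now
            for (std::size_t i = 0; i < shard; ++i) {
                for (std::size_t j = src * shard; j < (src + 1) * shard; ++j) {
                    const double s = scaledDot(q[r * shard + i], k[j]);
                    const double newMax = std::max(runMax[i], s);
                    const double rescale = std::exp(runMax[i] - newMax);  // 0 on the first key
                    const double w = std::exp(s - newMax);
                    denom[i] = denom[i] * rescale + w;
                    for (std::size_t d = 0; d < dim; ++d)
                        acc[i][d] = acc[i][d] * rescale + w * v[j][d];
                    runMax[i] = newMax;
                }
            }
        }
        for (std::size_t i = 0; i < shard; ++i)
            for (std::size_t d = 0; d < dim; ++d) out[r * shard + i][d] = acc[i][d] / denom[i];
    }
    return out;
}

// Largest absolute element-wise difference between two equally sized matrices.
double maxAbsDiff(const Matrix& a, const Matrix& b) {
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < a[i].size(); ++j)
            worst = std::max(worst, std::abs(a[i][j] - b[i][j]));
    return worst;
}
```

At any moment a rank holds only one K/V block plus a small running state per local query, and the result matches the reference up to floating-point rounding. A real implementation would send the current block to the next rank while computing on it.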

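Figure 3. Ring Attention strategy for context parallelism

DeepSpeed Ulysses

For long contexts (tens of thousands of tokens), an alternative way to implement context parallelism is DeepSpeed Ulysses. It first partitions individual samples along the sequence dimension across the participating GPUs. Before the attention computation, it applies an all-to-all collective to the partitioned Q, K, and V.

After this exchange, each GPU holds the full sequence length, but only for a non-overlapping subset of the attention heads, so the GPUs can compute attention in parallel. Finally, a second all-to-all gathers the results across the attention heads while repartitioning them along the sequence dimension. For more on context parallelism for long contexts, see the article "DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models".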
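
The two layout changes can be traced with a small sketch that moves only (token, head) labels rather than tensor data. Ranks are simulated as vector entries, and the names Slot, RankData, sequenceSharded, allToAll, seqToHead, and headToSeq are hypothetical helpers for this illustration. It assumes the sequence length and the head count are both divisible by the number of ranks.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Label of one per-token, per-head vector of Q, K, or V.
struct Slot {
    std::size_t token;
    std::size_t head;
};
using RankData = std::vector<Slot>;  // everything one simulated rank holds

// Starting layout: each rank owns a contiguous token range for all heads.
std::vector<RankData> sequenceSharded(std::size_t seqLen, std::size_t heads,
                                      std::size_t ranks) {
    const std::size_t tokensPerRank = seqLen / ranks;
    std::vector<RankData> out(ranks);
    for (std::size_t t = 0; t < seqLen; ++t)
        for (std::size_t h = 0; h < heads; ++h) out[t / tokensPerRank].push_back({t, h});
    return out;
}

// Simulated all-to-all: every slot moves to the rank chosen by `destination`;
// a receiver concatenates what it gets in source-rank order.
std::vector<RankData> allToAll(const std::vector<RankData>& in,
                               const std::function<std::size_t(const Slot&)>& destination) {
    std::vector<RankData> out(in.size());
    for (std::size_t dst = 0; dst < in.size(); ++dst)
        for (const RankData& src : in)
            for (const Slot& s : src)
                if (destination(s) == dst) out[dst].push_back(s);
    return out;
}

// First all-to-all: sequence-sharded -> head-sharded (full sequence per rank).
std::vector<RankData> seqToHead(const std::vector<RankData>& in, std::size_t heads) {
    const std::size_t headsPerRank = heads / in.size();
    return allToAll(in, [headsPerRank](const Slot& s) { return s.head / headsPerRank; });
}

// Second all-to-all: head-sharded -> sequence-sharded (all heads per rank).
std::vector<RankData> headToSeq(const std::vector<RankData>& in, std::size_t seqLen) {
    const std::size_t tokensPerRank = seqLen / in.size();
    return allToAll(in, [tokensPerRank](const Slot& s) { return s.token / tokensPerRank; });
}
```

For 4 tokens, 4 heads, and 2 ranks, rank 0 starts with tokens 0 and 1 for all heads, holds all 4 tokens for heads 0 and 1 after seqToHead, and returns to a token-sharded layout after headToSeq. The per-rank element count is the same in every layout, which is why the approach needs the head count to be divisible by the number of ranks.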

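Figure 4. DeepSpeed Ulysses strategy for context parallelism

Benchmarks: Media generation with context parallelism in C++

The following benchmarks evaluate multi-device TensorRT inference for media generation workloads intended for C++ production deployment. Two representative generative AI pipelines are used: a video generation pipeline based on NVIDIA Cosmos 3 and an image generation pipeline based on FLUX.1.

These pipelines were first authored in PyTorch, then converted out of the framework using Torch-TensorRT to produce NVIDIA TensorRT engines suitable for deployment in C++ inference applications. This workflow lets developers keep PyTorch as the model development environment while deploying optimized TensorRT engines in production systems.

The benchmarks compare end-to-end latency across three context parallelism strategies: AllGather KV, Ring Attention, and Ulysses. All results were collected on a single node with 8 GPUs.

Video generation with NVIDIA Cosmos 3

The NVIDIA Cosmos model platform is a world foundation model platform, and the Cosmos3-Nano model can generate images, video, audio, and other formats from multimodal inputs, including text, images, and video. The benchmarks used the example prompt file. In these benchmarks, Ulysses is the clear winner when a diffusion model has very long context lengths (on the order of tens of thousands of input tokens).

Figure 5. NVIDIA Cosmos 3 end-to-end (E2E) latencies in milliseconds on N GPUs with different context parallelism (CP) strategies

Figure 6. NVIDIA Cosmos 3 backbone speedup on GPUs with different context parallelism strategies

Figure 7. Sample outputs of the NVIDIA Cosmos 3 model on 8 GPUs with different CP strategies

Image generation with FLUX.1

The FLUX.1-dev model from Black Forest Labs generates images from text descriptions. The benchmarks used the prompt "a beautiful photograph of Mt. Fuji during cherry blossom". In these benchmarks, the Ulysses strategy is the winner for image generation as well, although Ring Attention also scaled well to 4 GPUs.

Figure 8. FLUX.1 end-to-end (E2E) latencies in milliseconds on N GPUs with different context parallelism (CP) strategies

Figure 9. FLUX.1 backbone speedup on GPUs with different CP strategies

Figure 10. Sample outputs of the Black Forest Labs FLUX.1 model on 8 GPUs with different CP strategies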

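Getting started using TensorRT with the multi-device feature

TensorRT supports multi-device inference, enabling a single network to execute across multiple GPUs through integrated distributed communication primitives. The core workflow is similar to that of single-device TensorRT. The difference is that the network can now include distributed communication layers.

This guide assumes that the same network is deployed on all GPU ranks (a rank identifies one participating process in the parallel job). This is not a strict requirement and, in theory, each rank can run a different model.

A working sample is provided in the TensorRT repository. The following guide describes step by step how to use the new multi-device feature.

Prerequisites

Download TensorRT 11 from the NVIDIA Developer Portal.

Install TensorRT 11 following the installation instructions.

Get a single-node, multi-GPU machine.

Install OpenMPI in your chosen development environment (bare metal or in a container).

Create a network for multi-device inference

At the network level, multi-device inference is enabled through IDistCollectiveLayer, which provides cross-GPU communication. Collective operations can be added directly to a TensorRT network using INetworkDefinition::addDistCollective: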

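using namespace nvinfer1;

// Create a strongly typed network definition.
auto network = std::unique_ptr<INetworkDefinition>(builder->createNetworkV2(
    1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kSTRONGLY_TYPED)));

// Declare a 3x4 FP32 input, then fetch it back as a tensor reference.
auto* input = network->addInput("input", DataType::kFLOAT, Dims2{3, 4});
ITensor& inputTensor = *network->getInput(0);

// Add an AllReduce collective that sums the input across ranks.
auto* collectiveLayer = network->addDistCollective(
    inputTensor,
    CollectiveOperation::kALL_REDUCE,
    ReduceOperation::kSUM,
    -1,
    nullptr,
    0);

// The collective spans 8 ranks.
collectiveLayer->setNbRanks(8);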

For redu

