NVIDIA Developer Blog · April 2, 2026 00:00 · about a 1-minute read

NVIDIA Platform Achieves Lowest Token Cost Through Extreme Co-Design

#AIInfrastructure #InferenceOptimization #TokenCost #CoDesign #NVIDIAPlatform #AIFactory

TL;DR

NVIDIA announced that its platform, built on extreme co-design of hardware, software, and models, delivers the highest AI factory throughput and the lowest token cost.

AI in-depth analysis · April 4, 2026 06:42

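Key points

Cost optimization through extreme co-design

By designing hardware, software, and AI models as one system (co-design), NVIDIA has lowered the per-token cost of AI inference to the lowest level in the industry.

Maximizing AI factory throughput

Rather than targeting peak performance alone, the design philosophy maximizes throughput across the entire AI factory, in pursuit of efficient large-scale AI operations.

More sophisticated measurement criteria

Performance evaluation moves beyond "peak performance" to new measurement criteria that emphasize overall efficiency in real operating environments (token cost and throughput).

A platform-wide optimization approach

The post credits integrated optimization of the NVIDIA platform as a whole, rather than performance gains in individual components, as the key to these results.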

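Impact analysis

This announcement suggests that the focus of AI infrastructure competition is shifting from raw compute performance to total cost of ownership (TCO) and operational efficiency. It makes clear NVIDIA's strategy of building an advantage over competitors through vertically integrated, platform-wide optimization, and it strengthens the company's value proposition to cloud providers and large-scale AI operators.

Editorial comment

Optimizing AI operating costs is an urgent issue for the entire industry, and it is notable that NVIDIA is tackling it head-on through platform integration. However, because no concrete numerical comparisons or benchmarks against competing products are presented, the actual degree of advantage remains to be verified.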

Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak chip specifications. Rigorous AI inference performance benchmarks are critical to understanding real-world token output, which drives AI factory revenue.

MLPerf Inference v6.0 is the latest in a series of industry benchmarks that measure performance across a wide range of model architectures and use cases. In this latest round, systems powered by NVIDIA Blackwell Ultra GPUs delivered the highest throughput across the widest range of models and scenarios. This brings the cumulative NVIDIA MLPerf training and inference wins since 2018 to 291, which is 9x the total of all other submitters combined.

This round, the NVIDIA partner ecosystem participated broadly, with 14 partners—the largest number of partners submitting on any platform. ASUS, Cisco, CoreWeave, Dell Technologies, GigaComputing, Google Cloud, HPE, Lenovo, Nebius, Netweb Technology, Quanta Cloud Technology (QCT), Red Hat, Supermicro, and Lambda have delivered excellent performance on the NVIDIA platform.

Figure 1. NVIDIA Delivers 9x More Cumulative MLPerf Training and Inference wins

This post takes a closer look at the latest benchmark updates, the industry-leading performance achieved on the NVIDIA platform, and the full-stack engineering that makes it possible.

New benchmarks, new performance records

The MLPerf Inference benchmark suite is routinely updated to ensure that it reflects models, modalities, use cases, and deployment scenarios that matter to the community. Only the NVIDIA platform submitted results on all newly added models and scenarios this round, and delivered the highest performance across all of them.

This round of MLPerf Inference added several new tests, including:

DeepSeek-R1 Interactive: Following the addition of the DeepSeek-R1 reasoning LLM, based on a sparse mixture-of-experts (MoE) architecture, in MLPerf Inference v5.1, MLCommons added a new Interactive scenario with a 5x faster minimum token rate and a 1.3x shorter time to first token compared to the Server scenario, representing higher-interactivity deployments.

Qwen3-VL-235B-A22B: Vision-language model with a total of 235B parameters. This represents the first multi-modal model in the MLPerf Inference suite. Two scenarios are tested: Offline and Server.

GPT-OSS-120B: 120B-parameter MoE reasoning LLM, developed by OpenAI. This benchmark includes three scenarios: Offline, Server, and Interactive.

WAN-2.2-T2V-A14B: 4B-parameter text-to-video generative AI model. Two scenarios tested: single-stream, which measures the latency to process a single video generation request, and offline, which measures the number of samples processed per second in a batch-processing scenario.

DLRMv3: A generative recommendation benchmark that replaces the DLRM-DCNv2 test. It uses a transformer-based architecture that increases model size and compute intensity compared to the prior benchmark. It tests offline and server scenarios.

| Benchmark | DeepSeek-R1 | GPT-OSS-120B | Qwen3-VL | Wan 2.2 | DLRMv3 |
|---|---|---|---|---|---|
| Offline | 2,494,310 tokens/sec* | 1,046,150 tokens/sec | 79 samples/sec | 0.059 samples/sec | 104,637 samples/sec |
| Server | 1,555,110 tokens/sec* | 1,096,770 tokens/sec | 68 queries/sec | 21 secs (Single Stream)** | 99,997 queries/sec |
| Interactive | 250,634 tokens/sec | 677,199 tokens/sec | *** | *** | *** |

Table 1. NVIDIA platform throughput on newly added workloads and scenarios in MLPerf Inference v6.0

\* Not a new scenario in MLPerf Inference v6.0
\*\* Wan 2.2 features a single stream scenario, which measures end-to-end request latency, instead of a server scenario. Lower is better.
\*\*\* Not tested in MLPerf Inference v6.0

MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 6.0-0039, 6.0-0073, 6.0-0075, 6.0-0076, 6.0-0078, 6.0-0081, 6.0-0094. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

Figure 2. NVIDIA achieves 2.7x performance gain and 2.5M token/s on DeepSeek-R1

NVIDIA TensorRT-LLM software updates unlock up to 2.7x performance gains on the same Blackwell Ultra GPUs

NVIDIA continually optimizes the performance of its software stack to increase delivered token throughput from existing platforms. This delivers reductions in token production cost and enables AI factory operators to serve more users to generate more revenue with a given infrastructure footprint.

The additional performance also provides headroom to run future AI models and serve existing models in demanding scenarios, such as higher token rates and longer contexts. This continual improvement makes it possible for NVIDIA GPUs introduced years ago to remain productive, at high utilization rates, in the cloud.

This round, NVIDIA GB300 NVL72, launched last year, delivered up to 2.7x higher token throughput compared to its debut submissions just six months ago on the server scenario of the DeepSeek-R1 benchmark [1]. This means 2.7x more tokens from the same GB300 NVL72-based infrastructure and power footprint, reducing the cost to manufacture each token by more than 60%. This speedup, achieved by NVIDIA partner Nebius, showcases a core advantage of the NVIDIA platform: an open, expansive ecosystem where customers and partners can uniquely optimize and innovate on top of our software stack.

[1] MLPerf Inference v5.1 and v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 5.1-0072, 6.0-0081. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
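
The link between the 2.7x throughput gain and the "more than 60%" cost figure can be checked with simple arithmetic. The sketch below assumes, as the paragraph above does, that infrastructure and power cost per unit time stay fixed, so cost per token scales as the inverse of throughput. The function name is illustrative and not part of any NVIDIA tool.

```python
def cost_reduction_from_speedup(speedup: float) -> float:
    """Fractional drop in cost per token when throughput rises by `speedup`
    on a fixed infrastructure and power budget (cost per token ~ 1 / throughput)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


reduction = cost_reduction_from_speedup(2.7)
print(f"{reduction:.1%}")  # prints 63.0%
```

Under that assumption a 2.7x speedup removes roughly 63% of the per-token cost, which is consistent with the stated reduction of more than 60%.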

Powering the DeepSeek-R1 performance improvements in the server and offline scenarios were several software enhancements, including:

Faster kernels: A combination of higher-performance kernels and the use of fewer kernels through kernel fusion.

Optimized Attention Data Parallel: Better balancing of context requests across ranks, enabling significant speedups in end-to-end performance.

The latest features of the open source NVIDIA TensorRT-LLM inference serving software and the NVIDIA Dynamo open source distributed inference serving framework were used to support the newly added and more challenging DeepSeek-R1 Interactive scenario. This includes:

Disaggregated serving: This capability in Dynamo separates the two inference phases (prefill and decode) and optimizes the configuration of each one individually, enabling optimal overall throughput.

Wide Expert Parallel (WideEP): For higher-interactivity scenarios, execution time for MoE models is bound by expert weight load time. By splitting, or sharding, the experts across multiple GPUs across NVL72 nodes, this bottleneck is reduced, improving end-to-end performance.

Multi-Token Prediction (MTP): At higher interactivity levels, batch sizes are smaller, and performance is dominated by how quickly weights can load into memory, leaving compute performance underutilized. By applying compute that would otherwise go unutilized to predict and verify additional tokens in parallel (up to three in this implementation), throughput at high interactivity is increased.

KV-aware routing: This capability of Dynamo routes inference requests by evaluating their compute costs across different workers.
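
As a rough illustration of the idea behind KV-aware routing, the following sketch scores each worker by the prefill work left after reusing cached prefix blocks, adds a load penalty, and picks the cheapest worker. It is not Dynamo's API or its actual cost function: `Worker`, `route`, `block_hashes`, the block size, and the load weight are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative value)


def block_hashes(tokens: list[int]) -> list[int]:
    """Hash every prefix that ends on a block boundary, so equal hashes imply equal prefixes."""
    return [hash(tuple(tokens[:end]))
            for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE)]


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_blocks: set[int] = field(default_factory=set)

    def cached_prefix_blocks(self, hashes: list[int]) -> int:
        """Count the leading blocks of a request that are already in this worker's cache."""
        count = 0
        for h in hashes:
            if h not in self.cached_blocks:
                break
            count += 1
        return count


def route(tokens: list[int], workers: list[Worker], load_weight: float = 1.0) -> Worker:
    """Pick the worker with the lowest estimated cost: uncached prefill blocks plus a load penalty."""
    hashes = block_hashes(tokens)

    def cost(w: Worker) -> float:
        uncached = len(hashes) - w.cached_prefix_blocks(hashes)
        return uncached + load_weight * w.active_requests

    best = min(workers, key=cost)
    best.cached_blocks.update(hashes)  # the chosen worker will now hold these blocks
    best.active_requests += 1
    return best


workers = [Worker("w0"), Worker("w1")]
system_prompt = list(range(16))  # shared 16-token prefix = 4 cache blocks
first = route(system_prompt + [100, 101, 102, 103], workers)
second = route(system_prompt + [200, 201, 202, 203], workers)
print(first.name, second.name)  # prints: w0 w0
```

In this toy run the second request goes to the same worker as the first even though that worker is busier, because reusing four cached prefix blocks outweighs the load penalty. A larger `load_weight` would shift the balance toward the idle worker.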

NVIDIA was the first and only platform to submit DeepSeek-R1 results on MLPerf Inference when the benchmark debuted last year. This round, NVIDIA not only increased performance on returning scenarios for DeepSeek-R1 but was once again the only platform to submit on the newly added interactive scenario.

And even on Llama 3.1 405B, a very large, dense LLM launched almost two years ago, GB300 NVL72 performance increased by 1.5x in the server scenario.

| Benchmark | GB300 NVL72 v5.1 | GB300 NVL72 v6.0 | Speedup |
|---|---|---|---|
| DeepSeek-R1 (Server) | 2,907 tokens/sec/gpu | 8,064 tokens/sec/gpu | 2.77x |
| DeepSeek-R1 (Offline) | 5,842 tokens/sec/gpu | 9,821 tokens/sec/gpu | 1.68x |
| Llama 3.1 405B (Server) | 170 tokens/sec/gpu | 259 tokens/sec/gpu | 1.52x |
| Llama 3.1 405B (Offline) | 224 tokens/sec/gpu | 271 tokens/sec/gpu | 1.21x |

Table 2. Performance improvements, normalized on a per-GPU basis, on DeepSeek-R1 and Llama 3.1 405B server and offline scenarios in v6.0 compared to v5.1

MLPerf Inference v5.1 and v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 5.1-0072, 6.0-0017, 6.0-0078, 6.0-0082. Per chip performance is derived by dividing total throughput by the number of reported chips. Per-chip performance is not a primary metric of MLPerf Inference v5.1 or v6.0. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
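
The speedup column in Table 2 follows directly from the two per-GPU throughput columns. A short sketch that recomputes it from the published per-GPU figures (the dictionary layout and helper name are illustrative):

```python
# Per-GPU throughput in tokens/sec/gpu, copied from Table 2 as (v5.1, v6.0).
table2 = {
    "DeepSeek-R1 (Server)": (2907, 8064),
    "DeepSeek-R1 (Offline)": (5842, 9821),
    "Llama 3.1 405B (Server)": (170, 259),
    "Llama 3.1 405B (Offline)": (224, 271),
}


def speedup(old: float, new: float) -> float:
    """Ratio of new throughput to old throughput."""
    return new / old


speedups = {name: speedup(old, new) for name, (old, new) in table2.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```

Rounded to two decimals, the ratios come out to 2.77x, 1.68x, 1.52x, and 1.21x, matching the table.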

Additionally, NVIDIA submissions on the newly added multimodal, video generation, and recommendation benchmarks were powered by open source software frameworks optimized for the NVIDIA platform. The Qwen3-VL vision-language submission used the vLLM open source framework, showing how the community is rapidly building advanced multimodal optimizations to accelerate image-heavy inference workloads on the latest GPUs like NVIDIA Blackwell Ultra. The WAN-2.2 text-to-video submission used TensorRT-LLM VisualGen, which accelerates diffusion-based video generation pipelines on NVIDIA GPUs.

For DLRMv3, the submission was built on two open-source projects: the NVIDIA recsys-example for high-performance transformer-based recommendation inference, and NV Embedding Cache for GPU-accelerated embedding table lookups. Both were critical to achieving record throughput on this more demanding generative recommendation benchmark.

Through extensive and ongoing engineering, NVIDIA is continually increasing performance on existing hardware and existing models, as evidenced by these results. At the same time, NVIDIA collaborates closely with model builders and open source inference frameworks to ensure that the latest models run on the NVIDIA platform on the day of launch.

Scale-out inference with NVIDIA Quantum-X800 InfiniBand platform enables millions of tokens per second

NVIDIA also set new throughput records at scale on the DeepSeek-R1 model in the offline and server scenarios by submitting results using four GB300 NVL72 systems interconnected with NVIDIA Quantum-X800 InfiniBand scale-out networking.

| DeepSeek-R1 (4x GB300 NVL72) | Tokens/Second |
|---|---|
| Offline | 2,494,310 |
| Server | 1,555,110 |

Table 3. DeepSeek-R1 throughput on four GB300 NVL72 systems scaled up with NVLink and scaled out with NVIDIA Quantum-X800 InfiniBand

MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 6.0-0076. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

With 288 Blackwell Ultra GPUs—the largest scale ever submitted to any benchmark in MLPerf Inference—submissions set new system-level throughput records, enabling millions of tokens processed per second.
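
For a sense of scale, the system-level totals can be normalized by GPU count, using the same divide-by-reported-chips derivation described in the Table 2 note. This is a back-of-the-envelope sketch; per-chip performance is not a primary MLPerf Inference metric, and the variable names are illustrative.

```python
GPUS_PER_NVL72 = 72  # GPUs in one GB300 NVL72 system
SYSTEMS = 4
total_gpus = SYSTEMS * GPUS_PER_NVL72  # 288, matching the GPU count stated above

# System-level DeepSeek-R1 throughput in tokens/sec for the four-system submission.
throughput = {"Offline": 2_494_310, "Server": 1_555_110}

per_gpu = {scenario: total / total_gpus for scenario, total in throughput.items()}
for scenario, value in per_gpu.items():
    print(f"{scenario}: {value:,.0f} tokens/sec/gpu")
```

This works out to roughly 8,661 tokens/sec/gpu offline and 5,400 tokens/sec/gpu in the server scenario across the 288 GPUs.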

Looking ahead to MLPerf Endpoints

Delivered inference throughput takes extreme co-design across many chips, system architecture, data center design, and software. The latest MLPerf Inference v6.0 results show that NVIDIA yields unmatched inference throughput across the broadest range of workloads, from massive LLMs to advanced vision language models, to generative recommender systems and more, on industry-standard benchmarks.

AI inference workloads also continue to evolve rapidly, as model sizes grow and context lengths rise. As agentic AI becomes more prevalent, premium use cases that require ultra-fast token rates are emerging.

NVIDIA has been working, as part of the MLCommons consortium, to lead the definition of the MLPerf Endpoints benchmark. MLPerf Endpoints will give the community a rigorous, auditable picture of how deployed services perform under real API traffic—capturing key performance metrics that chip-level benchmarks alone cannot reveal—while providing the rigor and result integrity that defines MLPerf benchmarks.

To explore the latest performance on the NVIDIA platform across training, inference, and high-performance computing, please see our deep learning product performance page.

Acknowledgements

NVIDIA MLPerf Inference v6.0 results reflect the work of many talented engineers across the company. We’d like to acknowledge the contributions of the following individuals (sorted by last name):

Vedaanta Agarwalla, Tomar Bar-on, Nitin Sai Bommi, John Angel Calderon Espinoza, Bin Chai, Viraat Chandra, Alice Cheng, Jerry Chen, Xiaoming Chen, Jesus Corbal San Adrian, Ashutosh Dhar, Kefeng Duan, Yubo Gao, Anerudhan Gopal, Wookje Han, Max Hu, Kyle Huang, Kris Hung, Rashid Kaleem, Khubaib Khubaib, Zihao Kong, Tin-Yin Lai, Tao Li, Forrest Lin, Wanqian Li, Alex Liu, Mingyuan Ma, Baorun Mu, Jintao Peng, Yuxian Qiu, Junyi Qiu, Xiaowei Shi, Qidong Su, Olivia Stoner, Jacob Subag, Jiayu Sun, Tong Tong, Harshil Vagadia, Shobhit Verma, Shang Wang, June Yang, Tailing Yuan, Ben Zhang, Zhanda Zhu, and many others across NVIDIA whose efforts made these results possible.
