Hardware-Based AI Security That Doesn't Slow You Down

AI has transformed how organizations operate, driving unprecedented levels of productivity and innovation. However, AI adoption can be impeded by concerns surrounding data privacy, sovereignty and how to secure data while it is in use, or during inference and engagement with AI models. NVIDIA Confidential Computing (CC) was engineered to be a secure and performant solution for the era of agentic AI to scale any model securely.

CC protects enterprise data, proprietary model weights, and the model itself during active inference. In this post, we provide an overview of CC and present benchmarks showing that inference performance with CC enabled is nearly identical (up to 98%) to performance without CC.

Data, code, and model integrity

CC provides a security layer that spans silicon, interconnect, and system software. Here’s how it works:

Figure 1. Confidential Computing provides data and code integrity and confidentiality

Hardware root of trust

NVIDIA Blackwell GPUs, including the NVIDIA RTX PRO 6000, HGX B200, and HGX B300, are engineered with CC embedded in the hardware. The HGX B200 and HGX B300 GPUs support confidential computing across multiple GPUs (up to 8) with NVIDIA NVLink encryption. At the silicon level, the GPU maintains a private signing key that is fused at the time of manufacturing and never exposed to software, firmware, or the host system. This key is the foundation of the attestation chain.

Attestation: Verification before execution

Before a confidential workload receives any secrets, it undergoes remote attestation. The NVIDIA Remote Attestation Service (NRAS) verifies a signed evidence bundle—the GPU’s hardware report combined with CPU TEE measurements (AMD SEV-SNP or Intel TDX)—against a known-good reference integrity manifest (RIM).

Once the Confidential VM (CVM) is in a verified, unmodified state, secrets such as model decryption keys can be deployed into the CVM. The attestation handshake is typically a one-time startup event. Once the workload is running, attestation does not add latency to individual inference requests.
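The gating step described above (verify the evidence first, release secrets only afterwards) can be sketched in a few lines of Python. This is a minimal illustration and not the NRAS API: the `Evidence` structure, the measurement names, and `release_secret` are hypothetical, and an HMAC stands in for the hardware-backed asymmetric signature so that the example stays self-contained.

```python
import hashlib
import hmac
from dataclasses import dataclass

# Hypothetical stand-in for the device signing key. Real hardware uses an
# asymmetric key fused into the GPU; HMAC is used here only for illustration.
DEVICE_KEY = b"demo-device-key"


@dataclass(frozen=True)
class Evidence:
    measurements: dict  # component name -> hex digest (hypothetical layout)
    signature: bytes    # signature over the canonicalized measurements


def _canonical(measurements: dict) -> bytes:
    # Sort by component name so signer and verifier hash identical bytes.
    return "|".join(f"{k}={v}" for k, v in sorted(measurements.items())).encode()


def sign_evidence(measurements: dict) -> Evidence:
    sig = hmac.new(DEVICE_KEY, _canonical(measurements), hashlib.sha256).digest()
    return Evidence(dict(measurements), sig)


def verify_evidence(evidence: Evidence, reference: dict) -> bool:
    expected = hmac.new(
        DEVICE_KEY, _canonical(evidence.measurements), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, evidence.signature):
        return False  # evidence was not produced by the trusted key
    # Every measurement must match the known-good reference exactly.
    return evidence.measurements == reference


def release_secret(evidence: Evidence, reference: dict, secret: bytes) -> bytes:
    # Secrets (for example, a model decryption key) are withheld on failure.
    if not verify_evidence(evidence, reference):
        raise PermissionError("attestation failed; secret withheld")
    return secret
```

Because the check runs once before the key is handed over, it sits outside the per-request inference path, which matches the one-time startup behavior described above.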

![Diagram showing the attestation process, where the NVIDIA Remote Attestation Service (NRAS) verifies the hardware report and CPU TEE measurements against a reference integrity manifest to validate the Trusted Execution Environment before secrets are deployed.](https://developer-blogs.nvidia.com/wp-content/uploads/2026/07/image-1-1-625x267.png)

*Figure 2. Attestation services remotely validate the identity, configuration, and integrity of Trusted Execution Environments and issue cryptographic proof*

Optimizing AI inference performance in Confidential Computing

With CC enabled, changes in AI inference performance on Blackwell GPUs can come from two areas:

Secure work submission latency: For inference, this is often the larger factor. Encryption and kernel launches add a per-submission overhead, so smaller units of work are affected more. Increasing the amount of work performed per GPU work launch reduces the impact of the secure launch overhead.

Reduced host-to-device (CPU-to-GPU) bandwidth: If a workload depends heavily on transferring inputs to the GPU, performance depends on whether the bandwidth required to keep the GPU fully utilized exceeds the encrypted transfer bandwidth available in CC mode.
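A back-of-the-envelope model makes both effects concrete. The sketch below assumes a fixed secure-submission cost per launch; the function names and every number in it are hypothetical and are not measurements from this post.

```python
def launch_overhead_fraction(work_us: float, secure_overhead_us: float) -> float:
    """Fraction of wall time spent on secure submission for one launch."""
    return secure_overhead_us / (work_us + secure_overhead_us)


def transfer_bound(required_gbps: float, encrypted_gbps: float) -> bool:
    """True when the workload needs more host-to-device bandwidth than
    the encrypted path provides in CC mode."""
    return required_gbps > encrypted_gbps


if __name__ == "__main__":
    overhead_us = 5.0  # hypothetical fixed cost per secure launch
    for work_us in (10.0, 100.0, 1000.0):
        frac = launch_overhead_fraction(work_us, overhead_us)
        print(f"work={work_us:7.1f} us -> secure-launch share {frac:.1%}")
```

Under this model the overhead share shrinks as each launch carries more work, which is why batching more work per launch (for example, through CUDA graphs) helps in CC mode.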

Several optimizations improve inference performance with CC, including:

CC-safe autotuner timing: FlashInfer replaces event timers in CC mode with the GPU global timer register, allowing autotuners to accurately compare kernel candidates and select the fastest implementation for each shape.

Async D2H copy worker: SGLang moves per-step token readback off the scheduler’s critical path. This helps restore compute/copy overlap because CC can otherwise make many host-to-device and device-to-host copies effectively synchronous during cudaMemcpyAsync.

Piecewise CUDA graph support: SGLang adds CUDA graph replay for prefill and mixed batches, reducing kernel launch overhead that is amplified in CC mode.
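The idea behind the async D2H copy worker can be illustrated without a GPU. The sketch below is not SGLang's implementation: `blocking_readback` is a hypothetical stand-in for a device-to-host copy that behaves synchronously, and a single background thread keeps that blocking call off the scheduling loop.

```python
import queue
import threading
import time


def blocking_readback(step: int) -> int:
    # Hypothetical stand-in for a device-to-host copy that blocks the caller.
    time.sleep(0.01)
    return step * 2  # pretend this is the token id produced at this step


def run_scheduler(num_steps: int) -> list:
    pending = queue.Queue()
    results = []

    def copy_worker() -> None:
        # Drain readback requests so the scheduler never waits on a copy.
        while True:
            step = pending.get()
            if step is None:  # sentinel: no more work
                break
            results.append(blocking_readback(step))

    worker = threading.Thread(target=copy_worker)
    worker.start()
    for step in range(num_steps):
        # The scheduler only enqueues the request and moves on.
        pending.put(step)
    pending.put(None)
    worker.join()
    return results
```

A single worker consuming a FIFO queue preserves step order, so the tokens come back in the sequence they were produced while the scheduling loop itself never blocks on a copy.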

NVIDIA continues to work with upstream communities for inference frameworks to ensure these frameworks are optimized for performance.

We measured the inference performance of CC across different key metrics. Below are the details on the test setup and measurements.

Benchmark results

*Across all workload configurations tested, enabling CC mode produced minimal throughput and time per output token overhead during steady-state inference.*

Table 1 summarizes CC throughput, TTFT, and TPOT overhead on Blackwell Ultra (HGX B300) for the model Qwen/Qwen3.5-397B-A17B-FP8.

Relative Performance of Confidential Computing

Test Setup

Benchmark: Qwen 3.5 397B-A17B model at FP8 precision

Environment: Virtual machine with GPU passthrough

Baseline: Confidential Computing off

Experiment: Confidential Computing on

All other variables held constant.

Hardware Configurations

HGX B300 with Blackwell Ultra.

Software Stack

The software configuration used for the test setup is listed in Table 2.

Note: Please follow the CPU power and vCPU pinning configuration described in this document.

Workload Parameters

Each configuration was tested across a range of conditions representative of real enterprise inference workloads:

Input/output token lengths: 8192/1024, 1024/1024

Batch sizes: 4, 8, 16, 32, 64, 128, and 256 concurrent requests

Inference framework (mode): SGLang (server)

Baseline: Without --enable-symm-mem
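The parameters above define a small sweep. As a rough sketch, the run matrix can be enumerated as follows; the dictionary keys and the `build_sweep` helper are illustrative names, not part of the benchmark tooling used for this post.

```python
from itertools import product

TOKEN_LENGTHS = [(8192, 1024), (1024, 1024)]  # (input, output) tokens
CONCURRENCY = [4, 8, 16, 32, 64, 128, 256]
CC_MODES = ["off", "on"]  # baseline vs. experiment


def build_sweep() -> list:
    """Return one run description per (CC mode, token lengths, concurrency)."""
    return [
        {"cc": cc, "isl": isl, "osl": osl, "concurrency": conc}
        for cc, (isl, osl), conc in product(CC_MODES, TOKEN_LENGTHS, CONCURRENCY)
    ]
```

Each CC-on run has a CC-off counterpart with identical token lengths and concurrency, so every delta compares like with like.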

Metrics Collected

Output throughput per GPU (tokens/sec/GPU)

Median Time to First Token (TTFT): latency from request submission to first token generated, in ms

Median Time Per Output Token (TPOT): per-token generation latency in steady-state streaming, in ms
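For readers who want to reproduce these metrics from their own request logs, the sketch below shows one common way to compute them from per-token timestamps. The exact formulas used by the benchmark harness are not stated in this post, so the TPOT definition here (mean gap between output tokens after the first) is an assumption, as are the function names.

```python
from statistics import median


def ttft_ms(submit_s: float, token_times_s: list) -> float:
    """Latency from request submission to the first generated token, in ms."""
    return (token_times_s[0] - submit_s) * 1000.0


def tpot_ms(token_times_s: list) -> float:
    """Mean gap between consecutive output tokens after the first, in ms."""
    if len(token_times_s) < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    span_s = token_times_s[-1] - token_times_s[0]
    return span_s * 1000.0 / (len(token_times_s) - 1)


def throughput_per_gpu(total_output_tokens: int, wall_s: float, num_gpus: int) -> float:
    """Output tokens per second, normalized by the number of GPUs."""
    return total_output_tokens / wall_s / num_gpus


def median_over_requests(values: list) -> float:
    """Median of a per-request metric such as TTFT or TPOT."""
    return median(values)
```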

Path forward

Hardware-level security with CC protects sensitive AI workloads while preserving the performance that production deployments need.

CC provides a stronger security foundation for production inference workloads with low performance overhead. In our evaluation using Qwen 3.5 on SGLang, throughput per GPU was 1.0% to 7.5% below the CC-off baseline (Table 1) across a sweep of concurrency levels, input sequence lengths, and output sequence lengths. This indicates that organizations can secure their AI workloads and data, and remain compliant with regulations, without a significant performance penalty.

Join NVIDIA and our partners to secure your AI workloads with CC on Blackwell by accessing the resources below.

Resources

NVIDIA Developer Blog, July 2, 2026, 02:04


NVIDIA Developer Blog · July 3, 2026, 06:25 · approx. 8 min read

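Relative Performance of Confidential Computing

All values are Δ% relative to CC off.

| Concurrency | Throughput/GPU (tok/s), ISL/OSL = 1024/1024 | Median TPOT (ms), ISL/OSL = 1024/1024 | Throughput/GPU (tok/s), ISL/OSL = 8192/1024 | Median TPOT (ms), ISL/OSL = 8192/1024 |
|---|---|---|---|---|
| 4 | -2.0% | -1.6% | -3.5% | -3.6% |
| 8 | -2.6% | -2.4% | -2.8% | -2.9% |
| 16 | -5.3% | -4.9% | -2.8% | -3.0% |
| 32 | -6.3% | -7.8% | -1.0% | -0.9% |
| 64 | -6.2% | -6.8% | -2.3% | -2.4% |
| 128 | -7.5% | -8.1% | -3.5% | -3.5% |
| 256 | -4.6% | -4.1% | -3.6% | -3.7% |

*Table 1. Relative performance impact of enabling NVIDIA Confidential Computing*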
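The throughput deltas in Table 1 can be converted into relative performance and scanned for the worst case with a few lines of Python. The sketch assumes that Δ% is defined as (CC on − CC off) / CC off × 100, which the post does not state explicitly; the helper names are illustrative.

```python
# Throughput/GPU deltas (percent vs. CC off) from Table 1, keyed by concurrency.
THROUGHPUT_DELTA_PCT = {
    "1024/1024": {4: -2.0, 8: -2.6, 16: -5.3, 32: -6.3, 64: -6.2, 128: -7.5, 256: -4.6},
    "8192/1024": {4: -3.5, 8: -2.8, 16: -2.8, 32: -1.0, 64: -2.3, 128: -3.5, 256: -3.6},
}


def delta_pct(cc_on: float, cc_off: float) -> float:
    """Assumed definition of the table's delta: change relative to CC off."""
    return (cc_on - cc_off) / cc_off * 100.0


def relative_performance(delta: float) -> float:
    """Performance with CC on as a percentage of the CC-off baseline."""
    return 100.0 + delta


def worst_case(deltas: dict) -> tuple:
    """Return (concurrency, delta) for the largest throughput drop."""
    concurrency = min(deltas, key=deltas.get)
    return concurrency, deltas[concurrency]
```

For example, `worst_case(THROUGHPUT_DELTA_PCT["1024/1024"])` picks out concurrency 128, where throughput is 7.5% below the baseline.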

Software Stack

| Component | Version / Details |
|---|---|
| Platform | Intel TDX |
| Host OS | Ubuntu 25.10 |
| Host kernel | 6.17.0-20-generic |
| Guest OS | Ubuntu 24.04.4 LTS |
| Guest kernel | 6.8.0-124-generic |
| Guest vCPUs | 256 |
| Guest NUMA | 2 nodes |
| NVIDIA driver | 595.71.05 |
| VBIOS FW | 1.4.x [97.10.64.00.0C] |
| GPU power limit | 1100.00 |
| CUDA | 13.2 |
| SGLang | docker.io/lmsysorg/sglang:v0.5.12-cu130; pull requests: 28251 (SGLang) and 3638 (FlashInfer) |
| NCCL | v2.28.9-1 |
| OpenSSL | 3.6.0 |
| Orchestration | Docker container + NVIDIA Container Toolkit |

*Table 2. Software configuration of the test setup*
