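Hugging Face Blog · June 5, 2026, 03:57 · approx. 18 min read

Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI

#Nemotron #ContentSafety #Multimodal #EnterpriseAI #Compliance

TL;DR

NVIDIA has released Nemotron 3.5 Content Safety on Hugging Face, a customizable multimodal safety model for global enterprises that strengthens risk management and compliance for enterprise AI.

AI Deep Analysis · June 11, 2026, 00:18

Importance: / 5

Depth: 40%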

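Key Points

Customizable multimodal safety

Covers image inputs as well as text, and provides safety filtering that can be adjusted flexibly to a company's own policies.

Compliance support for global enterprises

Because the model can be adapted to regional regulatory requirements and industry-specific risk criteria, it supports safe AI adoption in large organizations.

Release and integration in the Hugging Face ecosystem

Makes NVIDIA's latest technology directly available on Hugging Face, so developers can easily obtain and deploy the model.

Release of the safety dataset

With Nemotron 3.5, NVIDIA is releasing its multilingual, multimodal safety dataset together with the reasoning traces used in training, addressing the shortage of open data for OSS safety models and the multimodal domain.

Gemma 3-based architecture

Built on Google Gemma 3 4B IT, with a LoRA adapter that adds safety classification behavior, yielding a compact model that can run in real time on GPUs with 8GB or more of VRAM.

Flexible inference modes

Supports three output modes to suit the use case: a low-latency binary verdict for latency-constrained settings, a binary verdict with categories, and THINK mode, which adds a step-by-step reasoning trace.

Customizable and auditable reasoning

Dynamically interprets and applies domain-specific policies defined in natural language, and provides the step-by-step rationale behind each verdict (a reasoning trace), supporting compliance and auditing in regulated industries.

Key Quotes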

Customizable Multimodal Safety for Global Enterprise AI

Tailor safety filters to your specific organizational policies and regional compliance requirements.

With Nemotron 3.5, we are releasing our safety dataset. This is an important milestone since most OSS safety models don't generally provide the training or evaluation sets.

NVIDIA fine-tunes this base with a LoRA adapter that installs targeted safety classification behavior while keeping the model compact enough for real-time deployment on 8GB+ VRAM GPUs.

99% of training images are real photographs—not synthetic generations. This directly addresses a known weakness in the multimodal safety benchmark landscape...

Reasoning allows a content safety model to dynamically interpret and enforce custom, domain-specific policies defined in natural language at the time of inference.

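Impact Analysis

This release is a significant step that gives enterprises their own customization authority in addressing the safety risks that accompany the spread of large language models and generative AI. When deploying AI in heavily regulated industries and regions in particular, it enables strict control matched to a company's own standards rather than generic filtering, so the practical impact is very large.

Editorial Comment

Safety is one of the biggest bottlenecks in putting generative AI into practical use, and the customization NVIDIA provides here lets companies operate flexibly according to their own risk tolerance. For large enterprises planning global deployment in particular, it is an important piece of ready-to-use infrastructure.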

What's New in Nemotron 3.5 Content Safety
1. Unified Multimodal Evaluation
2. Global Language Coverage
3. Custom Policy Enforcement
4. Reasoning Traces (THINK Mode)
5. Safety Dataset

Model Architecture
Reasoning
Training Data
Benchmarking
Latency
Addressing the Benchmark Gap
Getting Started

The last two years have seen NVIDIA's content safety stack grow from a focused English text classifier into a family of specialized models—each extending coverage to new modalities, languages, and inference modes. Nemotron 3 Content Safety, released in March 2026, combined multimodal and multilingual capabilities for the first time in a single 4B-parameter model. Today, we are releasing Nemotron 3.5 Content Safety, which completes that arc: a single model that unifies multimodal input, multilingual reach, custom enterprise policy enforcement, and auditable reasoning into one inference call.

This post covers what changes in 3.5, the design decisions behind each new capability, and how to integrate the model into production safety pipelines.

What's New in Nemotron 3.5 Content Safety

1. Unified Multimodal Evaluation

Nemotron 3 introduced image understanding; Nemotron 3.5 deepens the multimodal integration. The model takes a user prompt, an optional image, and an optional assistant response as a single context window and produces a coherent safety verdict over the combined input. Evaluating all three together—rather than scoring each independently—closes a well-known gap in multimodal safety scenarios: policy violations that only emerge from the *interaction* between text and image, or between request and response, are now caught in a single pass.
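To make the input shape concrete, here is a minimal sketch of a request object holding the three components described above. The class name `SafetyRequest`, its field names, and the use of a local image path are assumptions made for illustration; the post does not specify a request schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SafetyRequest:
    """One moderation request: everything judged together in a single pass."""
    user_prompt: str
    image_path: Optional[str] = None          # optional image (hypothetical: a local path)
    assistant_response: Optional[str] = None  # optional response judged with the prompt

    def parts(self) -> list[tuple[str, str]]:
        """Return the components that are present, in a fixed order."""
        present = [("user_prompt", self.user_prompt)]
        if self.image_path is not None:
            present.append(("image", self.image_path))
        if self.assistant_response is not None:
            present.append(("assistant_response", self.assistant_response))
        return present


# A prompt-only request versus a full prompt + image + response request.
prompt_only = SafetyRequest(user_prompt="Where is the nearest pharmacy?")
full = SafetyRequest(
    user_prompt="How do I get this without a prescription?",
    image_path="pharmacy.jpg",
    assistant_response="I can't help with that.",
)
print([name for name, _ in prompt_only.parts()])
print([name for name, _ in full.parts()])
```

Keeping all three components in one object mirrors the single-pass design: whatever serializes the request for the model sees the prompt, image, and response together rather than as three separate scoring calls.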

2. Global Language Coverage

Nemotron 3.5 maintains the 12-language explicit training coverage of its predecessors—English, French, Spanish, German, Chinese, Japanese, Korean, Arabic, Hindi, Russian, Portuguese, and Italian—while also inheriting strong zero-shot generalization across approximately 140 languages from the Gemma 3 base model. This means deployments in markets where training data is sparse (e.g., Southeast Asian languages, Scandinavian languages, less-resourced African languages) benefit from base-model multilingual transfer without requiring separate fine-tuning.

3. Custom Policy Enforcement

This is the most significant architectural addition in 3.5 relative to Nemotron 3. Production deployments rarely operate under a single universal safety taxonomy. A healthcare platform has a different risk profile than a financial services chatbot, a developer tools IDE, or a children's education app. Nemotron 3.5 accepts a custom policy specification alongside the input. The model reasons over that policy when producing its verdict rather than deferring entirely to the built-in taxonomy. This extends the work first introduced in Nemotron Content Safety Reasoning 4B to the full multimodal, multilingual setting.

4. Reasoning Traces (THINK Mode)

Every safety verdict in Nemotron 3.5 can be accompanied by an auditable reasoning trace via an optional THINK mode. When enabled, the model outputs its step-by-step reasoning before delivering a final safe / unsafe label and, optionally, the violated categories.

```
<think>
The user prompt asks for guidance on acquiring a controlled substance without a prescription.
The assistant response provides specific sourcing steps and references an online marketplace.
This interaction violates the Criminal Planning/Confessions and Controlled Substances categories.
The image (a pharmacy exterior) provides locational context but does not alter the verdict.
</think>

User Safety: unsafe
Response Safety: unsafe
Safety Categories: Criminal Planning/Confessions, Controlled Substances
```

When latency is the primary constraint, THINK mode can be disabled to return to the same low-latency binary verdict available in Nemotron 3.

5. Safety Dataset

With Nemotron 3.5, we are releasing our safety dataset. This is an important milestone since most OSS safety models don't generally provide the training or evaluation sets. This problem is worse for the multimodal space where artifacts such as images or videos are often derived from resources with restrictive licensing terms. The Nemotron 3.5 Content Safety Dataset is multimodal, multilingual, and includes safety reasoning traces that were used to train the model. These reasoning traces were generated in a 2-step manner to make them concise, similar to the Nemotron Content Safety Reasoning 4B model.

Model Architecture

Nemotron 3.5 Content Safety is built on Google Gemma 3 4B IT (4B parameters), providing a 128K context window, strong vision-language reasoning, and broad multilingual coverage. NVIDIA fine-tunes this base with a LoRA adapter that installs targeted safety classification behavior while keeping the model compact enough for real-time deployment on 8GB+ VRAM GPUs.

The inference interface supports three output modes:

Mode 1 — Low-latency binary verdict:

```
User Safety: safe
Response Safety: unsafe
```

Mode 2 — Binary verdict with categories:

```
User Safety: safe
Response Safety: unsafe
Safety Categories: Violence, Criminal Planning/Confessions
```

Mode 3 — THINK mode (reasoning + verdict):

```
<think>
[step-by-step reasoning trace]
</think>

User Safety: unsafe
Response Safety: unsafe
Safety Categories: [categories]
```
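All three modes share the same line-oriented layout, so one parser can handle them. The sketch below assumes the output looks exactly like the examples above (`Key: value` lines, an optional `<think>` block, comma-separated categories); the names `Verdict` and `parse_verdict` are illustrative and not part of any NVIDIA API.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Verdict:
    user_safety: Optional[str] = None
    response_safety: Optional[str] = None
    categories: list[str] = field(default_factory=list)
    reasoning: Optional[str] = None


def parse_verdict(text: str) -> Verdict:
    """Parse the 'Key: value' verdict lines and an optional <think> block."""
    verdict = Verdict()
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if think:
        verdict.reasoning = think.group(1).strip()
        text = text[think.end():]  # only read labels that follow the trace
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and anything without a colon
        key, value = key.strip().lower(), value.strip()
        if key == "user safety":
            verdict.user_safety = value
        elif key == "response safety":
            verdict.response_safety = value
        elif key == "safety categories":
            verdict.categories = [c.strip() for c in value.split(",") if c.strip()]
    return verdict


mode3 = parse_verdict(
    "<think>\nThe request seeks a controlled substance.\n</think>\n\n"
    "User Safety: unsafe\nResponse Safety: unsafe\n"
    "Safety Categories: Criminal Planning/Confessions, Controlled Substances"
)
print(mode3.categories)
```

A Mode 1 output parses with the same function; `categories` stays empty and `reasoning` stays `None`, which lets downstream code treat the mode as a property of the result rather than a separate code path.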

The safety taxonomy follows the Aegis 2.0 framework: 13 core categories aligned with the MLCommons safety taxonomy, plus 10 fine-grained subcategories. This alignment allows direct comparison with other open and closed guard systems benchmarked on Aegis-taxonomy datasets.

Reasoning

Reasoning is a supercharger for content safety classification because it provides the necessary context, customization, and accountability required for production AI systems, especially in enterprise and regulated environments.

Enables Custom and Contextual Policy Enforcement

Reasoning allows a content safety model to dynamically interpret and enforce custom, domain-specific policies defined in natural language at the time of inference. This is necessary because production deployments rarely operate under a single, universal safety taxonomy. A financial services chatbot has a different risk profile than a children's education app, which may have a lower tolerance for profanity. This capability supports:

Category Suppression: Disabling irrelevant categories, such as preventing a "violence" category trigger when a DevOps tool handles the phrase "terminate a process".

Custom Category Injection: Defining proprietary risk categories specific to an organization's regulatory or product policies.
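As a rough illustration of both capabilities, a natural-language policy could be assembled from a list of suppressed categories and a mapping of custom ones. The wording of the generated policy and the helper name `build_policy` are assumptions; the post does not define the policy format the model expects, so treat this as a sketch of the idea only.

```python
def build_policy(suppressed: list[str], custom: dict[str, str]) -> str:
    """Render a natural-language policy from suppressed and custom categories."""
    lines = ["Apply the default safety taxonomy with these changes."]
    for name in suppressed:
        # Category suppression: tell the model not to trigger on this category.
        lines.append(f"Do not flag content under the category '{name}'.")
    for name, definition in custom.items():
        # Custom category injection: add an organization-specific category.
        lines.append(f"Additional category '{name}': {definition}")
    return "\n".join(lines)


# Hypothetical DevOps deployment: "terminate a process" should not count as violence.
policy = build_policy(
    suppressed=["Violence"],
    custom={"Unreleased Product Info": "Mentions of unannounced product names or launch dates."},
)
print(policy)
```

Generating the policy text from structured inputs keeps it versionable and reviewable, which matters once the same policy is cited in audit logs.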

Provides Auditable and Documented Justification

The reasoning traces show the model's step-by-step logic before it delivers a final safe or unsafe verdict. This documented justification serves several purposes:

Compliance and Audit Logging: Regulated industries often require documented justifications for content moderation decisions.

Human Review: Reviewers can audit why a verdict was reached to identify systematic model errors.

Policy Iteration: The traces reveal how the model interprets edge cases, allowing teams to iteratively refine and improve custom policy language.
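One way to serve all three purposes is to store each decision as a structured record. The sketch below writes a JSON line per decision; the field names and the choice to hash the policy text are assumptions for illustration, not a format prescribed by the post.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(policy_text: str, reasoning: str, user_safety: str,
                 response_safety: str, categories: list[str]) -> str:
    """Serialize one moderation decision as a JSON line for an audit log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Hash the policy so reviewers can tell which policy version applied.
        "policy_sha256": hashlib.sha256(policy_text.encode("utf-8")).hexdigest(),
        "reasoning": reasoning,
        "user_safety": user_safety,
        "response_safety": response_safety,
        "categories": categories,
    }
    return json.dumps(record, ensure_ascii=False)


line = audit_record(
    policy_text="Do not flag content under the category 'Violence'.",
    reasoning="The phrase refers to stopping an operating system process.",
    user_safety="safe",
    response_safety="safe",
    categories=[],
)
print(line)
```

Recording the policy hash alongside the trace lets a reviewer separate model errors from policy-wording problems when the same input yields different verdicts across policy revisions.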

Latency

While reasoning can introduce latency, the Nemotron model addresses this by condensing reasoning chains into concise summaries, which limits output tokens and increases efficiency. This is done in a 2-step process similar to the one used for the predecessor model, Nemotron-Content-Safety-Reasoning-4B. In the first step, we use larger, more powerful models such as Qwen 397B to generate chain-of-thought reasoning traces from the provided prompts, images, and responses. We also provide the ground-truth labels of the samples so that misclassifications do not find their way into the reasoning traces. In the second step, we make these reasoning traces more concise using another large model such as Qwen 80B. We specifically instruct this model to rephrase the original traces (from step 1) so that they fit in no more than 3 sentences. Based on our experiments, most reasoning traces generated are under 3 sentences.
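The two-step process can be sketched as a small pipeline with the two large models passed in as plain callables. The prompt wording, the function names, and the stub models below are assumptions made so the example runs offline; they are not the prompts or models used to build the dataset.

```python
import re
from typing import Callable

MAX_SENTENCES = 3  # target length stated in the post


def count_sentences(text: str) -> int:
    """Rough sentence count: split after '.', '!' or '?' followed by whitespace."""
    return len([s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s])


def make_trace(sample: str, label: str,
               teacher: Callable[[str], str],
               condenser: Callable[[str], str]) -> str:
    """Step 1: label-conditioned long trace. Step 2: condensed rewrite."""
    long_trace = teacher(
        f"Sample:\n{sample}\nGround-truth label: {label}\n"
        "Explain step by step why this label is correct."
    )
    return condenser(
        f"Rephrase in no more than {MAX_SENTENCES} sentences:\n{long_trace}"
    )


# Stub callables stand in for the large models (hypothetical outputs).
def fake_teacher(prompt: str) -> str:
    return ("The prompt asks for X. The response gives Y. That is disallowed. "
            "So it is unsafe. The final answer follows.")


def fake_condenser(prompt: str) -> str:
    return "The prompt asks for X and the response gives Y. That is disallowed, so it is unsafe."


trace = make_trace("user: ... assistant: ...", "unsafe", fake_teacher, fake_condenser)
print(count_sentences(trace) <= MAX_SENTENCES)
```

Passing the ground-truth label into step 1 means the teacher explains a known answer instead of guessing one, and a length check like `count_sentences` can filter out condensed traces that overshoot the limit.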

Optimizing reasoning traces for brevity allows for low-latency custom policy enforcement. Furthermore, the reasoning traces provide a valuable training signal for specialized moderator models. Developers can choose a dual-mode operation, disabling reasoning for minimal latency in generic tasks or enabling it for complex policies.

Training Data

The dataset driving Nemotron 3.5 is an evolution of the multimodal, multilingual blends used for Nemotron 3, with additions targeting the reasoning and custom-policy capabilities. We used the following sources of data:

Multilingual text safety data from Nemotron Safety Guard Dataset v3, sampled from culturally nuanced subsets with proportional representation across safety categories and safe/unsafe splits.

Human-annotated multimodal data collected in English by NVIDIA, translated into 12 languages. Critically, 99% of training images are real photographs—not synthetic generations. This directly addresses a known weakness in the multimodal safety benchmark landscape, where existing datasets like VLGuard and MM-SafetyBench rely heavily on SDXL-generated images that lack the cultural texture and adversarial complexity of production content. While not all of these real images could be released due to licensing constraints, we are still able to release a subset of images from Wikimedia and synthetic generation.

Safe multimodal data from Nemotron VLM Dataset v2, covering scanned documents, charts, papers, and diagrams with associated queries—ensuring the model does not over-flag benign professional content.

Reasoning traces derived from chain-of-thought outputs produced by a larger teacher model (Qwen 397B) and then shortened using Qwen 80B, used to teach the model how to reason.

Topic-following data from the CantTalkAboutThis dataset, consisting of policy-specification/verdict pairs across a range of enterprise deployment scenarios (healthcare, finance, banking, education, etc.).

Synthetic data accounting for roughly 10% of total training volume, used primarily to diversify jailbreak patterns, generate rare policy violation examples, and produce multimodal adversarial cases.

Benchmarking

Nemotron 3.5 Content Safety was evaluated across multilingual, multimodal, and custom-policy safety benchmarks, including VLGuard, MM-SafetyBench, PolyGuard, RTP-LX, Aya Redteaming, XSafety, MultiJail, Aegis, Dynaguardrail, and CoSA. These evaluations reflect the core production challenge for enterprise safety: applying consistent guardrails across global languages, text and image inputs, and domain-specific policies without adding significant latency.

Nemotron 3 set a strong baseline with 84% average accuracy on multimodal harmful-content tests and roughly half the latency of LlamaGuard-4-12B. Nemotron 3.5 maintains that compact 4B efficiency while adding custom policy support and reasoning traces.

Across multilingual and multimodal safety benchmarks, Nemotron 3.5 delivers strong harmful-content classification accuracy while maintaining a compact footprint. This matters because many safety models remain English-first, text-only, or too costly to run repeatedly in production pipelines. Nemotron 3.5 is designed to combine multilingual coverage, multimodal classification, custom-policy support, and low-latency deployment in one model.

*Figure 1. Nemotron 3.5 Content Safety delivers strong harmful-content classification accuracy across multilingual and multimodal safety benchmarks, averaging about 85% across the evaluated benchmark set.*

The language-level results highlight why multilingual safety matters for global enterprise AI. On Multilingual Aegis, Nemotron 3.5 averages 96.5% harmful-content classification accuracy across 12 languages. On RTP-LX, it averages 88.8%, for a combined Aegis and RTP-LX average of 92.7%. This consistency helps teams apply the same safety posture across customer, employee, and partner-facing workflows instead of relying on English-only moderation or separate regional safety models.
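The combined figure is consistent with a simple unweighted mean of the two benchmark averages. The post does not state how the two were combined, so the equal weighting here is an assumption:

```latex
\frac{96.5 + 88.8}{2} = \frac{185.3}{2} = 92.65 \approx 92.7
```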

*Figure 2. Nemotron 3.5 Content Safety averages 97% harmful-content classification accuracy on Multilingual Aegis Cultural + Adapted (prompt classification) (harmful-f1) across 12 languages.*

*Figure 3. Nemotron 3.5 Content Safety averages 89% harmful-content classification accuracy on RTP-LX (prompt classification) (harmful-f1) across 12 languages.*

Accuracy alone is not enough for production guardrails. Safety models must also be efficient enough to run before content is processed, returned, or routed downstream. Nemotron 3.5 Content Safety's compact 4B design helps reduce the cost and latency of repeated safety checks, making multilingual and multimodal guardrails practical for real-world AI applications.

Latency

The latency profile is unchanged from Nemotron 3 in the default (no THINK) mode. THINK mode adds inference time proportional to trace length, but this overhead is predictable and can be budgeted separately from the synchronous moderation loop—for instance, by running THINK-mode evaluation asynchronously as part of an audit pipeline while the default mode handles real-time decisions.
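The split between a synchronous fast path and an asynchronous audit path can be sketched with `asyncio`. Both model calls below are stand-ins (a keyword rule and a short sleep), and every function name is hypothetical; the point is only the scheduling pattern.

```python
import asyncio

audit_log: list[dict] = []


async def fast_verdict(text: str) -> str:
    """Stand-in for the default (no THINK) mode; hypothetical rule for the demo."""
    return "unsafe" if "attack" in text.lower() else "safe"


async def think_audit(text: str, verdict: str) -> None:
    """Stand-in for a slower THINK-mode call that records a reasoning trace."""
    await asyncio.sleep(0.01)  # simulates the extra reasoning latency
    audit_log.append({"text": text, "verdict": verdict, "trace": "[reasoning trace]"})


async def moderate(text: str, background: set) -> str:
    verdict = await fast_verdict(text)                       # on the request path
    task = asyncio.create_task(think_audit(text, verdict))   # off the request path
    background.add(task)                                     # keep a reference until done
    task.add_done_callback(background.discard)
    return verdict


async def main() -> list[str]:
    background: set = set()
    verdicts = [await moderate(t, background) for t in ["hello", "plan an attack"]]
    await asyncio.gather(*background)  # drain pending audits before shutdown
    return verdicts


verdicts = asyncio.run(main())
print(verdicts, len(audit_log))
```

The caller gets the verdict as soon as the fast path returns, while the audit task finishes later. Holding task references in a set prevents pending audits from being garbage-collected before they complete.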

*Figure 4. Nemotron 3.5 Content Safety achieves 3x lower end-to-end latency on a multimodal benchmark compared to an alternative multimodal safety model.*

Compared to another reasoning safety model, our model generated up to 50% fewer tokens when reasoning is enabled, making it efficient in terms of cost and latency.

Addressing the Benchmark Gap

A recurring theme in multimodal safety research is the set of gaps in existing evaluation infrastructure. Nemotron 3.5's development encountered the same gaps documented in the broader literature:

Text-only coverage: The most widely cited safety benchmarks (WildGuard, XSTest, HarmBench) are text-only. Multimodal performance cannot be inferred from text-benchmark results.

Synthetic image quality: Most existing multimodal benchmarks use AI-generated images (typically SDXL) rather than real photographs, understating the difficulty of real production content.

Real-image licensing: Stock photo licenses prohibit redistribution in AI datasets, creating a structural gap between research and production conditions.

NVIDIA's multimodal training data—with real images and culturally nuanced multilingual prompts—is designed to fill some of these gaps for model training. The benchmark gap for evaluation remains an open problem for the broader safety research community.

Getting Started

Nemotron 3.5 Content Safety is available on Hugging Face under the NVIDIA Open Model License for research and commercial use, along with the training dataset. It supports transformers, vLLM, and SGLang, and is available as a production-grade NVIDIA NIM on build.nvidia.com for teams that need a pre-packaged, GPU-optimized inference microservice.

Developers can also access the model through inference platforms including Baseten, Eigen AI, DeepInfra, OpenRouter, and Vultr.

For custom policy workflows, NVIDIA provides a Claude- and Codex-compatible skill for generating custom policies, along with cookbooks showing how to use the model. Custom policies and reasoning traces help teams adapt safety behavior to domain-specific rules while keeping decisions auditable.

この記事をシェア

MarkTechPost重要度42026年7月21日 06:14

アリババTongyi Labが多言語TTSモデル「Qwen-Audio-3.0-TTS」発表

MarkTechPost重要度42026年7月20日 06:42

アリババ、2.4兆パラメータ多モーダルモデル「Qwen3.8-Max」を公開

KDnuggets重要度42026年7月19日 18:50

EU AI 法、AI システムは高リスクか？

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Hugging Face Blog·2026年6月5日 03:57·約18分

Nemotron 3.5 コンテンツセーフティ：グローバル企業向けカスタマイズ可能なマルチモーダル安全性

#Nemotron #コンテンツセーフティ #マルチモーダル #エンタープライズ AI #コンプライアンス

TL;DR

AI深層分析2026年6月11日 00:18

重要/ 5段階

深度40%

キーポイント

カスタマイズ可能な多モーダル安全性の実現

グローバルエンタープライズ向けのコンプライアンス対応

地域ごとの規制要件や業界特有のリスク基準に合わせてモデルを微調整できるため、大規模組織における安全な AI 導入を支援する。

Hugging Face エコシステムでの公開と統合

NVIDIA の最新技術を Hugging Face で直接利用可能にし、開発者が容易にモデルを取得・デプロイできる環境を整備した。

安全データセットの公開

Gemma 3 ベースのアーキテクチャ

柔軟な推論モード

カスタマイズ可能かつ監査可能な推論機能

重要な引用

Customizable Multimodal Safety for Global Enterprise AI

Tailor safety filters to your specific organizational policies and regional compliance requirements.

With Nemotron 3.5, we are releasing our safety dataset. This is an important milestone since most OSS safety models don't generally provide the training or evaluation sets.

NVIDIA fine-tunes this base with a LoRA adapter that installs targeted safety classification behavior while keeping the model compact enough for real-time deployment on 8GB+ VRAM GPUs.

「99% of training images are real photographs—not synthetic generations. This directly addresses a known weakness in the multimodal safety benchmark landscape...」

「Reasoning allows a content safety model to dynamically interpret and enforce custom, domain-specific policies defined in natural language at the time of inference.」

影響分析・編集コメントを表示

影響分析

編集コメント

記事一覧に戻る

Nemotron 3.5 コンテンツセーフティの新機能

統合型マルチモーダル評価
グローバル言語対応
カスタムポリシーの強制適用
推論トレース（THINK モード）
セーフティデータセット

モデルアーキテクチャ

推論

トレーニングデータ

ベンチマーク

レイテンシ

ベンチマークギャップへの対応

はじめに

Nemotron 3.5 コンテンツセーフティの新機能

1. 統合型マルチモーダル評価

2. グローバル言語対応

3. カスタムポリシーの実行

{"translation": "翻訳全文"}

4. 推論トレース (THINK モード)

ユーザー安全性：不安全

レスポンス安全性：不安全

安全カテゴリ：犯罪計画/自白、規制薬物

5. セーフティデータセット

モデルアーキテクチャ

推論インターフェースは 3 つの出力モードをサポートします：

モード 1 — 低遅延二値判定：

ユーザー安全性：安全

レスポンス安全性：不安全

モード2 — カテゴリ付きの二値判定:

ユーザー安全性：安全

レスポンス安全性：不安全

安全性カテゴリ：暴力、犯罪計画/自白

モード3 — THINK モード（推論＋判定）:

ユーザー安全性：不安全

レスポンス安全性：不安全

安全性カテゴリ：[カテゴリ]

推論

カスタムおよび文脈に応じたポリシー適用を可能にする

カテゴリの抑制：「プロセスを停止する」というフレーズをDevOpsツールが扱う際に、「暴力」カテゴリのトリガーを防ぐなど、無関係なカテゴリを無効化すること。

カスタムカテゴリの注入：組織の規制や製品ポリシーに固有のリスクカテゴリを定義する機能。

監査可能かつ文書化された根拠の提供

コンプライアンスと監査ログ：規制業界では、コンテンツモデレーションの決定に対する文書化された根拠を要求されることが多くあります。
人的レビュー：レビュアーは判断に至った理由を検証し、モデルに系統的なエラーがないかを確認できます。
ポリシーの反復改善：トレースからは、モデルが境界ケースをどのように解釈しているかが明らかになり、チームはカスタムポリシーの文言を反復的に精緻化・改善することが可能になります。

レイテンシ

トレーニングデータ

Nemotron Safety Guard Dataset v3 から取得した多言語テキスト安全性データ：文化背景に配慮したサブセットからサンプリングされ、各安全性カテゴリおよび安全/不安全の分割において比例代表が確保されています。

NVIDIA によって英語で収集・人間アノテーションされた多様性モダリティデータ：これを 12 ヶ国語に翻訳しました。特に重要なのは、トレーニング画像の 99% が合成生成物ではなく実写の写真である点です。これは、既存の VLGuard や MM-SafetyBench などのデータセットが SDXL で生成された画像に過度に依存し、本番環境のコンテンツに見られる文化的な文脈や敵対的な複雑さを欠いているという、多様性モダリティ安全性ベンチマーク分野における既知の弱点に対処するものです。ライセンス制約のためすべての実写画像を公開することはできませんが、Wikimedia からの画像と合成生成物の一部は公開可能です。

Nemotron VLM Dataset v2 から取得した安全な多様性モダリティデータ：スキャンされた文書、チャート、論文、図表および関連するクエリを含み、モデルが benign な専門コンテンツを過剰に検出しないように保証します。

大規模な教師モデル（Qwen 397B）によって生成された思考連鎖（chain-of-thought）出力から導き出され、その後 Qwen 80B を用いて短縮された推論トレース：これらはモデルに推論方法を教えるために使用されます。

CantTalkAboutThis データセットから得たデータに基づくトピック追跡。このデータセットは、ヘルスケア、金融、銀行、教育など多様な企業導入シナリオにわたるポリシー仕様と判定のペアで構成されています。

合成データは全トレーニングボリュームの約 10% を占め、主に jailbreak パターンの多様化、稀なポリシー違反事例の生成、およびマルチモーダル敵対的ケースの作成に使用されます。

ベンチマーク

*Figure 3. Nemotron 3.5 Content Safety は、12 か国における RTPLX（プロンプト分類）（有害 F1 スコア）で平均 89% の有害コンテンツ分類精度を達成します。*

レイテンシ

ベンチマークギャップへの対応

テキストのみのカバレッジ：最も広く引用される安全性ベンチマーク（WildGuard, XSTest, HarmBench）はテキスト専用です。マルチモーダル性能をテキストベンチマークの結果から推測することはできません。

合成画像の品質：現在存在する多モーダルベンチマークの多くは、AI で生成された画像（通常は SDXL）を使用しており、実際の写真よりも生産現場におけるコンテンツの難易度を過小評価しています。

実写画像のライセンス：ストックフォトのライセンスでは AI データセットへの再配布が禁止されており、研究環境と本番環境の間に構造的なギャップが生じています。

Getting Started（はじめに）

開発者は、Baseten、Eigen AI、DeepInfra、OpenRouter、および Vultr を含む推論プラットフォームを通じて、本モデルにもアクセスできます。

原文を表示

Back to Articles

What's New in Nemotron 3.5 Content Safety 1. Unified Multimodal Evaluation
2. Global Language Coverage
3. Custom Policy Enforcement
4. Reasoning Traces (THINK Mode)
5. Safety Dataset

Model Architecture
Reasoning
Training Data
Benchmarking
Latency
Addressing the Benchmark Gap
Getting Started

This post covers what changes in 3.5, the design decisions behind each new capability, and how to integrate the model into production safety pipelines.

What's New in Nemotron 3.5 Content Safety

1. Unified Multimodal Evaluation

2. Global Language Coverage

3. Custom Policy Enforcement

4. Reasoning Traces (THINK Mode)

code

<think>
The user prompt asks for guidance on acquiring a controlled substance without a prescription.
The assistant response provides specific sourcing steps and references an online marketplace.
This interaction violates the Criminal Planning/Confessions and Controlled Substances categories.
The image (a pharmacy exterior) provides locational context but does not alter the verdict.
</think>

User Safety: unsafe
Response Safety: unsafe
Safety Categories: Criminal Planning/Confessions, Controlled Substances

When latency is the primary constraint, THINK mode can be disabled to return to the same low-latency binary verdict available in Nemotron 3.

5. Safety Dataset

Model Architecture

The inference interface supports three output modes:

Mode 1 — Low-latency binary verdict:

code

User Safety: safe
Response Safety: unsafe

Mode 2 — Binary verdict with categories:

code

User Safety: safe
Response Safety: unsafe
Safety Categories: Violence, Criminal Planning/Confessions

Mode 3 — THINK mode (reasoning + verdict):

<think>
[step-by-step reasoning trace]
</think>

User Safety: unsafe
Response Safety: unsafe
Safety Categories: [categories]
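
The three output modes share the same line-oriented layout, so a single parser can handle all of them. The following is a minimal sketch, assuming the model returns exactly the text shown above; the `SafetyVerdict` class and the `parse_verdict` function are hypothetical names introduced for illustration and are not part of any NVIDIA or Hugging Face API.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SafetyVerdict:
    """Structured view of one model output (hypothetical helper type)."""
    user_safety: Optional[str] = None
    response_safety: Optional[str] = None
    categories: List[str] = field(default_factory=list)
    reasoning: Optional[str] = None


def parse_verdict(raw: str) -> SafetyVerdict:
    """Parse Mode 1, Mode 2, or Mode 3 output into a SafetyVerdict."""
    verdict = SafetyVerdict()

    # Mode 3 only: extract the reasoning trace, then remove it so that
    # text inside <think>...</think> is never mistaken for a verdict line.
    think = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if think:
        verdict.reasoning = think.group(1).strip()
        raw = raw[:think.start()] + raw[think.end():]

    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key, value = key.strip().lower(), value.strip()
        if key == "user safety":
            verdict.user_safety = value.lower()
        elif key == "response safety":
            verdict.response_safety = value.lower()
        elif key == "safety categories":
            # Categories are comma-separated; a category may contain "/".
            verdict.categories = [c.strip() for c in value.split(",") if c.strip()]
    return verdict
```

Fields that a given mode does not emit stay at their defaults (`None` or an empty list), so downstream code can treat all three modes uniformly.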

Reasoning

Enables Custom and Contextual Policy Enforcement

Category Suppression: Disabling irrelevant categories, such as preventing a "violence" category trigger when a DevOps tool handles the phrase "terminate a process".

Custom Category Injection: Defining proprietary risk categories specific to an organization's regulatory or product policies.
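
Category suppression and custom category injection can both be expressed as edits to a natural-language policy that is supplied at inference time. The sketch below assembles such a policy string. The wording of the policy, the `build_policy` function, and the example custom category are all assumptions made for illustration; the source does not specify the exact format the model expects, and `DEFAULT_CATEGORIES` lists only the category names that appear in this post, not the full taxonomy.

```python
from typing import Dict, Iterable, List, Optional

# Only the category names mentioned in this post; not the full taxonomy.
DEFAULT_CATEGORIES: List[str] = [
    "Violence",
    "Criminal Planning/Confessions",
    "Controlled Substances",
]


def build_policy(
    base: Iterable[str],
    suppress: Iterable[str] = (),
    custom: Optional[Dict[str, str]] = None,
) -> str:
    """Return a plain-text policy with some categories suppressed or added."""
    base = list(base)
    suppressed = {name.lower() for name in suppress}

    lines = ["Classify the interaction against these categories only:"]
    # Category suppression: leave suppressed names out of the active list.
    lines += [f"- {name}" for name in base if name.lower() not in suppressed]
    # Custom category injection: append organization-specific categories.
    for name, description in (custom or {}).items():
        lines.append(f"- {name}: {description}")

    skipped = [name for name in base if name.lower() in suppressed]
    if skipped:
        lines.append("Do not flag content under: " + ", ".join(skipped) + ".")
    return "\n".join(lines)


# Hypothetical DevOps deployment: "terminate a process" should not be
# treated as violence, and one organization-specific category is added.
policy = build_policy(
    DEFAULT_CATEGORIES,
    suppress=["Violence"],
    custom={"Credential Exposure": "Secrets, API keys, or passwords in plain text."},
)
```

Keeping the policy as generated text rather than hard-coded strings makes it easier to version and to diff when the policy language is revised.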

Provides Auditable and Documented Justification

The reasoning traces show the model's step-by-step logic before it delivers a final safe or unsafe verdict. This documented justification serves several purposes:

Compliance and Audit Logging: Regulated industries often require documented justifications for content moderation decisions.

Human Review: Reviewers can audit why a verdict was reached to identify systematic model errors.

Policy Iteration: The traces reveal how the model interprets edge cases, allowing teams to iteratively refine and improve custom policy language.
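
For compliance and audit logging, each verdict can be stored alongside its reasoning trace and the policy version that produced it. The following is a minimal sketch of one such log record; the field names, the `audit_record` function, and the choice to store a SHA-256 hash instead of the raw content are illustrative assumptions, not a format defined by the source.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Iterable, Optional


def audit_record(
    content: str,
    verdict: str,
    categories: Iterable[str],
    reasoning: Optional[str],
    policy_version: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one moderation decision as a JSON line for an audit log."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "timestamp": timestamp,
        # Tie the decision to the exact policy text in force at the time.
        "policy_version": policy_version,
        # Hash rather than store the content, so the log itself does not
        # retain potentially harmful or sensitive material.
        "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "verdict": verdict,
        "categories": list(categories),
        # The THINK-mode trace; None when reasoning is disabled.
        "reasoning": reasoning,
    }
    return json.dumps(record, sort_keys=True)
```

Recording the policy version with every decision lets reviewers separate model errors from policy-wording problems when they revisit a verdict.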

Training Data

Multilingual text safety data from Nemotron Safety Guard Dataset v3, sampled from culturally nuanced subsets with proportional representation across safety categories and safe/unsafe splits.

Human-annotated multimodal data collected in English by NVIDIA, translated into 12 languages. Critically, 99% of training images are real photographs—not synthetic generations. This directly addresses a known weakness in the multimodal safety benchmark landscape, where existing datasets like VLGuard and MM-SafetyBench rely heavily on SDXL-generated images that lack the cultural texture and adversarial complexity of production content. While not all of these real images could be released due to licensing constraints, we are still able to release a subset of images from Wikimedia and synthetic generation.

Safe multimodal data from Nemotron VLM Dataset v2, covering scanned documents, charts, papers, and diagrams with associated queries—ensuring the model does not over-flag benign professional content.

Reasoning traces, derived from chain-of-thought outputs produced by a larger teacher model (Qwen 397B) and then shortened using Qwen 80B, are used to teach the model how to reason.

Topic following data from the CantTalkAboutThis dataset consisting of policy-specification/verdict pairs across a range of enterprise deployment scenarios (healthcare, finance, banking, education, etc.).

Synthetic data accounting for roughly 10% of total training volume, used primarily to diversify jailbreak patterns, generate rare policy violation examples, and produce multimodal adversarial cases.

Benchmarking

*Figure 2. Nemotron 3.5 Content Safety averages a harmful-F1 score of 97% on Multilingual Aegis Cultural + Adapted (prompt classification) across 12 languages.*

*Figure 3. Nemotron 3.5 Content Safety averages a harmful-F1 score of 89% on RTPLX (prompt classification) across 12 languages.*

Latency

*Figure 4. Nemotron 3.5 Content Safety achieves 3x lower end-to-end latency on a multimodal benchmark compared to an alternative multimodal safety model.*

Compared to another reasoning safety model, our model generated up to 50% fewer tokens when reasoning is enabled, making it efficient in terms of cost and latency.

Addressing the Benchmark Gap

A recurring theme in multimodal safety research is that existing evaluation infrastructure has gaps. Nemotron 3.5's development encountered the same gaps documented in the broader literature:

Text-only coverage: The most widely cited safety benchmarks (WildGuard, XSTest, HarmBench) are text-only. Multimodal performance cannot be inferred from text-benchmark results.

Synthetic image quality: Most existing multimodal benchmarks use AI-generated images (typically SDXL) rather than real photographs, which understates the difficulty of real production content.

Real-image licensing: Stock photo licenses prohibit redistribution in AI datasets, creating a structural gap between research and production conditions.

Getting Started

Developers can also access the model through inference platforms including Baseten, Eigen AI, DeepInfra, OpenRouter, and Vultr.
