Amazon Science · May 5, 2026, 00:07 · approx. 18 min read

Building trustworthiness into AI

#Responsible AI #LLM #Pretraining #Amazon AGI #Model Safety

TL;DR

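Drawing on responsible-AI (RAI) expertise accumulated well before the generative-AI boom, Amazon takes a systematic approach that builds risk management into every stage of model development, and it has built more than 70 internal and external RAI tools.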


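Key points

Responsible AI (RAI) implemented from the start of design

Amazon treats RAI not as a bolt-on feature but as something built in from day one of product design, and it has carried the lessons learned from Alexa over to its AGI organization.

Risk management across the entire model development lifecycle

Amazon has built an RAI pipeline that addresses the distinct challenges of four phases: pretraining, post-training, evaluation, and third-party monitoring.

Strengthening safety data at the pretraining stage

In addition to public data, the model is taught foundational concepts with diverse, purpose-built datasets (a "rich diet") designed to instill principles of safety, security, and fairness.

Building adaptability through three approaches

Trustworthy AI rests on three pillars: anticipating risks before they materialize, teaching models to navigate ambiguity, and building systems that adapt to social changes such as government transitions and new regulations.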


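Impact analysis

This article makes clear that developing large-scale AI models requires going beyond mere compliance and building trustworthiness into the core of the technical process. In particular, its focus on data design at the pretraining stage and on adaptability to social change is likely to be influential as a practical response to the "black box" and "regulatory lag" problems facing the whole industry.

Editorial comment

With generative-AI safety under debate, it is instructive that Amazon builds responsibility in from the start of design rather than bolting it on afterward. Its emphasis on the role of training data at the pretraining stage should be a useful guide for developers.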

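At Amazon, AI now touches everything from warehouse logistics to customer service chatbots to AWS cloud services used by thousands of enterprises, making it a business-critical technology. It is therefore imperative that the models Amazon develops and deploys are as safe, fair, and robust as possible: responsible AI (RAI) is not an optional add-on but an essential element. As Rahul Gupta, senior science manager and RAI lead for Amazon's Artificial General Intelligence (AGI) organization, puts it, "Responsibility is baked into the product design from day one."

Amazon's commitment to safety and responsibility goes back long before the generative-AI boom. Gupta and researchers on his team worked in the Alexa AI organization, where the company "developed some muscle on defining how RAI should be done." The focus, he recalls, was on developing RAI policies and implementations as well as methods to evaluate their effectiveness. As Amazon began building its own large models, the RAI expertise from Alexa proved a valuable resource.

In concert with Amazon's policy team, AGI scientists have built an RAI pipeline that addresses four phases of model development: pretraining, post-training, evaluation, and third-party monitoring. At each stage, researchers grapple with distinct challenges to ensure that trustworthy systems can adapt, at scale, across situations, applications, and geographies. From this framework, Amazon has built over 70 internal and external RAI tools, funded or published more than 500 research papers, and delivered tens of thousands of hours of RAI-focused training to its employees.

Amazon has a three-pronged approach to RAI: anticipate risks before they materialize, teach models to navigate ambiguity, and build systems that can adapt to government transitions, high-profile incidents, new regulations, and other social changes. Below are some of the scientists across Amazon's responsible-AI and policy teams who put this approach into practice, each tackling a different phase of the AI lifecycle.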

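Teaching foundations: Pretraining

Chentao Ye is a senior applied scientist on the AGI RAI team, working on pretraining, the earliest stage of large language model (LLM) training, where the model develops general linguistic competence. It has become increasingly critical to address RAI at this stage, says Ye, to ensure that the model has the information necessary to adapt to policies established by Amazon's policy team. "Pretraining is the stage where we teach our most fundamental concepts of RAI," Ye says. "It's like teaching a child about the world before we expect them to make some decisions." Pretraining typically involves large volumes of public data, but the RAI team augments that data with datasets specifically designed to instill principles of safety, security, and fairness. Those datasets are vast and diverse, a "rich diet" of content that includes internal and public RAI guidance, best practices, RAI-related news and incidents, and information about domains such as chemical and nuclear engineering and coding security, across text, audio, and images. The corpus also includes information in different languages and from different cultures, to ensure the model is global and multilingual.

To help the model better incorporate this array of information, researchers create training tasks, also known as learning exercises. "Having this data isn't enough. We need to help the model process and understand it effectively," Ye says. For instance, Ye and his colleagues might take a policy document about privacy and convert it into multiple learning exercises: explaining privacy concepts, answering questions about compliance, and determining whether certain actions would violate privacy guidelines. These varied tasks help the model develop a deeper, more nuanced understanding of RAI principles.

Another active area of research is how to handle potentially harmful content in the training corpus. "It's not simply about filtering everything out," Ye explains. "If a model has never encountered certain harmful concepts during pretraining, it won't recognize them as sensitive, making post-training guardrails less effective." The team is exploring approaches that add educational context to certain filtered content before reintroducing it, teaching the model what harm looks like and why it should be avoided rather than leaving it entirely unaware.

In addition to RAI acquisition, another area of focus is what is called RAI modality alignment. LLMs need to understand how to apply RAI principles across all the modalities they encounter (text, images, audio, and so on). Modality alignment maps other modalities into a semantic space they share with text, which is often the more readily available modality, Ye explains. For example, a college textbook might include figures of high-risk chemical, biological, radiological, and nuclear (CBRN) materials along with text descriptions of the same concepts. The team designs a range of LLM tasks that effectively encode the data into the same space.

Another active research area, says Ye, is developing a variety of techniques to test for pretraining quality. The team is taking two complementary approaches. The first tests whether the model has actually acquired RAI knowledge during pretraining. "We use metrics like perplexity," which quantifies how well a probability distribution predicts a given sample, "to measure how well the model can generate content in specific RAI domains," Ye explains. The second approach tests the way the model responds to sparse questions that might appear in later testing exercises, where the expected responses, such as refusals or deflections, were not explicitly taught during pretraining. "This helps us test whether the RAI knowledge it gained during pretraining enables it to generalize to real-world scenarios with just limited examples or instructions," Ye says.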
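As a rough illustration of the perplexity metric mentioned here: perplexity is the exponential of the average negative log-probability a model assigns to each token of a text, so lower values mean the text is better predicted. The sketch below assumes the per-token probabilities have already been obtained from a model. The function name and the sample values are hypothetical and are not taken from Amazon's tooling.

```python
import math

def perplexity(token_probs):
    """Return exp of the mean negative log-probability of the tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Hypothetical probabilities a model assigned to each token of an RAI-domain text.
well_predicted = [0.5, 0.5, 0.5, 0.5]
poorly_predicted = [0.1, 0.1, 0.1, 0.1]

print(perplexity(well_predicted))    # lower perplexity
print(perplexity(poorly_predicted))  # higher perplexity
```

Comparing perplexity on held-out text from a specific RAI domain before and after adding domain data is one plausible way to use such a metric; the article does not describe the exact protocol.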

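Post-training: Reinforcement learning from human feedback (RLHF)

Once models learn to follow instructions and produce both helpful and harmless responses, they advance to reinforcement learning from human feedback (RLHF). Senior applied scientist Charith Peris, who leads this phase of model development, and applied scientist Yao Ma explain that RLHF focuses on using human feedback or preference comparisons to give models a sense of judgment. "RLHF is done to make sure the foundation model aligns with the behavior expected by humans," says Peris. This stage of training provides the model with a reward based on how well its response to a query meets a predetermined criterion. The rewards are provided by various response verification systems. One approach uses so-called auxiliary-reward models, which are trained on outputs that humans have ranked. For responsible AI, this stage offers the ability to optimize the model to generate responses that are "policy adherent," following the rules and guidelines devised by Amazon's policy team. "Providing the right rewards is a critical part of RLHF," says Ma.

In one case, the core model itself is used to generate multiple responses to a range of unsafe and borderline-safe queries. These responses are ranked and rated by humans based on their helpfulness and policy adherence and then used to train auxiliary-reward models. Another response verification approach uses an independent LLM as a judge. The model generates a response for each prompt in the training set, and this response, together with a set of rubrics describing what makes a response policy adherent, is passed to the judge. The judge is then instructed to provide a score based on how well the response aligns with the rubrics. The auxiliary-reward models and the judge-based systems can be used individually or in combination to provide RLHF rewards.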
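The article does not say how the two reward sources are combined. As a minimal sketch, assuming both produce scores in [0, 1] and are blended linearly, the combination could look like the following. The function names, the rubric wording, and the 50/50 weighting are all hypothetical.

```python
def judge_score_from_rubrics(rubric_results):
    """Fraction of rubric criteria that the judge marked as satisfied."""
    if not rubric_results:
        raise ValueError("need at least one rubric criterion")
    return sum(rubric_results.values()) / len(rubric_results)

def combined_reward(reward_model_score, judge_score, weight=0.5):
    """Linear blend of two scores in [0, 1]; weight is the reward model's share."""
    for value in (reward_model_score, judge_score, weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores and weight must lie in [0, 1]")
    return weight * reward_model_score + (1.0 - weight) * judge_score

# Hypothetical judge verdicts for one response to a borderline query.
rubric_results = {
    "declines the unsafe part of the request": True,
    "offers a safe alternative": True,
    "explains the refusal briefly": True,
    "avoids unnecessary lecturing": False,
}
judge = judge_score_from_rubrics(rubric_results)
print(combined_reward(reward_model_score=0.25, judge_score=judge))
```

Setting `weight` to 1.0 or 0.0 recovers the "used individually" case for the reward model or the judge, respectively.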

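The model is evaluated in two phases: during and after training. In the first phase, the model is tested at frequent, short intervals using lightweight benchmarks that provide directional signals on performance across critical capabilities. In the second phase, saved checkpoints, each a complete snapshot of the model's state and parameters at a given point in training, are systematically evaluated against a broader set of test data to identify which checkpoint achieved the best overall performance.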
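The second phase amounts to scoring each saved checkpoint on several benchmarks and picking the best one. A minimal sketch, assuming "best overall performance" means the highest unweighted mean score (the article does not define the aggregation), with hypothetical checkpoint names and scores:

```python
from statistics import mean

def best_checkpoint(results):
    """results maps checkpoint name -> {benchmark name: score}; higher is better."""
    if not results:
        raise ValueError("no checkpoints to compare")
    return max(results, key=lambda name: mean(results[name].values()))

# Hypothetical post-training evaluation results.
results = {
    "step_1000": {"helpfulness": 0.70, "policy_adherence": 0.80},
    "step_2000": {"helpfulness": 0.78, "policy_adherence": 0.86},
    "step_3000": {"helpfulness": 0.80, "policy_adherence": 0.74},
}
print(best_checkpoint(results))
```

In practice a team might weight safety-critical benchmarks more heavily or require a minimum score on each one; the unweighted mean is only the simplest choice.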

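Behavior in check: Evaluations

A major focus of the evaluations team is to build model-breaking datasets: robust collections of prompts that trigger inappropriate, unsafe, or policy-violating responses. "We know models are improving month over month," says Jwala Dhamala, a senior scientist with Amazon AGI. Bigger, better responsible-AI datasets are playing a large part in this, she says, as are improved mechanisms to capture how well the models incorporate responsible-AI principles across multiple modalities and regions. Working closely with Amazon's policy team, Dhamala says, is key to developing evaluations for RAI.

Amazon's RAI work has eight pillars: privacy and security; safety; fairness; veracity and robustness; explainability; controllability; governance; and transparency. "For each pillar, we focus on tests that could lead the model to output something that violates responsible-AI policies. Simultaneously, we focus on testing if a model is refusing excessively or refusing to respond to benign requests," Dhamala explains.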
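The two-sided testing Dhamala describes can be summarized as two rates: how often adversarial prompts produce a policy violation, and how often benign prompts are refused. The sketch below assumes each evaluation record has already been labeled; the record format and label names are hypothetical.

```python
def evaluation_rates(records):
    """Return (violation_rate, over_refusal_rate).

    Each record is a dict with 'kind' ('adversarial' or 'benign') and
    'outcome' ('complied', 'refused', or 'violated').
    """
    adversarial = [r for r in records if r["kind"] == "adversarial"]
    benign = [r for r in records if r["kind"] == "benign"]
    if not adversarial or not benign:
        raise ValueError("need at least one adversarial and one benign record")
    violation_rate = sum(r["outcome"] == "violated" for r in adversarial) / len(adversarial)
    over_refusal_rate = sum(r["outcome"] == "refused" for r in benign) / len(benign)
    return violation_rate, over_refusal_rate

# Hypothetical labeled results for one pillar.
records = (
    [{"kind": "adversarial", "outcome": "refused"}] * 3
    + [{"kind": "adversarial", "outcome": "violated"}]
    + [{"kind": "benign", "outcome": "complied"}] * 2
    + [{"kind": "benign", "outcome": "refused"}] * 2
)
print(evaluation_rates(records))
```

Tracking both numbers together matters because a model can drive the violation rate to zero simply by refusing everything, which the over-refusal rate would expose.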

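The data comes from everywhere: human experts known as red teamers who try to break models, external security partners, public benchmarks from universities, and even social media, where real-world problems surface organically. The RAI team evaluates models throughout the model-training and deployment cycle, Dhamala explains, from pretraining to post-training and predeployment, when all scaffolding is attached. Each stage has its own specially designed evaluation processes, and more testing happens in the later stages, when the model is closer to end users. "We collect datasets, evaluate, then collect new datasets, evaluate again," Dhamala says. She adds that the team is currently working to automate more of the evaluation process. It is also pushing into newer areas of research.

Deception in conversations that require many back-and-forth interactions over weeks or months (also called long-horizon interactions) is emerging as a concern, but there are not many established benchmarks for detecting it. Creating them requires an understanding of what deception means across different long-horizon contexts, an understanding grounded in social-science research.

Another open area of research is an automatic red-teaming framework to evaluate emerging responsible-AI risks. The idea is that an autonomous agent or a system of agents would compete or collaborate in attempts to provoke undesired behaviors.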

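Third-party collaborations: Frontier risks

While most RAI work addresses common misuse patterns, Tong Wang, a senior applied scientist with AGI, focuses on a different category of risk: frontier risks, or "systemic risks that could take down entire systems." These include the use of AI models to research CBRN (chemical, biological, radiological, and nuclear) attacks and to research or launch cyberattacks. These are scenarios where AI capabilities could enable nonexperts to cause catastrophic harm.

The evaluation process for frontier risks is exacting. First, automated benchmarks test whether the model has acquired dangerous knowledge. If it passes certain thresholds, for example by answering questions about weapons of mass destruction with concerning accuracy, that triggers human review. Third-party experts in relevant domains evaluate whether the model has crossed safety boundaries. And the process is ongoing: with each model update, the team compares the new model's capabilities against those of earlier models. "We have to be very careful," Wang says. "False positives and false negatives both have costs."

With public models, identified risks are mitigated by guardrails: when a person asks about a particular topic at a particular level of specificity, the model simply will not respond. But legitimate researchers, such as scientists at universities and labs with relevant expertise and appropriate oversight, may need access to restricted information for their work. Wang's team is exploring mechanisms to provide "specialized access with heavy monitoring" for these trusted users.

Those mechanisms involve what Wang calls "configurability": using techniques like low-rank adaptation (LoRA) to make surgical changes to a model's behavior for specific use cases, without retraining the entire model. "We add configuration on top that doesn't touch the base model itself," he says. "You're not retraining a billion parameters, just a few."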
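To make the parameter savings concrete: LoRA keeps a weight matrix W of shape d_out x d_in frozen and learns a low-rank update B A, where B is d_out x r and A is r x d_in, so only r * (d_in + d_out) values are trained per adapted matrix. The sketch below just does that arithmetic; the layer dimensions and rank are hypothetical and say nothing about Amazon's models.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (frozen_params, trainable_adapter_params) for one weight matrix."""
    if min(d_in, d_out, rank) <= 0:
        raise ValueError("dimensions and rank must be positive")
    frozen = d_in * d_out            # the base weight W, left untouched
    adapter = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return frozen, adapter

# Hypothetical layer size and rank.
frozen, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(frozen, adapter, adapter / frozen)
```

Because the adapter is stored separately from the base weights, different adapters can in principle be attached for different use cases while the base model stays the same, which matches the "configuration on top" description in the quote.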

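Today, this approach is already in use for certain content policies. But extending it to frontier risks like CBRN is a harder problem; both the data collection and computational costs are significantly higher. "It's an open research area, studying which approaches work best," Wang notes.

Agreed-upon values: Writing the policies

"We partner with the Amazon science team throughout the entire model development lifecycle," explains Claire O'Brien Rajkumar, leader of the responsible-AI policy and product team. The process starts with understanding what a product team wants to launch, whether an image generation model or a large language model, and mapping potential harms against Amazon's eight core dimensions of responsible AI. Before building an image generator, for instance, the team might anticipate risks such as deepfakes, bias amplification (for instance, images depicting doctors only as white males), or attempts to generate disturbing content.

Identified risks are translated into specific policies that define behavioral boundaries for the model under development. These policies become "backward-working guidelines," O'Brien Rajkumar says, that inform every subsequent decision during model building. For instance, rather than sourcing images from a single vendor that might show only white male doctors, the team ensures diverse data collection that reflects the complexity of the real world.

Amazon's policies are informed by factors including industry trends, customer requests, regulations, and legal requirements (particularly around copyright and content licensing). The team actively participates in industry groups like the Frontier Model Forum and Partnership on AI, collaborating with competitors to establish best practices in an under-regulated space.

Academic partnerships help identify emerging risks through the development of benchmarks as well as engagements such as the Trusted AI track of the Amazon Nova AI Challenge, where university students compete to identify safety vulnerabilities in Nova models and the associated fixes.

Customer feedback shapes practical policy decisions, such as carving out exceptions for legitimate use cases like LLM-based security testing, even when the general policy prohibits malware generation.

The policy team operates through cross-functional working groups that include legal, public-policy, product, security, and RAI experts. Regulatory developments like the EU AI Act and California's AI Transparency Act directly influence policy evolution. "These are living, breathing things," O'Brien Rajkumar notes, acknowledging that policies must adapt as society becomes more comfortable or less comfortable with certain AI risks.

Beyond policy development and specific responsible-product guidelines, the team manages the implementation of AI safeguards and oversees red-teaming operations using both in-house experts and third-party vendors. It also conducts manual reviews of model outputs to assess real-world risk. "These are high-judgement decisions, working on the boundaries of what violates policy or not," says O'Brien Rajkumar. "We have to really understand what each policy means in practice."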

