
Amazon Science · June 4, 2026, 00:56 · approx. 6 min read

Ground Truth Is a Process, Not a Dataset

#RAG #Evaluation #Fact-checking #Reasoning

TL;DR

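Amazon Science argues that the conventional notion of ground truth, which relies on a static dataset, breaks down when evaluating complex AI-generated reports. It proposes audit-then-score, a process-based approach in which models themselves scrutinize the benchmark before they are scored.

AI deep analysis · June 4, 2026, 18:01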

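Key points

The limits of static ground truth

When evaluating complex AI-generated reports (deep research), the conventional approach of having human experts verify claims unassisted yields low accuracy (60.8%). This suggests that a model-benchmark disagreement may reflect a flaw in the benchmark itself rather than a model error.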

The audit-then-score protocol

Rather than treating first-pass human labels as absolute ground truth, the proposed evaluation framework has AI models audit those labels, and scoring happens only after that audit.

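Making evaluation criteria dynamic

For cognitively demanding tasks, a disagreement between a model and a benchmark is not necessarily a model failure. It should be reinterpreted as a possible signal that the benchmark is ambiguous or incomplete.

Key quotes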

Today, the key challenge in AI isn't only how to build better models; it's how to build evaluation systems that can keep up.

Traditionally, we view the ground truth for a problem as a fixed dataset. But we discovered that to evaluate complex AI properly, ground truth has to become a process.

Sometimes, a model's 'error' is actually a signal that the benchmark itself is ambiguous, incomplete, or wrong.

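Impact analysis

This approach marks a paradigm shift in how AI reliability is evaluated. It highlights the importance of meta-evaluation, which examines the quality of the evaluation criteria themselves rather than only measuring model performance. As RAG and agentic AI systems take on more complex reasoning, it could serve as a guide for overcoming the limits of evaluation methods that depend on static benchmarks.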

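Editorial comment

This is a telling example of AI progress exposing the limits of evaluation standards. For developers, it strongly suggests the need to redesign not just the model but the evaluation system itself.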

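Today, the key challenge in AI isn't only how to build better models; it's how to build evaluation systems that can keep up. Search-augmented AI systems can now produce deep research reports: long, polished syntheses of many sources that increasingly resemble expert analysis. But those reports are useful only if their claims are supported by the underlying literature. Most existing fact-checking tools work best when a claim can be matched to a short quote or a single document. In AI-generated research reports, however, a single sentence may combine evidence from several sources. It can depend on the surrounding report for context, and it might compare assertions in a way that no single source does on its own.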

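When Amazon's Artificial General Intelligence (AGI) group started working on the problem of evaluating AI-generated research reports, we thought that the main technical challenge would be building a stronger AI fact checker. But before you can evaluate an AI fact checker, you need a benchmark, a standardized test set used to measure performance. And in this setting, building the benchmark turned out to be at least as hard as building the model.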

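Traditionally, we view the ground truth for a problem as a fixed dataset. But we discovered that to evaluate complex AI properly, ground truth has to become a process. We call that process audit-then-score, and we present it, together with two accompanying datasets, in a paper we recently published to arXiv.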

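When static datasets break down

In the standard method for measuring AI performance, human experts label examples, those labels become the "ground truth" (the undisputed correct answers), and models are scored against them. To test this approach with AI-generated research reports, we recruited PhD-level specialists from fields such as computer science, control theory, education, public health, and environmental engineering. We asked them to verify claims from reports in their own specialties, and we mixed in a hidden set of claims whose answers we already knew.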

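The result was sobering. In a controlled study, unassisted experts achieved only 60.8% accuracy on the hidden set of known answers. The issue was not a lack of expertise. It was that assessing deep-research factuality is an unusually demanding task. Verifying a single claim can require long-context reading, cross-document synthesis, and sustained attention.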

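Normally, in machine learning, when a model disagrees with a benchmark, we assume the model made a mistake. But we realized that, in cognitively demanding tasks like deep research, disagreement should not automatically be treated as a model failure. Sometimes, a model's "error" is actually a signal that the benchmark itself is ambiguous, incomplete, or wrong.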
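The effect of a single wrong label under static scoring can be illustrated with a minimal sketch. Everything here is hypothetical: the function name `static_accuracy`, the claim IDs, and the toy labels are assumptions made for illustration and do not come from the paper.

```python
from typing import Dict


def static_accuracy(predictions: Dict[str, bool], labels: Dict[str, bool]) -> float:
    """Fraction of claims where the model's verdict matches the fixed label."""
    if not labels:
        raise ValueError("labels must not be empty")
    matches = sum(predictions[claim_id] == label for claim_id, label in labels.items())
    return matches / len(labels)


# Hypothetical toy data: True means "the claim is supported by the literature".
true_verdicts = {"c1": True, "c2": False, "c3": True, "c4": True}
expert_labels = {"c1": True, "c2": False, "c3": False, "c4": True}  # c3 is mislabeled
model_verdicts = dict(true_verdicts)  # a model that happens to be right everywhere

print(static_accuracy(model_verdicts, expert_labels))  # 0.75
print(static_accuracy(model_verdicts, true_verdicts))  # 1.0
```

In this toy case the model is penalized on `c3` even though its verdict is correct, which is the situation the paragraph above describes.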

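Audit, then score

Instead of treating the initial expert labels as unquestionable ground truth, we decided to use the models to actively scrutinize the benchmark. This is the core idea behind the audit-then-score protocol. Our paper introduces the protocol alongside DeepFact-Bench, a shared test set for comparing systems, and DeepFact-Eval, a system that checks whether literature supports report claims.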

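Here is how the protocol works. When our AI fact checker disagrees with the current benchmark answer, it is not simply penalized. Instead, it acts as a challenger and must submit concrete evidence and a written rationale for why it thinks the original human answer is wrong. An auditor, which can be a human expert, then steps in. Crucially, auditors do not start from scratch; they compare the challenger's new evidence directly against the benchmark's original rationale. If the challenger makes the stronger case, we revise the benchmark before we score the model.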
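The challenge, audit, revise, and score steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `BenchmarkEntry`, `Verdict`, `Auditor`, and `audit_then_score` are hypothetical, the auditor is modeled as a plain callable, and a single pass stands in for what the article describes as multiple rounds.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BenchmarkEntry:
    label: bool      # current benchmark verdict: is the claim supported?
    rationale: str   # written rationale behind that verdict


@dataclass
class Verdict:
    label: bool      # the fact checker's verdict for the same claim
    evidence: str    # concrete evidence the checker submits
    rationale: str   # written rationale for its verdict


# An auditor compares a challenge with the existing entry and returns True
# when the challenger makes the stronger case (hypothetical interface).
Auditor = Callable[[BenchmarkEntry, Verdict], bool]


def audit_then_score(
    benchmark: Dict[str, BenchmarkEntry],
    verdicts: Dict[str, Verdict],
    auditor: Auditor,
) -> float:
    """Revise disputed benchmark entries in place, then return accuracy."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    # Audit phase: a disagreement becomes a challenge, not an automatic penalty.
    for claim_id, verdict in verdicts.items():
        entry = benchmark[claim_id]
        if verdict.label != entry.label and auditor(entry, verdict):
            benchmark[claim_id] = BenchmarkEntry(verdict.label, verdict.rationale)
    # Score phase: compare against the (possibly revised) benchmark.
    correct = sum(v.label == benchmark[cid].label for cid, v in verdicts.items())
    return correct / len(verdicts)


# Hypothetical toy run; this lambda is a stand-in for a human auditor's judgment.
toy_auditor: Auditor = lambda entry, verdict: verdict.rationale == "strong challenge"
benchmark = {
    "c1": BenchmarkEntry(True, "expert note"),
    "c2": BenchmarkEntry(False, "expert note"),
    "c3": BenchmarkEntry(False, "expert note"),  # mislabeled by the expert
    "c4": BenchmarkEntry(True, "expert note"),
}
verdicts = {
    "c1": Verdict(True, "source A", "agrees with benchmark"),
    "c2": Verdict(True, "source B", "weak challenge"),
    "c3": Verdict(True, "sources C and D", "strong challenge"),
    "c4": Verdict(True, "source E", "agrees with benchmark"),
}
score = audit_then_score(benchmark, verdicts, toy_auditor)
print(score)  # 0.75
```

In the toy run, the challenge on `c3` is upheld and the benchmark is revised before scoring, while the rejected challenge on `c2` still counts against the model.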

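DeepFact-Eval reads the full report context, plans searches to cover the relevant literature, summarizes retrieved documents, and asks follow-up questions when key details are missing. It then produces both a verdict and a written explanation. This fundamentally changes what a benchmark is.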
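The DeepFact-Eval steps listed above (plan searches, summarize retrieved documents, follow up on missing details, return a verdict with an explanation) might be pictured as a small loop. This is only a sketch of that control flow: the function `verify_claim`, the `Judgment` type, every injected callable, and the `max_followups` cap are assumptions made for illustration, and the real system's interfaces are not described in the article.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Judgment:
    verdict: bool       # True if the literature supports the claim
    explanation: str    # written explanation accompanying the verdict


def verify_claim(
    claim: str,
    report: str,
    plan_searches: Callable[[str, str], List[str]],
    retrieve: Callable[[str], List[str]],
    summarize: Callable[[str, str], str],
    missing_details: Callable[[str, List[str]], List[str]],
    decide: Callable[[str, str, List[str]], Judgment],
    max_followups: int = 2,  # assumed cap so the loop always terminates
) -> Judgment:
    """Plan searches, summarize results, follow up on gaps, then decide."""
    queries = plan_searches(claim, report)  # planning uses the full report context
    notes: List[str] = []
    for _ in range(max_followups + 1):
        for query in queries:
            notes.extend(summarize(doc, claim) for doc in retrieve(query))
        # Follow-up questions become the next round of queries.
        queries = missing_details(claim, notes)
        if not queries:
            break
    return decide(claim, report, notes)


# Hypothetical stubs standing in for search and model calls.
result = verify_claim(
    claim="X improves Y",
    report="full report text",
    plan_searches=lambda claim, report: ["q1"],
    retrieve=lambda query: [f"doc for {query}"],
    summarize=lambda doc, claim: doc.upper(),
    missing_details=lambda claim, notes: ["q2"] if len(notes) < 2 else [],
    decide=lambda claim, report, notes: Judgment(len(notes) >= 2, f"{len(notes)} notes reviewed"),
)
print(result.explanation)  # 2 notes reviewed
```

Passing the components in as callables keeps the sketch runnable without naming any particular search or model API.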

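A new role for human expertise

One of the most striking things we found is that the same experts who were unreliable as one-shot labelers became far more reliable when placed in the role of auditor. Across four rounds of audit-then-score, accuracy on our hidden test set rose from 60.8% to 90.9%.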

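When experts start from a blank page, they have to find the evidence, interpret it, and make a judgment on their own; when they audit a disputed claim, they can focus on comparing two concrete cases. This shift had significant impact. On DeepFact-Bench, DeepFact-Eval reached 83.4% accuracy when we used GPT-4.1 as the underlying model. That was higher than the 58.5% of the best traditional fact-checking system we tested and the 69.1% of a strong prior deep-research system.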
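For a quick sense of scale, the gaps between the reported figures can be computed directly. The numbers below are the ones quoted in the article; the variable names and the helper `point_gap` are illustrative. Note that the first gap concerns expert accuracy on the hidden set, while the other two concern system accuracy on DeepFact-Bench, so they are not directly comparable.

```python
# Figures as reported in the article, in percent.
expert_accuracy_before = 60.8  # unassisted experts on the hidden set
expert_accuracy_after = 90.9   # after four rounds of audit-then-score
deepfact_eval = 83.4           # DeepFact-Eval with GPT-4.1 on DeepFact-Bench
best_traditional = 58.5        # best traditional fact-checking system tested
prior_deep_research = 69.1     # strong prior deep-research system


def point_gap(higher: float, lower: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(higher - lower, 1)


print(point_gap(expert_accuracy_after, expert_accuracy_before))  # 30.1
print(point_gap(deepfact_eval, best_traditional))                # 24.9
print(point_gap(deepfact_eval, prior_deep_research))             # 14.3
```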

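Evaluation as an evolving infrastructure

This shift has implications beyond one paper or one task. If AI systems continue improving, to the point that they exhibit humanlike expertise, the community will increasingly run into settings where evaluation based on one-time human answers is not enough. In those settings, sustaining benchmark quality may require auditing, revision, calibration, and periodic revalidation. Evaluation will become an ongoing collaboration among humans, models, and the evidence they surface together.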

Acknowledgments: Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Markus Dreyer

