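Google study finds AI benchmarks systematically ignore how humans disagree
A Google study finds that the three to five human raters typical of AI benchmarks are not enough for reliable results, and that how the evaluation budget is allocated matters as much as the size of the budget itself.
Key points
Reliability problems in benchmark evaluation
Current AI benchmarks typically assign three to five human raters to each test example, but the Google study points out that this number does not yield reliable evaluations.
The importance of allocating the evaluation budget well
The study shows that what matters is not only the total budget spent on evaluation but also how it is allocated, for example whether to add more raters or more test examples.
Human disagreement is ignored
Current benchmarking methods systematically ignore disagreement among human raters, which undermines the reliability of evaluation results.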
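Impact analysis
This research calls for a fundamental rethink of AI evaluation methodology and points the way toward more reliable benchmark design. By improving the evaluation process, the AI development community should be able to measure model performance more accurately.
Editorial comment
A methodological study on making AI evaluation more reliable, rich in practical implications. It offers findings that benchmark designers can apply right away.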

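Research from Google shows that placing the standard three to five human raters on each test example is often not enough to produce a reliable AI benchmark, and that how the annotation budget is allocated matters about as much as the total budget itself.
This article, "AI benchmarks systematically ignore how humans disagree, Google study finds," was first published on The Decoder.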
How many evaluators does a good AI benchmark actually need? New research shows that the standard three to five raters per test example often aren't enough, and that how you allocate your annotation budget matters just as much as how big it is.
When AI models go head-to-head, human evaluations often decide which one comes out on top. Evaluators rate things like whether a comment is toxic or whether a chatbot response is safe.
The problem is that people frequently disagree on these calls. Standard practice in AI research is to collect three to five ratings per example and pick a single "correct" answer by majority vote. That approach systematically throws out the diversity of human opinion.
Both comments get the same "Toxic" label by majority vote, even though evaluators in the second case disagree significantly. Standard benchmarks ignore this difference entirely. | Image: Google
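To make the point concrete, here is a minimal Python sketch of how a majority vote can hide disagreement. The ratings and the helper names (majority_label, label_distribution) are invented for illustration and do not come from the study.

```python
from collections import Counter

def majority_label(ratings):
    """Return the most common label among the ratings."""
    return Counter(ratings).most_common(1)[0][0]

def label_distribution(ratings):
    """Return the share of raters who chose each label."""
    counts = Counter(ratings)
    total = len(ratings)
    return {label: count / total for label, count in counts.items()}

# Hypothetical ratings for two comments, five raters each.
clear_case = ["toxic", "toxic", "toxic", "toxic", "toxic"]
contested_case = ["toxic", "toxic", "toxic", "not_toxic", "not_toxic"]

for name, ratings in [("clear", clear_case), ("contested", contested_case)]:
    print(name, majority_label(ratings), label_distribution(ratings))
```

Both comments end up with the label "toxic", yet only the distribution shows that two of the five raters disagreed in the second case.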
Researchers from Google Research and the Rochester Institute of Technology wanted to find a smarter way to spend a limited rating budget. The key question: Is it better to evaluate as many test examples as possible or to have fewer examples rated by a lot more people?
The researchers frame the dilemma with a simple restaurant analogy. Imagine asking 1,000 guests to each sample a single dish: you'd get a broad but shallow snapshot. Now imagine asking 20 diners to rate the same 50 dishes. You'd walk away with a far richer picture of what's actually good and what isn't. Today's AI benchmarks overwhelmingly follow the first model, casting a wide net across test examples while collecting only a thin layer of human judgment for each one.
Stress-testing thousands of budget splits
To find the sweet spot, the team built a simulator that replicates human rating patterns using real datasets. The simulator generates synthetic evaluation data for two models, with one performing worse than the other in a controlled way. This setup makes it possible to test which conditions let you reliably detect the difference between models.
When comparing models, both AI systems and human raters evaluate the same text. A metric then determines which model gets closer to the human judgments. | Image: Google
The team calibrated the simulator against five real datasets covering toxicity detection, chatbot safety, and cross-cultural offensiveness assessment. All told, they tested thousands of combinations across different total budgets and rater counts per example.
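The idea behind such a simulator can be sketched in a few lines of Python. This is a toy version under strong assumptions, not the researchers' simulator: the uniform latent rating probability, the Gaussian model noise, the noise levels 0.10 and 0.15, and the mean-absolute-error metric are all made up for illustration, whereas the study calibrates against real datasets.

```python
import random

def simulate_trial(n_items, k_raters, noise_a, noise_b, rng):
    """Run one synthetic comparison; return True if model A scores better than model B."""
    err_a = err_b = 0.0
    for _ in range(n_items):
        # Assumed latent share of raters who would call this item "toxic".
        p_true = rng.random()
        # Sample k raters and record the share the benchmark would observe.
        votes = sum(rng.random() < p_true for _ in range(k_raters))
        p_human = votes / k_raters
        # Each model predicts the latent share plus Gaussian noise, clipped to [0, 1].
        pred_a = min(1.0, max(0.0, p_true + rng.gauss(0, noise_a)))
        pred_b = min(1.0, max(0.0, p_true + rng.gauss(0, noise_b)))
        err_a += abs(pred_a - p_human)
        err_b += abs(pred_b - p_human)
    return err_a < err_b

def detection_rate(n_items, k_raters, trials=200, seed=0):
    """Share of trials in which the better model (A, lower noise) is correctly ranked first."""
    rng = random.Random(seed)
    wins = sum(simulate_trial(n_items, k_raters, 0.10, 0.15, rng) for _ in range(trials))
    return wins / trials

# Same total budget of 1,000 annotations, split three different ways.
for n_items, k_raters in [(1000, 1), (200, 5), (50, 20)]:
    print(n_items, k_raters, detection_rate(n_items, k_raters))
```

Because model A is built to be better, the detection rate shows how often a given split of items and raters recovers the known ranking. Which split wins in this toy setup depends on the assumed noise levels and metric, so the numbers say nothing about the study's actual results.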
Fewer than ten raters per example isn't cutting it
The results call current practice into question. The typical three to five raters per test example often aren't enough to make model comparisons reproducible, according to the study. For statistically reliable results that actually capture the range of human opinion, you generally need more than ten raters per example.
More raters per example means more reliable detection of differences between models. The effect is especially strong with smaller budgets. | Image: Google
The experiments also show that reliable results can often be achieved with around 1,000 total annotations, but only if the budget is split correctly between test examples and raters. Get the balance wrong, and you can end up with unreliable conclusions even on a much larger budget, the researchers say.
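The trade-off is simple arithmetic: the number of test examples times the number of raters per example cannot exceed the total annotation budget. A short Python sketch, with a hypothetical helper name and arbitrarily chosen rater counts, lists what a budget of 1,000 annotations buys:

```python
def budget_splits(total_annotations, rater_counts):
    """List (items, raters_per_item) pairs that fit within the annotation budget."""
    return [(total_annotations // k, k) for k in rater_counts if k <= total_annotations]

# Hypothetical rater counts; the budget of 1,000 is the figure mentioned above.
for items, raters in budget_splits(1000, [1, 3, 5, 10, 20, 50]):
    print(f"{items:>5} items x {raters:>2} raters = {items * raters} annotations")
```

Going from 5 to 20 raters per example on the same budget cuts the number of test examples from 200 to 50, which is the balance the researchers say has to be chosen deliberately.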
What you measure should dictate how you spend
The biggest takeaway is that there's no one-size-fits-all ratio. The right strategy depends entirely on what you're trying to measure.
If you're using accuracy—checking whether a model agrees with the evaluators' majority vote—a wide approach works best: as many test examples as possible with just a few raters each. Accuracy only looks at the most common answer, so extra raters barely move the needle.
But if you want to capture the full spread of human responses—using a metric like total variation, for instance—you need the opposite playbook. Fewer test examples, but way more raters per example. That's the only way to map how much evaluators actually agree or disagree.
Different examples can get the same majority-vote label yet have very different response distributions underneath. In the experiments, this distribution-aware metric also needed the smallest overall budget to produce reliable results.
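The contrast between the two kinds of metric can be shown with a small Python sketch. It assumes the standard total variation distance (half the sum of absolute differences between two distributions); the study's exact formulation is not given here, and the distributions below are invented.

```python
def accuracy(model_labels, human_majority_labels):
    """Share of items where the model's label matches the human majority label."""
    matches = sum(m == h for m, h in zip(model_labels, human_majority_labels))
    return matches / len(human_majority_labels)

def total_variation(p, q):
    """Total variation distance between two label distributions given as dicts."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)

# Hypothetical human rating distributions: one unanimous item, one contested item.
human = [
    {"toxic": 1.0, "not_toxic": 0.0},
    {"toxic": 0.6, "not_toxic": 0.4},
]
# Hypothetical model that is fully confident on both items.
model = [
    {"toxic": 1.0, "not_toxic": 0.0},
    {"toxic": 1.0, "not_toxic": 0.0},
]

human_majority = [max(d, key=d.get) for d in human]
model_labels = [max(d, key=d.get) for d in model]
tv_per_item = [total_variation(m, h) for m, h in zip(model, human)]

print("accuracy:", accuracy(model_labels, human_majority))
print("mean total variation:", sum(tv_per_item) / len(tv_per_item))
```

The model scores a perfect accuracy because it matches the majority label on both items, while the total variation of roughly 0.4 on the contested item reveals that it misses the disagreement among raters. Estimating that human distribution well is what requires many raters per example.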