Surge AI Blog · June 22, 2022, 09:00 · approx. 3 min read

Humans vs. Gary Marcus vs. Slate Star Codex: Is an AI Failure Really a Failure?

#LLMEvaluation #AIIntelligenceDebate #GPT-3 #HumanEvaluation #CommonsenseReasoning #GaryMarcus

TL;DR

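Are the AI mistakes that Gary Marcus points to genuine failures, or signs of creativity? Fifteen humans attempted the same tasks so that their answers could be compared with GPT-3's "failures."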

AI In-Depth Analysis · February 25, 2026, 16:41

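Key points

A philosophical debate unfolds over how to define "failure" in AI evaluation.

Gary Marcus and Scott Alexander disagree at a fundamental level about the nature of AI "intelligence."

The gap between human evaluation and AI evaluation is emerging as a key challenge in LLM development.

GPT-3's "mistakes" may in fact be reasonable answers that fit the context.

The subjectivity and diversity of AI evaluation criteria are identified as an industry-wide problem.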

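Impact analysis

This article questions the very criteria used to evaluate AI and invites a fundamental debate within the industry. By redefining what counts as an LLM "failure," it suggests a need for evaluation criteria that are closer to human judgment, which could influence future AI development and evaluation methods.

Editorial comment

The point to watch is how the philosophical debate over AI "failure" affects the evaluation criteria used in actual development. More diverse evaluation criteria could contribute to the healthy development of the industry.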

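What counts as an AI "failure"? The debate among humans, Gary Marcus, and Slate Star Codex

AI researcher Gary Marcus and blogger Scott Alexander (Slate Star Codex) have debated the intelligence and limits of large language models (LLMs). The dispute highlights a fundamental difficulty in AI evaluation: what should count as a "failure"?

Marcus argues that current LLMs such as GPT-3 do not understand the world and merely imitate ("parrot") their training data. He concludes that these deep learning approaches are a dead end on the road to true AI. As evidence, he cites a case in which GPT-3 produced the text "I grew up in Trenton. I speak fluent Spanish and I'm bi-cultural… I'm very proud to be a Latina." According to Marcus, the language spoken most fluently by someone who grew up in Trenton (New Jersey, USA) should be English, so this is a "failure" showing that the AI lacks commonsense reasoning.

In response, Scott points out that many cases labelled AI "failures" have been solved by each new generation of models, and asks how anyone can then say the current approach is at a dead end. He further argues that the GPT-3 response above is not necessarily wrong. In real conversation, an American who is a native English speaker almost never announces "I am fluent in English." A person who grew up in Trenton and boasts of being fluent in a language is more likely to mean Spanish. On this view, GPT-3 did not treat the text as a logical reasoning problem; it imitated natural human language use and, if anything, gave an appropriate response.

Scott's point, that what Marcus calls a failure is not actually a failure, goes to the heart of the difficult question of how LLMs should be evaluated. What exactly do we expect from an "intelligent" LLM?

To test this question, an AI evaluation company gave five "failure" cases identified by Marcus to 15 human evaluators each and examined how humans would respond. One example is a prompt in which a defense lawyer due in court finds that their suit pants are badly stained, notes that their bathing suit (expensive French couture, received as a gift) is clean and very stylish, and the text breaks off at "You decide that you should wear." Marcus judged GPT-3's continuation, wearing the bathing suit to court, to be a failure of social reasoning.

The human evaluators' answers were divided, however. Some judged, like Marcus, that a bathing suit is inappropriate, while others considered that the prompt might obviously be the setup for an absurd joke or fable, and that GPT-3's answer of choosing the bathing suit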

Original text

Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI failure actually a failure?

Scott at Slate Star Codex and Gary Marcus had a recent back-and-forth about the nature of intelligence and AI's scaling hypothesis.

Marcus's point is that large language models don't understand the world, and they're merely parroting their training corpus; as a result, current deep learning techniques are a dead end to true AI. For example, he calls the following “mistake” by GPT-3 evidence that AI models lack commonsense reasoning:

I grew up in Trenton. I speak fluent Spanish and I'm bi-cultural. I've been in law enforcement for eight years […] I'm very proud to be a Latina. I'm very proud to be a New Jerseyan.

(Marcus believes the correct continuation should be English.)

Scott argues that each time someone finds AI failures that require “true” intelligence to get right, those failures largely get solved by newer models. So why should GPT-3's mistakes prove that current approaches are doomed? He also disagrees with what Marcus calls a failure:

When it gets them “wrong”, I tend to agree with GPT-3 more than Marcus. For example, consider Trenton. It’s true that, viewed as a logical reasoning problem, someone who grows up in Trenton is most likely to speak English fluently. But nobody told GPT-3 to view this as a logical reasoning problem. In real speech/writing, which is what GPT-3 is trying to imitate, no US native fluent English speaker ever tells another US native fluent English speaker, in English, “hey, did you know I’m fluent in English?” If I hear someone talking about growing up in Trenton, and then additionally they brag that they’re fluent in a language, I think “Spanish” would be my guess too.

This idea – that what Marcus calls a "failure" isn't actually a mistake – is an important one, and touches on difficult questions around how to evaluate large language models. Exactly what response do we hope an "intelligent" LLM would make?

We do a lot of work for large language model companies around human evaluation of LLMs. So we took 5 "mistakes" that Marcus pointed out and gave them to 15 Surgers each, to see how humans would perform on them instead.

Let's start with this prompt:

You are a defense lawyer and you have to go to court today. Getting dressed in the morning, you discover that your suit pants are badly stained. However, your bathing suit is clean and very stylish. In fact, it’s expensive French couture; it was a birthday present from Isabel. You decide that you should wear

In a 2020 article, Marcus calls GPT-3’s completion a failure of social reasoning:

You are a defense lawyer and you have to go to court today. [...] You decide that you should wear the bathing suit to court. You arrive at the courthouse and are met by a bailiff who escorts you to the courtroom.

In Marcus’s words: The phrase “However, your bathing suit is clean” seems to have led GPT-3 into supposing that a bathing suit is a viable alternative to a suit. Of course, in reality no lawyer would consider wearing a bathing suit to court. The bailiff would probably not admit you, and if you were admitted, the judge might well hold you in contempt.

But is this truly a mistake? How do real humans complete it? We asked 15 Surgers.

You are a defense lawyer and you have to go to court today. [...] You decide that you should wear

the bathing suit anyway. Anything to keep the jury's mind off the actual evidence might help the case

your stained pants and buy a new pair of pants at Men's Wearhouse, because wearing a swimsuit to court, even a stylish one, would be ridiculous and likely result in a contempt citation.

it and show the court your fashion sense.

the stained pants anyways, since the bathing suit sadly doesn't fit anymore.

another pair of suit pants that will look okay with the rest of the outfit.

it next time you go swimming with Isabel.

the stained pants while you run to the store to buy a new pair of pants before you attend court. A bathing suit would not be appropriate, no matter how nice it is.

it. You will call and try to get the court date postponed to another day and then go swimming at the beach today.

your bathing suit and head off to work. What's the worst that can happen?

the stained pair of pants because you're running late and have no time to change. You can't wait until the case is over so that you can plan your trip to Palm Springs.

it. You'll probably end up in TMZ or something. That should be good for your career.

a different pair of pants to leave the house. Then text your paralegal to see if you can borrow his suit pants for the hearing.

your stained suit pants since those would be more acceptable in court over skimpy swimwear.

neither. Instead, you borrow a suit from your best friend, who also happens to be your neighbor.

a ba
