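The Gradient · June 4, 2025, 23:00 · approx. 8 min read

AGI Is Not Multimodal

#AGI #LLM #Multimodal #EmbodiedAI #WorldModels

TL;DR

This article from The Gradient argues that current multimodal LLMs will fail as a path to true AGI because they lack embodiment and interaction with the physical world, and it calls for a new approach to designing intelligence that is premised on interaction with the environment.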

AI In-Depth Analysis · May 3, 2026, 02:03

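Key Points

Limits of the multimodal approach

Current multimodal AI merely combines different modalities and lacks capabilities required for true general intelligence, such as sensorimotor reasoning and social coordination, so achieving AGI in the near term will be difficult.

Lack of embodiment

Because LLMs have no direct experience of the physical world, they lack the "tacit embodied understanding" needed to solve real-world tasks such as repairing a car or untying a knot, which is a fatal flaw where the definition of AGI is concerned.

Misreading next-word prediction

Against the claim that LLMs learn world models, the author counters that they are in fact only a collection of heuristics for predicting tokens and have only a superficial understanding of reality.

A proposal for a new design of intelligence

Rather than gluing modality-centered processing together as a patchwork, the author argues for shifting to an approach that takes embodiment and interaction with the environment as first principles, with modality processing derived from them.

The fundamental difference between language and the physical world

A game built on a symbolic system, such as Othello, permits complete inference of its state, but physical-world tasks such as cleaning or driving cannot be handled with linguistic symbols alone and require an understanding of physical reality.

Skepticism toward the LLM world-model hypothesis

The metaphor that "language mirrors the structure of reality" has been taken too literally, fostering the misconception that LLMs intrinsically understand the physical world the way humans do.

Limits of symbol manipulation

Many problems in the physical world cannot be fully represented in a symbolic system and cannot be solved by symbol manipulation alone, so it does not follow that LLMs possess a true world model.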

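Impact Analysis

This article is an important warning about the AI industry's excessive reliance on scaling and on multimodality, and it urges a fundamental rethinking of the technical approach to AGI. In particular, its point that a language-model-centered approach that ignores embodiment and action in the physical world has limits is an occasion to reaffirm the importance of research in robotics and embodied AI.

Editorial Comment

As a sober critique of the current generative AI boom, the view that language models lacking interaction with the physical world cannot reach true intelligence is likely to prompt an extremely important discussion in the research community.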

"In projecting language back as the model for thought, we lose sight of the tacit embodied understanding that undergirds our intelligence." –Terry Winograd

The recent successes of generative AI models have convinced some that AGI is imminent. While these models appear to capture the essence of human intelligence, they defy even our most basic intuitions about it. They have emerged not because they are thoughtful solutions to the problem of intelligence, but because they scaled effectively on hardware we already had. Seduced by the fruits of scale, some have come to believe that it provides a clear pathway to AGI. The most emblematic case of this is the multimodal approach, in which massive modular networks are optimized for an array of modalities that, taken together, appear general. However, I argue that this strategy is sure to fail in the near term; it will not lead to human-level AGI that can, e.g., perform sensorimotor reasoning, motion planning, and social coordination. Instead of trying to glue modalities together into a patchwork AGI, we should pursue approaches to intelligence that treat embodiment and interaction with the environment as primary, and see modality-centered processing as emergent phenomena.

Preface: Disembodied definitions of Artificial General Intelligence — emphasis on general — exclude crucial problem spaces that we should expect AGI to be able to solve. A true AGI must be general across all domains. Any complete definition must at least include the ability to solve problems that originate in physical reality, e.g. repairing a car, untying a knot, preparing food, etc. As I will discuss in the next section, what is needed for these problems is a form of intelligence that is fundamentally situated in something like a physical world model. For more discussion on this, look out for Designing an Intelligence. Edited by George Konidaris, MIT Press, forthcoming.

Why We Need the World, and How LLMs Pretend to Understand It

TLDR: I first argue that true AGI needs a physical understanding of the world, as many problems cannot be converted into a problem of symbol manipulation. It has been suggested by some that LLMs are learning a model of the world through next token prediction, but it is more likely that LLMs are learning bags of heuristics to predict tokens. This leaves them with a superficial understanding of reality and contributes to false impressions of their intelligence.

The most shocking result of the predict-next-token objective is that it yields AI models that reflect a deeply human-like understanding of the world, despite having never observed it like we have. This result has led to confusion about what it means to understand language and even to understand the world — something we have long believed to be a prerequisite for language understanding. One explanation for the capabilities of LLMs comes from an emerging theory suggesting that they induce models of the world through next-token prediction. Proponents of this theory cite the prowess of SOTA LLMs on various benchmarks, the convergence of large models to similar internal representations, and their favorite rendition of the idea that “language mirrors the structure of reality,” a notion that has been espoused at least by Plato, Wittgenstein, Foucault, and Eco. While I’m generally in support of digging up esoteric texts for research inspiration, I’m worried that this metaphor has been taken too literally. Do LLMs really learn implicit models of the world? How could they otherwise be so proficient at language?

One source of evidence in favor of the LLM world modeling hypothesis is the Othello paper, wherein researchers were able to predict the board of an Othello game from the hidden states of a transformer model trained on sequences of legal moves. However, there are many issues with generalizing these results to models of natural language. For one, whereas Othello moves can provably be used to deduce the full state of an Othello board, we have no reason to believe that a complete picture of the physical world can be inferred by a linguistic description. What sets the game of Othello apart from many tasks in the physical world is that Othello fundamentally resides in the land of symbols, and is merely implemented using physical tokens to make it easier for humans to play. A full game of Othello can be played with just pen and paper, but one can’t, e.g., sweep a floor, do dishes, or drive a car with just pen and paper. To solve such tasks, you need some physical conception of the world beyond what humans can merely say about it. Whether that conception of the world is encoded in a formal world model or, e.g., a value function is up for debate, but it is clear that there are many problems in the physical world that cannot be fully represented by a system of symbols and solved with mere symbol manipulation.

Another issue, stated in Melanie Mitchell’s recent piece and supported by this paper, is that there is evidence that generative models can score remarkably well on sequence prediction tasks while failing to learn models of the worlds that created such sequence data, e.g. by learning comprehensive sets of idiosyncratic heuristics. For example, it was pointed out in this blog post that OthelloGPT learned sequence prediction rules that don’t actually hold for all possible Othello games, like “if the token for B4 does not appear before A4 in the input string, then B4 is empty.” While one can argue that it doesn’t matter how a world model predicts the next state of the world, it should raise suspicion when that prediction reflects a better understanding of the training data than of the underlying world that led to such data. This, unfortunately, is the central fault of the predict-next-token objective, which seeks only to retain information relevant to the prediction of the next token. If it can be done with something easier to learn than a world model, it likely will be.

To claim without caveat that predicting the effects of earlier symbols on later symbols requires a model of the world like the ones humans generate from perception would be to abuse the “world model” notion. Unless we disagree on what the world is, it should be clear that a true world model can be used to predict the next state of the physical world given a history of states. Similar world models, which predict high fidelity observations of the physical world, are leveraged in many subfields of AI including model-based reinforcement learning, task and motion planning in robotics, causal world modeling, and areas of computer vision to solve problems instantiated in physical reality. LLMs are simply not running physics simulations in their latent next-token calculus when they ask you if your person, place, or thing is bigger than a breadbox. In fact, I conjecture that the behavior of LLMs is not thanks to a learned world model, but to brute force memorization of incomprehensibly abstract rules governing the behavior of symbols, i.e. a model of syntax.
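To make the notion of "predicting the next state given a history of states" concrete, here is a minimal sketch in Python. It is not taken from the essay: the state layout, the free-fall dynamics, the explicit Euler step, and the names `State`, `predict_next`, and `rollout` are all assumptions chosen purely for illustration.

```python
from typing import List, Tuple

# Hypothetical state: (height in metres, vertical velocity in m/s).
State = Tuple[float, float]

G = 9.81  # gravitational acceleration in m/s^2


def predict_next(history: List[State], dt: float = 0.1) -> State:
    """Predict the next physical state from a history of states.

    This toy model is Markovian: it only reads the latest state and
    advances it with one explicit Euler step of free fall.
    """
    height, velocity = history[-1]
    next_height = height + velocity * dt
    next_velocity = velocity - G * dt
    return (next_height, next_velocity)


def rollout(initial: State, steps: int, dt: float = 0.1) -> List[State]:
    """Repeatedly apply the model to its own predictions."""
    history = [initial]
    for _ in range(steps):
        history.append(predict_next(history, dt))
    return history


if __name__ == "__main__":
    for height, velocity in rollout((10.0, 0.0), steps=5):
        print(f"height={height:.3f} m, velocity={velocity:.3f} m/s")
```

The point of the sketch is only that the model's inputs and outputs are physical quantities, not tokens. The world models used in robotics or model-based reinforcement learning are typically learned and far richer than this.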

Syntax is a subfield of linguistics that studies how words of various grammatical categories (e.g. parts of speech) are arranged together into sentences, which can be parsed into syntax trees. Syntax studies the structure of sentences and the atomic parts of speech that compose them.

Semantics is another subfield concerned with the literal meaning of sentences, e.g., compiling “I am feeling chilly” into the idea that you are experiencing cold. Semantics boils language down to literal meaning, which is information about the world or human experience.

Pragmatics studies the interplay of physical and conversational context on speech interactions, like when someone knows to close an ajar window when you tell them “I am feeling chilly.” Pragmatics involves interpreting speech while reasoning about the environment and the intentions and hidden knowledge of other agents.

Without getting too technical, there is intuitive evidence that somewhat separate systems of cognition are responsible for each of these linguistic faculties. Look no further than the capability for humans to generate syntactically well-formed sentences that have no semantic meaning, e.g. Chomsky’s famous sentence “Colorless green ideas sleep furiously,” or sentences with well-formed semantics that make no pragmatic sense, e.g. responding merely with “Yes, I can” when asked, “Can you pass the salt?” Crucially, it is the fusion of the disparate cognitive abilities underpinning them that coalesce into human language understanding. For example, there isn’t anything syntactically wrong with the sentence, “The fridge is in the apple,” as a syntactic account of “the fridge” and “the apple” would categorize them as noun phrases that can be used to produce a sentence with the production rule, S → (NP “is in” NP). However, humans recognize an obvious semantic failure in the sentence that becomes apparent after attempting to reconcile its meaning with our understanding of reality: we know that fridges are larger than apples, and could not be fit into them.
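The production rule S → (NP “is in” NP) can be sketched as a toy recognizer in Python. The two-entry lexicon `NOUN_PHRASES` and the function name `is_syntactically_valid` are hypothetical and chosen only for illustration. The sketch shows that a purely syntactic check accepts both orderings of the two noun phrases.

```python
# Toy recognizer for the single production S -> NP "is in" NP.
# NOUN_PHRASES is a hypothetical two-item lexicon used for illustration.
NOUN_PHRASES = {"the fridge", "the apple"}


def is_syntactically_valid(sentence: str) -> bool:
    """Return True if the sentence matches S -> NP "is in" NP."""
    text = sentence.lower().rstrip(".")
    left, sep, right = text.partition(" is in ")
    # sep is empty when the literal " is in " does not occur at all.
    return bool(sep) and left in NOUN_PHRASES and right in NOUN_PHRASES


if __name__ == "__main__":
    print(is_syntactically_valid("The fridge is in the apple."))  # True
    print(is_syntactically_valid("The apple is in the fridge."))  # True
    print(is_syntactically_valid("The fridge the apple is in."))  # False
```

Nothing in this check knows anything about sizes or containers, which is why the semantically absurd sentence passes.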

But what if you have never perceived the real world, yet still were trying to figure out whether the sentence was ill-formed? One solution could be to embed semantic information at the level of syntax, e.g., by inventing new syntactic categories, NP_{the fridge} and NP_{the apple}, and a single new production rule that prevents semantic misuse: S → (NP_{the apple} “is in” NP_{the fridge}). While this strategy would no longer require grounded world knowledge about, e.g., fridges and apples, it would require special grammar rules for every semantically well-formed construction… which is actually possible to learn given a massive corpus of natural language. Crucially, this would not be the same thing as grasping semantics, which in my view is fundamentally about understanding the nature of the world.
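The idea of folding a semantic constraint into syntax can be sketched in Python as follows. The `CATEGORY` and `PRODUCTIONS` tables and the name `is_licensed` are hypothetical illustrations, not anything an LLM is known to implement. Each noun phrase gets its own category, and only the listed pairing counts as a production.

```python
# Each noun phrase is given its own syntactic category (hypothetical names).
CATEGORY = {
    "the fridge": "NP_the_fridge",
    "the apple": "NP_the_apple",
}

# The only licensed production: S -> NP_the_apple "is in" NP_the_fridge
PRODUCTIONS = {("NP_the_apple", "NP_the_fridge")}


def is_licensed(sentence: str) -> bool:
    """Return True if the sentence is derivable from the lexicalized grammar."""
    text = sentence.lower().rstrip(".")
    left, sep, right = text.partition(" is in ")
    if not sep:
        return False
    # Unknown phrases map to None and therefore never match a production.
    pair = (CATEGORY.get(left), CATEGORY.get(right))
    return pair in PRODUCTIONS


if __name__ == "__main__":
    print(is_licensed("The apple is in the fridge."))  # True
    print(is_licensed("The fridge is in the apple."))  # False
```

The table rejects the absurd sentence without encoding anything about the sizes of fridges or apples. The cost is one entry per acceptable pairing, so the rule set grows with every semantically well-formed construction it has to cover.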

Finding that LLMs have reduced problems of semantics and pragmatics into syntax would have profound implications on how we should view their intelligence. People often treat language proficiency as a proxy for general intelligence by, e.g., strongly associating pragmatic and semantic understanding with the cognitive abilities that undergird them in humans. For example, someone who appears well-read and graceful in navigating social interactions is likely to score high in traits like sustained attention and theory of mind, which lie closer to measures of raw cognitive ability. In general, these proxies are reasonable for assessing a person’s general intelligence, but not an LLM’s, as the apparent linguistic skills of LLMs could come from entirely separate mechanisms of cognition.

The Bitter Lesson Revisited

TLDR: Sutton’s Bitter Lesson has sometimes been interpreted as meaning that making any assumptions about the structure of AI is a mistake. This is both unproductive and a misinterpretation; it is precisely when humans think deeply about the structure of intelligence that major advancements occur. Despite this, scale maximalists have implicitly suggested that multimodal models can be a structure-agnostic framework for AGI. Ironically, today’s multimodal models contradict Sutton’s Bitter Lesson by making implicit assumptions about the structure of individual modalities and how they should be sewn together. In order to build AGI, we must either think deeply about how to unite existing modalities, or dispense with them altogether in favor of an interactive and embodied cognitive process.

The paradigm that led to the success of LLMs is marked primarily by scale, not efficiency. We have effectively trained a pile of one trillion ants for one billion years to mimic the form and function of a Formula 1 race car; eventually it gets there, but wow was the process inefficient. This analogy nicely captures a debate between structuralists, who want to build things like "wheels" and "axles" into AI systems, and scale maximalists, who want more ants, years, and F1 races to train on. Despite many deca
