The Decoder · April 12, 2026, 21:09 · about a 1-minute read

Researchers clarify the definition of world models, excluding text-to-video AI

#WorldModels #GenerativeAI #ResearchStandardization #Sora #Multimodal #ConceptDefinition

TL;DR

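An international research team has released OpenWorldLib, which sets out a common definition for world model research and explicitly excludes text-to-video models such as Sora from that definition.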

AI deep analysis · April 12, 2026, 22:40

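Key points

Defining world model research

An international research team proposes OpenWorldLib to bring order to the fragmented world model research landscape.

Excluding text-to-video models

Under the team's definition, text-to-video models such as Sora are explicitly not counted as world models.

Impact on the research community

This definition may make the direction and evaluation criteria of world model research clearer.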

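Impact analysis

This article shows that an important debate over concept definitions is under way in the AI research community. Drawing a clear line between world models and generative AI models could influence future research directions and evaluation criteria.

Editorial comment

An article that underlines the importance of concept definitions in the research community. In the fast-moving AI field, it is a good example of how settling foundational concepts lays the groundwork for technical progress.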

This article, "Researchers clarify the definition of world models, excluding text-to-video models," was first published on The Decoder.

An international research team wants to bring order to the fragmented world model research landscape with OpenWorldLib. Text-to-video models like Sora are explicitly left out of their definition.

The term "world model" comes up constantly in AI research, but nobody has agreed on what actually counts as one. A team from Peking University, Kuaishou Technology (the company behind the Kling video generator), the National University of Singapore, Tsinghua, and other institutions wants to fix that with OpenWorldLib. Their paper lays out both a standardized definition and a unified open-source framework that pulls various world model tasks together in one place.

The way the researchers see it, a world model has to be grounded in perception, able to interact with its environment, and capable of long-term memory, all so it can understand and predict how a complex world behaves. A world model is defined by its ability to take in multimodal input from the real world and use it to analyze and respond to its surroundings, regardless of what it outputs.

Why Sora doesn't make the cut as a world model

The paper's most provocative call concerns text-to-video generation. When OpenAI rolled out its now-discontinued Sora video model, plenty of people called it a "world simulator." DeepMind CEO Demis Hassabis made similar claims about Google's Veo video model, positioning it as a step toward world models.

The authors flat-out disagree, landing on the same side as Yann LeCun: while video generation shows some grasp of physical relationships, it's missing the crucial feedback loop with the real world. A model that only generates videos from text doesn't perceive its environment and doesn't interact with it. Text-to-video therefore falls "outside the core tasks of world models," the paper states.

The researchers also cut code generation, web search, and avatar video generation from the definition. Avatar videos, for example, are geared toward entertainment and have little to do with understanding the physical world.

Simulation environments like LIBERO and AI2-THOR test whether world models can turn voice instructions into physically plausible movement sequences. | Image: OpenDCAI

Real-world models need interaction, not passive generation

Rather than passive media generation, the researchers zero in on three task areas:

In interactive video generation, a model predicts the next frame based on previous frames and user input. Unlike text-to-video, it reacts to actions like control commands or camera movements.

Multimodal reasoning covers the ability to figure out spatial, temporal, and causal relationships from images, videos, and audio, like understanding where an object is or why something happened.

In vision-language-action, the model converts visual input and voice instructions into specific movement commands for robotic arms or self-driving vehicles.
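The first of these task areas, interactive video generation, differs from text-to-video mainly in its loop: each new frame depends on the frames so far and on the latest user action. The following is a minimal toy sketch of that loop, not code from the paper or from OpenWorldLib. The names `Frame`, `ACTIONS`, `predict_next_frame`, and `interactive_rollout` are hypothetical, and a "frame" is reduced to a camera position so the example stays runnable.

```python
from typing import List, Tuple

# Toy stand-in for a video frame: just the camera position (x, y).
Frame = Tuple[int, int]

# Hypothetical action vocabulary; real systems use richer control signals.
ACTIONS = {
    "left": (-1, 0),
    "right": (1, 0),
    "up": (0, 1),
    "down": (0, -1),
    "stay": (0, 0),
}


def predict_next_frame(history: List[Frame], action: str) -> Frame:
    """Stand-in for a learned model: the next frame depends on the
    previous frames AND on the user's action."""
    x, y = history[-1]
    dx, dy = ACTIONS[action]
    return (x + dx, y + dy)


def interactive_rollout(first_frame: Frame, actions: List[str]) -> List[Frame]:
    """Generate one frame per action, feeding each output back in as context."""
    history = [first_frame]
    for action in actions:
        history.append(predict_next_frame(history, action))
    return history


frames = interactive_rollout((0, 0), ["right", "right", "up"])
print(frames)  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

In this toy setup, the same starting frame produces a different sequence for every action list. A pure text-to-video generator has no such per-step input channel.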

The researchers also view 3D reconstruction and simulators as key building blocks. These provide a testable environment where physical rules can be strictly enforced. Plain video prediction, by comparison, only gives a visual guess at the future without guaranteeing physical consistency.

World models can represent the world implicitly through learned internal dynamics or explicitly through 3D simulators. | Image: OpenDCAI

Five modules make up a single pipeline

The OpenWorldLib software project packages these capabilities in a modular setup. An operator module converts all kinds of inputs (text, images, sensor data) into a standardized format. The synthesis module generates images, videos, audio, and control commands. The reasoning module handles spatial, visual, and acoustic context. A representation module builds 3D reconstructions and simulation environments. And the memory module stores interaction sequences so the system stays consistent across multiple steps.

The OpenWorldLib pipeline processes multimodal input through five modules and stores interaction histories in a dedicated memory module. | Image: OpenDCAI

A top-level pipeline orchestrates all the modules and exposes a standardized interface. That way, researchers can compare different models and methods in the same framework instead of spinning up custom infrastructure every time.
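To make the modular layout concrete, here is a rough sketch of how five modules could sit behind one orchestrating pipeline. It is an illustration only and does not reproduce the actual OpenWorldLib API. Every class name, method name, data format, and the order in which the modules are called is an assumption made for this example.

```python
from typing import Any, Dict, List


class Operator:
    """Converts raw inputs into one standardized dict (hypothetical format)."""
    def run(self, raw: Dict[str, Any]) -> Dict[str, Any]:
        return {"modalities": sorted(raw), "data": dict(raw)}


class Reasoning:
    """Derives context from the current observation and the stored history."""
    def run(self, obs: Dict[str, Any], history: List[dict]) -> Dict[str, str]:
        return {"summary": f"{len(obs['modalities'])} modalities, step {len(history)}"}


class Representation:
    """Placeholder for 3D reconstruction / simulation building."""
    def run(self, obs: Dict[str, Any]) -> Dict[str, str]:
        return {"scene": "scene from " + ", ".join(obs["modalities"])}


class Synthesis:
    """Placeholder for generating images, video, audio, or control commands."""
    def run(self, context: Dict[str, str], scene: Dict[str, str]) -> Dict[str, str]:
        return {"output": f"{scene['scene']} ({context['summary']})"}


class Memory:
    """Stores interaction records so later steps can see earlier ones."""
    def __init__(self) -> None:
        self.steps: List[dict] = []

    def store(self, record: dict) -> None:
        self.steps.append(record)


class Pipeline:
    """Top-level orchestrator exposing a single step() interface."""
    def __init__(self) -> None:
        self.operator = Operator()
        self.reasoning = Reasoning()
        self.representation = Representation()
        self.synthesis = Synthesis()
        self.memory = Memory()

    def step(self, raw: Dict[str, Any]) -> Dict[str, str]:
        obs = self.operator.run(raw)                         # normalize inputs
        context = self.reasoning.run(obs, self.memory.steps)  # use past steps
        scene = self.representation.run(obs)
        result = self.synthesis.run(context, scene)
        self.memory.store({"obs": obs, "result": result})   # keep history
        return result


pipeline = Pipeline()
print(pipeline.step({"text": "move forward", "image": "frame_0"}))
print(pipeline.step({"text": "turn left"}))
```

Because callers only touch `step()`, any one placeholder class could be swapped for a different model without changing the calling code. That is the kind of side-by-side comparison the article describes.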

Hunyuan-WorldPlay and Cosmos top early benchmarks

Running evaluations on Nvidia's A800 and H200 GPUs, the researchers compared existing models inside their framework. Hunyuan-WorldPlay scored the highest visual quality in interactive video generation for navigation scenes.

Nvidia's Cosmos came out on top in complex interactive scenarios where the model had to handle a wide range of user inputs. Older approaches like Matrix-Game-2 were faster but showed noticeable color drift in longer sequences.

Hunyuan-WorldPlay delivers the best visual quality in navigation video generation, while Cosmos leads in complex interactions. | Image: OpenDCAI

Models like VGGT and InfiniteVGGT showed clear weaknesses in 3D scene reconstruction. Significant camera movement led to geometric inconsistencies and blurry textures. Even so, the researchers consider 3D generation essential to the future of world models.

Models like VGGT and InfiniteVGGT still struggle with 3D reconstruction when the camera angle changes significantly. | Image: OpenDCAI

Today's chip designs may be holding world models back

The authors also take aim at current hardware, arguing that today's chips are fundamentally mismatched with what world models need. Modern processors are built to handle individual tokens, so even when a model needs to predict entire video frames, the data still gets crunched token by token internally. In the researchers' view, that's wildly inefficient for the kind of data-heavy perception a real-world model demands. They say new chip architectures are needed, and possibly a move away from the Transformer, which currently powers nearly every large AI model.

As a practical stopgap, the authors point to current vision-language models like Bagel, which handles both multimodal reasoning and image generation on the Qwen architecture. In their view, this shows that language models pre-trained on internet data can in principle deliver all the necessary capabilities—even if building a complete world model is still a long way off. OpenWorldLib is available as an open-source project on GitHub.
