TLDR AI · June 18, 2026, 09:00 · about 14 min read

Announcing MolmoMotion, a language-guided 3D motion forecasting model

#3D Motion Forecasting #Robotics Planning #Generative AI #AllenAI

TL;DR

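AllenAI has released MolmoMotion, a model that forecasts 3D motion from language instructions, together with the large-scale MolmoMotion-1M dataset and the PointMotionBench benchmark, and has shown its potential for robotics planning and video generation.

AI deep analysis · June 18, 2026, 17:06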

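Key points

Language-guided 3D motion forecasting

Given a text instruction (for example, "throw the ball"), the new MolmoMotion model predicts the future 3D trajectories of points on an object.

Large-scale dataset and benchmark released

The release includes MolmoMotion-1M, a training dataset drawn from 1.16M videos, and PointMotionBench, a human-validated evaluation benchmark of 2.7K clips. Both are openly available to the research community.

Applications to robotics planning and video generation

Experiments show strong results on downstream tasks, including robot motion planning and video generation guided by predicted 3D trajectories.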

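Language-guided forecasting

MolmoMotion takes a video frame, 3D points, and a natural-language instruction such as "Move and rotate the wooden bowl with fruit" as input, and predicts how the object will move over the next few seconds.

Outperforms existing methods

The model predicts point trajectories in 3D space substantially more accurately than existing forecasting methods.

Efficient motion representation and architecture

MolmoMotion avoids the cost of rendering full video by representing motion as class-agnostic, view-stable, object-attached 3D points. It uses Molmo 2 as its backbone to connect language instructions to points in the image.

Two forecasting variants

An autoregressive variant (AR) favors smooth rollouts and suits well-defined future paths. A flow-matching variant (FM), which transforms noise into motion in continuous space, suits uncertain situations with several plausible futures.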

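Impact analysis

This release goes beyond motion prediction alone: it points to an approach that combines language understanding with physical reasoning, and it could mark a turning point for autonomous robot planning and immersive content creation. Because the model and dataset are released openly, access to this research should broaden and real-world adoption is expected to speed up.

Editorial comment

Predicting physical motion from language instructions is a notable advance that bears directly on a bottleneck in robotics. With a large-scale dataset released openly, we expect many implementations to appear over the next few months.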

MolmoMotion: Under the hood
Introducing MolmoMotion-1M and PointMotionBench
Experiments and performance
3D motion forecasting
Downstream evaluation: robotics planning
Downstream evaluation: video generation
Limitations and what's next

🧠 Models: https://huggingface.co/collections/allenai/molmomotion | 📄 Tech Report: https://allenai.org/papers/molmomotion | 📊 Data: https://huggingface.co/datasets/allenai/molmo-motion-1m | 💻 Code: https://github.com/allenai/molmo-motion.git | 🌐 Project Page: https://molmomotion.github.io/

Machines have become remarkably good at perceiving motion. Given a video, modern models can track how objects and points move through a scene with exceptionally high confidence. But perception is inherently retrospective: it explains motion that has already happened. Many of the systems and applications we want to build need to *look forward* instead. A robot reaching for a cup has to anticipate how the cup will move before it touches it. A video generator has to know what realistic motion comes next if it's going to produce physically plausible frames.

Predicting motion is harder than observing it, but it's also far more useful in many scenarios.

This idea was the motivation behind MolmoMotion, a new motion forecasting model we're releasing today. Given a video frame, 3D points marked on an object, and written instructions describing the intended action (e.g., “Move and rotate the wooden bowl with fruit on the table”), MolmoMotion predicts where those points will move over the next few seconds in 3D space—achieving substantially stronger performance than existing forecasting methods.

*Given an RGB observation, a set of query points on an object, and an action description, MolmoMotion predicts the object's future 3D point trajectory. These predicted trajectories can then drive downstream applications such as robotics planning and trajectory-conditioned video generation.*

Alongside the model, we're publishing MolmoMotion-1M, to our knowledge the largest collection of 3D point trajectories paired with action descriptions, drawn from 1.16M videos. We're also releasing PointMotionBench, a human-validated benchmark of 2.7K video clips designed to measure object-centric 3D motion forecasting accuracy.

We find that motion forecasters like MolmoMotion can be useful across a range of downstream tasks, from robot planning to controllable video generation. We're releasing the model weights, the MolmoMotion-1M dataset, and our PointMotionBench benchmark openly for the community to study, improve, and customize.

MolmoMotion: Under the hood

MolmoMotion represents motion in a deliberate, highly efficient way: as object-attached 3D points in world space, which capture motion without the cost of rendering full video. We chose it because we needed a general motion representation with three properties:

Class-agnostic: not tied to templates for human bodies, hands, rigid objects, or any other fixed category.

View-stable: the same physical motion should be represented consistently across cameras and viewpoints.

Directly usable by downstream systems that need to reason about physical motion.

Among the representations we considered, it was the only one that satisfied all three. A sparse set of surface points can describe rigid, articulated, and (within limits) deformable motion without assuming the type of object being moved. Because the points live in a shared world frame, their trajectories remain stable across camera motion and viewpoint change. And because they're compact explicit trajectories in 3D space, they can be passed directly to systems such as robot policies or video generation models.
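To make the view-stability property concrete, here is a minimal sketch, not taken from the MolmoMotion code. It assumes a simplified camera pose (a yaw rotation about the z-axis plus a translation), and the function names are hypothetical. The same physical point looks different in two camera frames, but mapping either observation back into the shared world frame recovers identical coordinates.

```python
import math

def camera_to_world(point_cam, yaw, translation):
    """Map a camera-frame point to the world frame: p_world = R(yaw) * p_cam + t."""
    x, y, z = point_cam
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return (c * x - s * y + tx, s * x + c * y + ty, z + tz)

def world_to_camera(point_world, yaw, translation):
    """Inverse mapping: p_cam = R(yaw)^T * (p_world - t)."""
    x, y, z = (p - t for p, t in zip(point_world, translation))
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * y, -s * x + c * y, z)

# One physical point on an object, expressed in the shared world frame.
world_point = (1.0, 2.0, 0.5)

# Two hypothetical camera poses: (yaw in radians, translation).
cam_a = (0.0, (0.0, 0.0, 0.0))
cam_b = (math.pi / 2, (3.0, -1.0, 0.2))

# Each camera sees the point at different camera-frame coordinates...
seen_a = world_to_camera(world_point, *cam_a)
seen_b = world_to_camera(world_point, *cam_b)

# ...but both observations map back to the same world-frame point.
back_a = camera_to_world(seen_a, *cam_a)
back_b = camera_to_world(seen_b, *cam_b)
```

A trajectory stored in world coordinates is therefore unaffected by which camera recorded it, which is the property the representation relies on.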

To forecast those trajectories, MolmoMotion uses Molmo 2 as its backbone, allowing it to connect language instructions to objects and points in an image. Given a short video history, an action description, and a set of query points with their initial 3D positions, the model first identifies the object being referred to, the query points, and the motion the instruction describes. It then predicts the future 3D trajectory of each point.

We train two variants of MolmoMotion:

The autoregressive variant (MolmoMotion-AR) predicts future coordinates step by step. It represents 3D coordinates as structured text, following the coordinate-style prediction used by VLMs, and writes out the future trajectory in temporal order. Because each new coordinate is conditioned on the trajectory already generated, this encourages smooth rollouts and gives the strongest accuracy when the future path is well-defined.

The flow-matching variant (MolmoMotion-FM) predicts trajectories in continuous 3D space by transforming noise into motion, which makes it better suited for representing uncertainty when an instruction admits multiple plausible futures.
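The idea behind the autoregressive variant, writing 3D coordinates out as structured text, can be illustrated with a small sketch. The coordinate range, the number of bins, and the `<x,y,z>` text format below are all assumptions made for illustration; the source does not specify MolmoMotion-AR's actual tokenization.

```python
def encode_point(point, lo=-1.0, hi=1.0, bins=1000):
    """Quantize one (x, y, z) point into integer bins and render it as text.

    The range [lo, hi], the bin count, and the text format are hypothetical.
    """
    tokens = []
    for value in point:
        clipped = min(max(value, lo), hi)
        index = round((clipped - lo) / (hi - lo) * (bins - 1))
        tokens.append(str(index))
    return "<" + ",".join(tokens) + ">"

def decode_point(text, lo=-1.0, hi=1.0, bins=1000):
    """Invert encode_point, up to quantization error."""
    indices = [int(tok) for tok in text.strip("<>").split(",")]
    return tuple(lo + i / (bins - 1) * (hi - lo) for i in indices)

def encode_trajectory(points, **kw):
    """Write a trajectory out in temporal order, one text group per timestep."""
    return " ".join(encode_point(p, **kw) for p in points)

def decode_trajectory(text, **kw):
    return [decode_point(tok, **kw) for tok in text.split()]

# A made-up two-step future trajectory for a single query point.
future = [(0.10, -0.25, 0.50), (0.12, -0.20, 0.55)]
text = encode_trajectory(future)
recovered = decode_trajectory(text)
```

Quantization bounds the round-trip error by half a bin width, which is the price of expressing continuous coordinates as discrete text.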

*The MolmoMotion architecture. The shared input to the Molmo 2 backbone consists of image tokens of RGB observations, text tokens of action description, and 2D query point feature tokens sampled from the Molmo 2 vision encoder. MolmoMotion-AR encodes the initial 3D query coordinates and decodes future trajectories as quantized coordinate text, while MolmoMotion-FM represents them directly in continuous 3D coordinate space.*
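For intuition about the flow-matching variant, the sketch below shows the generic sampling loop: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1 with Euler steps. It is a toy under stated assumptions. `toy_velocity` is a hypothetical stand-in for a trained network, and uses the closed-form velocity of a straight-line path toward one known target. It is not MolmoMotion-FM's model.

```python
import random

def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network (hypothetical).

    For the straight path x_t = (1 - t) * x0 + t * x1, the velocity at x_t
    is (x1 - x_t) / (1 - t). A trained model would predict this from the
    image, text, and query-point context instead of being told the target.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample_trajectory(target, steps=10, seed=0):
    """Transform noise into a flattened trajectory by Euler integration."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k / steps  # t stays strictly below 1, so the division is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Flattened toy trajectory: two future timesteps of (x, y, z).
target = [0.1, 0.0, 0.5, 0.2, 0.0, 0.5]
sample = sample_trajectory(target)
```

With a real learned field, different noise seeds can end at different plausible trajectories, which is how this family of models represents uncertainty.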

Introducing MolmoMotion-1M and PointMotionBench

To train MolmoMotion, we needed data that didn’t yet exist: large-scale videos with 3D point trajectories grounded to specific objects and paired with action descriptions. Existing 3D-track datasets were small and domain-limited, and while internet videos have all the scale and diversity we wanted for a forecaster like MolmoMotion, they didn’t include 3D annotations. So we built an automatic pipeline that extracts object-grounded 3D trajectories from unconstrained video.

Given an input video and its action description, our annotation pipeline produces object-grounded 3D point trajectories in metric world coordinates. (The figure below shows each stage.) The challenging part is that raw tracks from unconstrained video are noisy – with depth and tracking errors that leave points jittering and drifting – and that objects often stay still for much of a video. To make the data more trustworthy, we filter out points that don't move coherently with the rest of the object, smooth the remaining trajectories, and segment each clip to the window where the object actually moves.
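The three cleanup steps (filter, smooth, segment) might look roughly like the sketch below. Everything here is an assumption made for illustration: the thresholds, the function names, and in particular the coherence test, which compares each point's net displacement to the median. That test only catches translation-like drift, so a production pipeline would likely use a stronger object-level consistency check.

```python
from math import dist
from statistics import median

def displacement(track):
    """Net displacement of one point track (last position minus first)."""
    return tuple(b - a for a, b in zip(track[0], track[-1]))

def filter_incoherent(tracks, max_dev=0.05):
    """Drop tracks whose net displacement is far from the median (hypothetical test)."""
    disps = [displacement(t) for t in tracks]
    med = tuple(median(d[i] for d in disps) for i in range(3))
    return [t for t, d in zip(tracks, disps) if dist(d, med) <= max_dev]

def smooth(track, radius=1):
    """Moving-average smoothing with a window of up to 2 * radius + 1 frames."""
    out = []
    for i in range(len(track)):
        window = track[max(0, i - radius): i + radius + 1]
        out.append(tuple(sum(p[k] for p in window) / len(window) for k in range(3)))
    return out

def motion_window(tracks, min_step=0.01):
    """Return (start, end) frame indices spanning the frames with real motion."""
    moving = []
    for f in range(1, len(tracks[0])):
        step = sum(dist(t[f - 1], t[f]) for t in tracks) / len(tracks)
        if step > min_step:
            moving.append(f)
    return (moving[0] - 1, moving[-1]) if moving else None

# Toy data: three points on an object that slides along x, plus one drifting track.
offsets = [0.0, 0.0, 0.0, 0.1, 0.2, 0.3, 0.3, 0.3]
bases = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)]
tracks = [[(bx + dx, by, bz) for dx in offsets] for bx, by, bz in bases]
tracks.append([(0.0, 0.0, 0.05 * f) for f in range(len(offsets))])  # outlier

kept = filter_incoherent(tracks)
smoothed = [smooth(t) for t in kept]
window = motion_window(smoothed)
```

Note that smoothing widens the detected motion window slightly, because the moving average spreads motion into neighboring frames.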

Running our pipeline at scale yielded MolmoMotion-1M—to our knowledge the largest corpus of action-described, object-grounded 3D point trajectories assembled to date, spanning 736 motion types and 5.6K distinct objects.

*An overview of our data annotation pipeline. Given a video of an action event and its description, we first ground the moving object and sample query points on it. We then track dense 2D points on the object, lift these tracks into a shared metric 3D frame, and use object-level spatial and temporal consistency priors to filter unreliable trajectories. Finally, we clip the video around intervals where the grounded object undergoes meaningful motion.*
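The source does not say how the lift from 2D tracks to a metric 3D frame is implemented. One common approach, sketched here as an assumption, is pinhole unprojection with a per-frame metric depth value followed by a camera-to-world transform. The intrinsics, poses, and function names below are hypothetical.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole unprojection of pixel (u, v) at metric depth into the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def to_world(p_cam, rotation, translation):
    """Apply a camera-to-world rigid transform: R * p + t (rotation given as rows)."""
    return tuple(
        sum(r * c for r, c in zip(row, p_cam)) + t
        for row, t in zip(rotation, translation)
    )

def lift_track(track_2d, depths, intrinsics, poses):
    """Lift one 2D pixel track into a shared world frame, frame by frame."""
    lifted = []
    for (u, v), depth, (rotation, translation) in zip(track_2d, depths, poses):
        p_cam = unproject(u, v, depth, **intrinsics)
        lifted.append(to_world(p_cam, rotation, translation))
    return lifted

# Hypothetical camera: 500 px focal length, principal point at (320, 240).
intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

track_2d = [(320.0, 240.0), (820.0, 240.0)]
depths = [2.0, 2.0]
# The camera moves 1 m along z between the two frames.
poses = [(IDENTITY, (0.0, 0.0, 0.0)), (IDENTITY, (0.0, 0.0, 1.0))]

track_3d = lift_track(track_2d, depths, intrinsics, poses)
```

Errors in the estimated depth or pose pass straight through this transform, which is why the pipeline needs the filtering step described in the caption.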

*Top instruction: "Move and rotate wooden bowl with fruits on the table." Bottom instruction: "Roll a lint roller on a blue cloth."*

*Top instruction: "A silver car follows the road and slowly turns to the right." Bottom instruction: "A flamingo dips its beak into the water while walking to the right."*

To evaluate MolmoMotion’s forecasting performance, we also built PointMotionBench, a human-validated benchmark of held-out 3D trajectories. It covers 2.7K clips spanning 111 object categories and 61 motion types, including indoor manipulation, egocentric hand-object interaction, and outdoor dynamic scenes. For each clip, models are given the current observation, object query points, and an action description, and are evaluated on how accurately their predicted 3D point trajectories match the object’s actual future motion. This gives us a direct quantitative test of 3D motion forecasting rather than relying on whether a generated point track merely looks plausible.
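The text above does not name the exact metric used on PointMotionBench. As an illustration of what "how accurately predicted trajectories match the actual motion" can mean, the sketch below computes two standard trajectory-forecasting measures, average and final displacement error. Treat the choice of metric as an assumption, not as the benchmark's definition.

```python
from math import dist

def average_displacement_error(pred, truth):
    """Mean Euclidean distance over all query points and future timesteps."""
    total, count = 0.0, 0
    for p_track, t_track in zip(pred, truth):
        for p, t in zip(p_track, t_track):
            total += dist(p, t)
            count += 1
    return total / count

def final_displacement_error(pred, truth):
    """Mean Euclidean distance at the last predicted timestep only."""
    return sum(dist(p[-1], t[-1]) for p, t in zip(pred, truth)) / len(pred)

# One query point, two future timesteps (toy values in meters).
truth = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]]

ade = average_displacement_error(pred, truth)  # (0 + 1) / 2 = 0.5
fde = final_displacement_error(pred, truth)    # 1.0
```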

Experiments and performance

We evaluate MolmoMotion in three ways. First, we test whether it forecasts future 3D motion more accurately than existing methods. Second, we test whether what it has learned about motion helps a robot carry out manipulation tasks. Third, we test whether that same knowledge can help guide the motion in generated video.

3D motion forecasting

On PointMotionBench, MolmoMotion outperforms all existing 3D motion forecasting methods we tested – including pixel-space video generators, parametric 3D methods, and a simple constant-velocity baseline – across a range of objects, scenes, and actions.
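For reference, a constant-velocity baseline simply extrapolates the most recently observed motion. The sketch below is a generic version of that idea, assuming the velocity is estimated from the last two observed positions. The benchmark's exact baseline setup is not described in the text.

```python
def constant_velocity_forecast(history, horizon):
    """Extrapolate a point's future positions from its last observed velocity.

    history: list of at least two (x, y, z) positions, oldest first.
    horizon: number of future timesteps to predict.
    """
    prev, last = history[-2], history[-1]
    velocity = tuple(b - a for a, b in zip(prev, last))
    return [
        tuple(p + k * v for p, v in zip(last, velocity))
        for k in range(1, horizon + 1)
    ]

# A point that moved 1 unit along x in the last observed step.
history = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
forecast = constant_velocity_forecast(history, horizon=3)
# forecast == [(2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
```

Such a baseline ignores the instruction entirely, so it cannot anticipate a turn, a stop, or the start of motion from rest.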

MolmoMotion can forecast many kinds of object and scene motions, like how a lint roller will move back and forth on cloth, how a bowl will slide and rotate on a table, how a flamingo will walk to the right while dipping its beak in a body of water, or how a car will follow a road as it turns. In each case, the predicted path follows the instruction MolmoMotion was given and stays extremely close to the ground truth motion in our benchmark.

Downstream evaluation: robotics planning

What MolmoMotion learns about motion should carry over from one setting to another—lifting a cup with a human hand and lifting it with a robot gripper are very different actions, but the cup itself follows a similar path through 3D space. That makes MolmoMotion a natural fit for robotics, where a robot has to plan how objects should move before moving them.

After fine-tuning on DROID, a large open dataset of real-world robot manipulation videos, we find that MolmoMotion can predict sensible object paths across different objects, camera viewpoints, scenes, and tasks for a wide range of robot planning scenarios.

*Top instruction: "Take cloth out of container." Bottom instruction: "Move lid on pot."*

In simulation, a control policy built on MolmoMotion succeeds on 76.3% of pick-and-place tasks, versus 56.0% for the same policy built on Molmo 2. It also learns faster, reaching 51% after 10K training steps, where the Molmo 2 version tops out at 19%. On real robots (after fine-tuning), MolmoMotion needs only about 2K training steps to reach the test L2 error that the Molmo 2 baseline reaches after 12K steps.
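To put these figures in perspective, the short calculation below derives the gaps from the numbers quoted in the paragraph. The step counts are approximate in the source ("about 2K"), so the step ratio is a rough figure.

```python
# Simulation success rates (percent) on pick-and-place, as quoted above.
success = {"MolmoMotion": 76.3, "Molmo 2": 56.0}

# Absolute gain in percentage points, and gain relative to the baseline.
absolute_gain = success["MolmoMotion"] - success["Molmo 2"]  # about 20.3 points
relative_gain = absolute_gain / success["Molmo 2"]           # about 0.36

# Real-robot fine-tuning: steps needed to reach the same test L2 error.
baseline_steps = 12_000
molmomotion_steps = 2_000  # "about 2K" in the source, so approximate
step_ratio = baseline_steps / molmomotion_steps              # roughly 6x fewer steps
```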

Downstream evaluation: video generation

*Instruction: “A flamingo dips its beak into the water while walking to the right.” From top to bottom: DaS + MolmoMotion, CogVideoX-5B, and WAN-14B.*

*Instruction: "Take the round light brown plate from the table." From top to bottom: DaS + MolmoMotion, CogVideoX-5B, and WAN-14B.*

MolmoMotion's predicted paths can also steer video generation. Instead of letting an image-to-video model guess motion from a text instruction alone, you can feed in MolmoMotion's predictions. The result is generated video that follows requested actions more closely, especially for small and precise movements a prompt can only describe vaguely.

The metrics back this up. Used to guide a video generator, MolmoMotion improves motion quality over the base model on all five motion-related metrics we measure, and beats a much larger image-to-video model on four of the five.

Limitations and what's next

MolmoMotion is a capable model, but there are still some limitations to note. It uses eight query points per object during training—enough to forecast a useful trajectory but not enough to densely represent surface geometry. This limits the model's handling of complex deformable motion.

We think forecasting – anticipating how objects in the world will move *before* they move – is as fundamental to machine intelligence as perceiving what's already there. MolmoMotion is a step toward this—3D motion prediction that generalizes across object categories without per-category templates, learned from ordinary video, and the most accurate 3D motion forecaster we've measured on PointMotionBench. We expect many applications will follow in robotics, video, and beyond.

We encourage you to try MolmoMotion by downloading the weights, inspecting the training data, and evaluating our methods against PointMotionBench.
