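Hugging Face Blog · June 18, 2026, 00:26 · about 14 min read

MolmoMotion: Language-Guided 3D Motion Forecasting

#3D Motion Forecasting #Generative AI #Multimodal #Hugging Face

TL;DR

A post on the Hugging Face Blog announces MolmoMotion, a language-guided 3D motion forecasting model that estimates the future motion of objects and characters from text input.

AI deep analysis · June 17, 2026, 16:04

Depth: 40%

Key points

Language-based control of 3D motion

The new model takes natural-language text as input and predicts and generates the future motion of objects and characters.

Announced on the Hugging Face Blog

The technology was announced in a post on the blog of Hugging Face, a platform well known in the open-source community.

Progress in multimodal and generative AI

By combining text understanding with motion prediction in 3D space, the work shows a further expansion of what AI systems can represent.

Impact analysis

The announcement integrates text-based instructions with spatial motion prediction, a technical advance that could make content generation in games and simulation considerably more efficient. In particular, intuitive 3D motion control driven by language understanding is expected to lower development costs and widen creative possibilities.

Editorial comment

Technology that predicts 3D motion directly from text could sharply raise the practical value of AI in creative fields. Future accuracy improvements and open-source releases are worth watching.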



MolmoMotion: Under the hood
Introducing MolmoMotion-1M and PointMotionBench
Experiments and performance
3D motion forecasting
Downstream evaluation: robotics planning
Downstream evaluation: video generation
Limitations and what's next

🧠 Models: https://huggingface.co/collections/allenai/molmomotion | 📄 Tech Report: https://allenai.org/papers/molmomotion | 📊 Data: https://huggingface.co/datasets/allenai/molmo-motion-1m | 💻 Code: https://github.com/allenai/molmo-motion.git | 🌐 Project Page: https://molmomotion.github.io/

Machines have become remarkably good at perceiving motion. Given a video, modern models can track how objects and points move through a scene with exceptionally high confidence. But perception is inherently retrospective: it explains motion that has already happened. Many of the systems and applications we want to build need to *look forward* instead. A robot reaching for a cup has to anticipate how the cup will move before it touches it. A video generator has to know what realistic motion comes next if it's going to produce physically plausible frames.

Predicting motion is harder than observing it, but it's also far more useful in many scenarios.

This idea was the motivation behind MolmoMotion, a new motion forecasting model we're releasing today. Given a video frame, 3D points marked on an object, and written instructions describing the intended action (e.g., “Move and rotate the wooden bowl with fruit on the table”), MolmoMotion predicts where those points will move over the next few seconds in 3D space—achieving substantially stronger performance than existing forecasting methods.


*Given an RGB observation, a set of query points on an object, and an action description, MolmoMotion predicts the object's future 3D point trajectory. These predicted trajectories can then drive downstream applications such as robotics planning and trajectory-conditioned video generation.*

Alongside the model, we're publishing MolmoMotion-1M, to our knowledge the largest collection of 3D point trajectories paired with action descriptions, drawn from 1.16M videos. We're also releasing PointMotionBench, a human-validated benchmark of 2.7K video clips designed to measure object-centric 3D motion forecasting accuracy.

We find that motion forecasters like MolmoMotion can be useful across a range of downstream tasks, from robot planning to controllable video generation. We're releasing the model weights, the MolmoMotion-1M dataset, and our PointMotionBench benchmark openly for the community to study, improve, and customize.

MolmoMotion: Under the hood

MolmoMotion represents motion in a deliberate, highly efficient way: as object-attached 3D points in world space, which capture motion without the cost of rendering full video. We chose it because we needed a general motion representation with three properties:

Class-agnostic: not tied to templates for human bodies, hands, rigid objects, or any other fixed category.

View-stable: the same physical motion should be represented consistently across cameras and viewpoints.

Directly usable: downstream systems that need to reason about physical motion should be able to consume it as-is.

Among the representations we considered, it was the only one that satisfied all three. A sparse set of surface points can describe rigid, articulated, and (within limits) deformable motion without assuming the type of object being moved. Because the points live in a shared world frame, their trajectories remain stable across camera motion and viewpoint change. And because they're compact explicit trajectories in 3D space, they can be passed directly to systems such as robot policies or video generation models.
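To make the view-stability property concrete, here is a minimal sketch. It assumes each camera is described only by a camera-to-world rotation and translation; the function names and the toy trajectory are illustrative and are not taken from the MolmoMotion code. The same physical motion has different coordinates in each camera's frame, but mapping either observation back to the shared world frame recovers the same trajectory.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def mat_vec(R, v):
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))

def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]

def world_to_camera(p, R_cw, t_cw):
    """Invert a camera-to-world pose: p_cam = R^T (p_world - t)."""
    d = tuple(p[i] - t_cw[i] for i in range(3))
    return mat_vec(transpose(R_cw), d)

def camera_to_world(p, R_cw, t_cw):
    """Apply a camera-to-world pose: p_world = R p_cam + t."""
    r = mat_vec(R_cw, p)
    return tuple(r[i] + t_cw[i] for i in range(3))

# One object point sliding along x over four time steps (world frame, metres).
world_track = [(0.1 * t, 0.0, 0.5) for t in range(4)]

# Two hypothetical cameras with different poses.
cam_a = (rot_z(0.0), (0.0, 0.0, 0.0))
cam_b = (rot_z(math.pi / 2), (1.0, -2.0, 0.3))

# The same motion looks different in each camera's own frame...
track_in_a = [world_to_camera(p, *cam_a) for p in world_track]
track_in_b = [world_to_camera(p, *cam_b) for p in world_track]

# ...but both map back to one shared world-frame trajectory.
recovered_a = [camera_to_world(p, *cam_a) for p in track_in_a]
recovered_b = [camera_to_world(p, *cam_b) for p in track_in_b]
```

Because a downstream consumer only ever sees the world-frame version, it does not need to know which camera produced the observation.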

To forecast those trajectories, MolmoMotion uses Molmo 2 as its backbone, allowing it to connect language instructions to objects and points in an image. Given a short video history, an action description, and a set of query points with their initial 3D positions, the model first identifies the object being referred to, the query points, and the motion the instruction describes. It then predicts the future 3D trajectory of each point.

We train two variants of MolmoMotion:

The autoregressive variant (MolmoMotion-AR) predicts future coordinates step by step. It represents 3D coordinates as structured text, following the coordinate-style prediction used by VLMs, and writes out the future trajectory in temporal order. Because each new coordinate is conditioned on the trajectory already generated, this encourages smooth rollouts and gives the strongest accuracy when the future path is well-defined.

The flow-matching variant (MolmoMotion-FM) predicts trajectories in continuous 3D space by transforming noise into motion, which makes it better suited for representing uncertainty when an instruction admits multiple plausible futures.
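The idea behind the autoregressive variant, writing 3D coordinates as structured text, can be sketched as follows. The actual token format, bin count, and coordinate range used by MolmoMotion-AR are not stated here, so `NUM_BINS`, `RANGE_M`, and the `<x y z>` serialization below are hypothetical choices made only for illustration.

```python
import re

NUM_BINS = 1000          # hypothetical quantization resolution
RANGE_M = (-2.0, 2.0)    # hypothetical metric range covered by the bins

def quantize(value):
    """Map a metric coordinate to an integer bin index, clipping to the range."""
    lo, hi = RANGE_M
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * (NUM_BINS - 1))

def dequantize(bin_index):
    """Map a bin index back to the metric value it represents."""
    lo, hi = RANGE_M
    return lo + bin_index / (NUM_BINS - 1) * (hi - lo)

def encode_trajectory(points):
    """Serialize [(x, y, z), ...] in temporal order as coordinate text."""
    return " ".join(
        "<{} {} {}>".format(*(quantize(c) for c in p)) for p in points
    )

def decode_trajectory(text):
    """Parse coordinate text back into a list of metric (x, y, z) tuples."""
    triples = re.findall(r"<(\d+) (\d+) (\d+)>", text)
    return [tuple(dequantize(int(n)) for n in triple) for triple in triples]

# A short future trajectory for one query point (metres).
traj = [(0.10, -0.25, 0.80), (0.12, -0.25, 0.81), (0.15, -0.24, 0.83)]
text = encode_trajectory(traj)
decoded = decode_trajectory(text)
```

In an autoregressive model the text would be emitted one token at a time, each conditioned on the tokens before it; this sketch covers only the representation. The round trip is lossy by at most half a bin width, which is the price of a discrete vocabulary.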

*The MolmoMotion architecture. The shared input to the Molmo 2 backbone consists of image tokens of RGB observations, text tokens of action description, and 2D query point feature tokens sampled from the Molmo 2 vision encoder. MolmoMotion-AR encodes the initial 3D query coordinates and decodes future trajectories as quantized coordinate text, while MolmoMotion-FM represents them directly in continuous 3D coordinate space.*
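The sampling side of the flow-matching variant, transforming noise into motion, can be sketched under a common assumption: the straight-line path x_t = (1 - t) x_0 + t x_1 between a noise sample x_0 and a trajectory x_1. In a trained model the velocity field would be a network conditioned on the image, text, and query-point tokens. Here it is replaced by a closed-form oracle that already knows the target, purely so the integration loop can run. All names are illustrative.

```python
import random

def oracle_velocity(x, t, target):
    """Stand-in for a learned velocity network v(x, t | conditioning).

    For the straight-line path toward a known target, the conditional
    velocity at state x and time t is (target - x) / (1 - t).
    """
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]

def sample_trajectory(target, num_steps=10, seed=0):
    """Integrate the velocity field from Gaussian noise (t=0) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]   # start from pure noise
    for k in range(num_steps):
        t = k / num_steps                       # t never reaches 1, so no division by zero
        v = oracle_velocity(x, t, target)
        x = [xi + vi / num_steps for xi, vi in zip(x, v)]
    return x

# A flattened future trajectory: three (x, y, z) steps for one query point.
target = [0.10, 0.00, 0.50, 0.15, 0.02, 0.50, 0.20, 0.05, 0.51]
sample = sample_trajectory(target)
```

With the oracle, every noise draw ends at the same trajectory. With a learned field, different noise draws can end at different plausible futures, which is why this formulation suits ambiguous instructions.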

Introducing MolmoMotion-1M and PointMotionBench

To train MolmoMotion, we needed data that didn’t yet exist: large-scale videos with 3D point trajectories grounded to specific objects and paired with action descriptions. Existing 3D-track datasets were small and domain-limited, and while internet videos have all the scale and diversity we wanted for a forecaster like MolmoMotion, they didn’t include 3D annotations. So we built an automatic pipeline that extracts object-grounded 3D trajectories from unconstrained video.

Given an input video and its action description, our annotation pipeline produces object-grounded 3D point trajectories in metric world coordinates. (The figure below shows each stage.) There are two challenges: raw tracks from unconstrained video are noisy, with depth and tracking errors that leave points jittering and drifting, and objects often stay still for much of a video. To make the data more trustworthy, we filter out points that don't move coherently with the rest of the object, smooth the remaining trajectories, and segment each clip to the window where the object actually moves.
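The three cleanup steps (coherence filtering, smoothing, and segmenting to the motion window) can be sketched as below. This is a simplified illustration, not the released pipeline: the rules and thresholds (`tol`, `radius`, `min_step`) are assumptions, and the coherence test compares total displacements, which only suits roughly translational motion.

```python
import math
from statistics import median

def displacement(track):
    """Vector from the first to the last position of one point track."""
    return tuple(b - a for a, b in zip(track[0], track[-1]))

def filter_incoherent(tracks, tol=0.05):
    """Keep tracks whose total displacement is near the per-axis median displacement."""
    disps = [displacement(tr) for tr in tracks]
    med = tuple(median(d[i] for d in disps) for i in range(3))
    return [tr for tr, d in zip(tracks, disps) if math.dist(d, med) <= tol]

def smooth(track, radius=1):
    """Centered moving average; the window shrinks at the clip boundaries."""
    out = []
    for t in range(len(track)):
        window = track[max(0, t - radius): t + radius + 1]
        out.append(tuple(sum(p[i] for p in window) / len(window) for i in range(3)))
    return out

def motion_window(tracks, min_step=0.01):
    """Return inclusive (start, end) frame indices spanning the moving interval, or None."""
    moving = []
    for t in range(1, len(tracks[0])):
        mean_step = sum(math.dist(tr[t], tr[t - 1]) for tr in tracks) / len(tracks)
        if mean_step > min_step:
            moving.append(t)
    if not moving:
        return None
    return moving[0] - 1, moving[-1]

# Toy clip: the object rests, slides 0.2 m along x, then rests again.
xs = [0.0, 0.0, 0.0, 0.05, 0.10, 0.15, 0.20, 0.20, 0.20]
good = [[(x, y, 0.5) for x in xs] for y in (0.0, 0.1, 0.2)]
drifting = [(x, 0.3, 0.5 + 0.3 * i / 8) for i, x in enumerate(xs)]  # depth drift

kept = filter_incoherent(good + [drifting])
cleaned = [smooth(tr) for tr in kept]
window = motion_window(cleaned)
```

On this toy clip the drifting point is dropped and the window trims the static frames at both ends. Real tracks would also need a test that tolerates rotation and articulation, which this sketch does not attempt.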

Running our pipeline at scale yielded MolmoMotion-1M—to our knowledge the largest corpus of action-described, object-grounded 3D point trajectories assembled to date, spanning 736 motion types and 5.6K distinct objects.


*An overview of our data annotation pipeline. Given a video of an action event and its description, we first ground the moving object and sample query points on it. We then track dense 2D points on the object, lift these tracks into a shared metric 3D frame, and use object-level spatial and temporal consistency priors to filter unreliable trajectories. Finally, we clip the video around intervals where the grounded object undergoes meaningful motion.*

*Top instruction: "Move and rotate wooden bowl with fruits on the table." Bottom instruction: "Roll a lint roller on a blue cloth."*

*Top instruction: "A silver car follows the road and slowly turns to the right." Bottom instruction: "A flamingo dips its beak into the water while walking to the right."*

To evaluate MolmoMotion’s forecasting performance, we also built PointMotionBench, a human-validated benchmark of held-out 3D trajectories. It covers 2.7K clips spanning 111 object categories and 61 motion types, including indoor manipulation, egocentric hand-object interaction, and outdoor dynamic scenes. For each clip, models are given the current observation, object query points, and an action description, and are evaluated on how accurately their predicted 3D point trajectories match the object’s actual future motion. This gives us a direct quantitative test of 3D motion forecasting rather than relying on whether a generated point track merely looks plausible.
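The benchmark's exact metric is not spelled out in this post. One common way to score how closely predicted 3D point trajectories match the true future motion is displacement error, sketched below as an assumption rather than as PointMotionBench's definition: the average Euclidean distance over all points and future steps, plus the error at the final step.

```python
import math

def average_displacement_error(pred, truth):
    """Mean Euclidean distance over every point and every future step.

    pred, truth: [num_points][num_steps] lists of (x, y, z) in the same frame.
    """
    dists = [
        math.dist(p, g)
        for ptrack, gtrack in zip(pred, truth)
        for p, g in zip(ptrack, gtrack)
    ]
    return sum(dists) / len(dists)

def final_displacement_error(pred, truth):
    """Mean Euclidean distance at the last forecast step only."""
    finals = [math.dist(ptrack[-1], gtrack[-1]) for ptrack, gtrack in zip(pred, truth)]
    return sum(finals) / len(finals)

# One query point, two future steps; the forecast is 0.5 m off in z at the end.
truth = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]]

ade = average_displacement_error(pred, truth)
fde = final_displacement_error(pred, truth)
```

Because both quantities are measured in metric 3D space against held-out ground truth, a forecast that merely looks plausible but ends in the wrong place is penalized.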

Experiments and performance

We evaluate MolmoMotion in three ways. First, we test whether it forecasts future 3D motion more accurately than existing methods. Second, we test whether what it has learned about motion helps a robot carry out manipulation tasks. Third, we test whether that same knowledge can help guide the motion in generated video.

3D motion forecasting

On PointMotionBench, MolmoMotion outperforms all existing 3D motion forecasting methods we tested – including pixel-space video generators, parametric 3D methods, and a simple constant-velocity baseline – across a range of objects, scenes, and actions.
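For reference, a constant-velocity baseline of the kind mentioned above can be written in a few lines. This is a generic sketch: it assumes at least two observed positions per point and takes the velocity from the last two frames. How the baseline in the evaluation estimates velocity is not stated here.

```python
def constant_velocity_forecast(history, horizon):
    """Extrapolate one point along its last observed per-frame velocity.

    history: list of at least two (x, y, z) positions, oldest first.
    horizon: number of future steps to predict.
    """
    last, prev = history[-1], history[-2]
    velocity = tuple(a - b for a, b in zip(last, prev))
    return [
        tuple(l + v * k for l, v in zip(last, velocity))
        for k in range(1, horizon + 1)
    ]

# A point moving 0.5 m per frame along x keeps doing so.
history = [(0.0, 0.0, 1.0), (0.5, 0.0, 1.0)]
forecast = constant_velocity_forecast(history, horizon=3)
```

Such a baseline ignores the instruction and the scene entirely, so it cannot anticipate a turn, a stop, or a rotation. That makes it a useful floor for comparison.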

MolmoMotion can forecast many kinds of object and scene motions, like how a lint roller will move back and forth on cloth, how a bowl will slide and rotate on a table, how a flamingo will walk to the right while dipping its beak in a body of water, or how a car will follow a road as it turns. In each case, the predicted path follows the instruction MolmoMotion was given and stays extremely close to the ground truth motion in our benchmark.

Downstream evaluation: robotics planning

What MolmoMotion learns about motion should carry over from one setting to another—lifting a cup with a human hand and lifting it with a robot gripper are very different actions, but the cup itself follows a similar path through 3D space. That makes MolmoMotion a natural fit for robotics, where a robot has to plan how objects should move before moving them.

After fine-tuning on DROID, a large open dataset of real-world robot manipulation videos, we find that MolmoMotion can predict sensible object paths across different objects, camera viewpoints, scenes, and tasks for a wide range of robot planning scenarios.


*Top instruction: “Take cloth out of container.” Bottom instruction: “Move lid on pot.”*

In simulation, a control policy built on MolmoMotion succeeds on 76.3% of pick-and-place tasks, versus 56.0% for the same policy built on Molmo 2. It also learns faster, reaching 51% after 10K training steps where the Molmo 2 version tops out at 19%. On real robots (after fine-tuning), MolmoMotion needs only about 2K training steps to reach the test L2 error that the Molmo 2 baseline achieves after 12K steps.
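The size of these gaps is easier to see with the arithmetic written out. The snippet uses only the figures quoted in the paragraph above; since "about 2K" is approximate, the step ratio is approximate too.

```python
# Simulation success rates on pick-and-place tasks (percent).
sim_success = {"MolmoMotion": 76.3, "Molmo 2": 56.0}

absolute_gain = sim_success["MolmoMotion"] - sim_success["Molmo 2"]  # percentage points
relative_gain = absolute_gain / sim_success["Molmo 2"]               # fraction of baseline

# Real-robot fine-tuning: steps needed to reach the same test L2 error.
step_ratio = 12_000 / 2_000

print(f"{absolute_gain:.1f} points, {relative_gain:.0%} relative, ~{step_ratio:.0f}x fewer steps")
```

So the simulation result is a gain of roughly 20 percentage points, about a third more successful tasks than the baseline, and the real-robot result corresponds to roughly a sixfold reduction in training steps.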

Downstream evaluation: video generation


*Instruction: “A flamingo dips its beak into the water while walking to the right.” From top to bottom: DaS + MolmoMotion, CogVideoX-5B, and WAN-14B.*


*Instruction: “Take the round light brown plate from the table.” From top to bottom: DaS + MolmoMotion, CogVideoX-5B, and WAN-14B.*

MolmoMotion's predicted paths can also steer video generation. Instead of letting an image-to-video model guess motion from a text instruction alone, you can feed in MolmoMotion's predictions. The result is generated video that follows requested actions more closely, especially for small and precise movements a prompt can only describe vaguely.

The metrics back this up. Used to guide a video generator, MolmoMotion improves motion quality over the base model on all five motion-related metrics we measure, and beats a much larger image-to-video model on four of the five.

Limitations and what's next

MolmoMotion is a capable model, but there are still some limitations to note. It uses eight query points per object during training—enough to forecast a useful trajectory but not enough to densely represent surface geometry. This limits the model's handling of complex deformable motion.

We think forecasting, that is, anticipating how objects in the world will move *before* they move, is as fundamental to machine intelligence as perceiving what's already there. MolmoMotion is a step toward this: a 3D motion predictor that generalizes across object categories without per-category templates, is learned from ordinary video, and is the most accurate 3D motion forecaster we've measured on PointMotionBench. We expect many applications will follow in robotics, video, and beyond.

We encourage you to try MolmoMotion by downloading the weights, inspecting the training data, and evaluating our methods against PointMotionBench.
