AI News Frontline (AIニュース最前線)
Berkeley AI Research · April 20, 2026, 18:00 · about a 3-minute read

Gradient-Based Planning for World Models at Long Horizons

#World Models #Gradient Optimization #Robotics Control #Reinforcement Learning #BAIR
TL;DR

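GRASP, proposed by BAIR, is a gradient-based planner that makes long-horizon planning with high-dimensional vision models practical through three techniques: lifting the trajectory into virtual states, adding stochasticity to the state iterates, and reshaping gradients.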

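AI deep analysis (April 21, 2026, 06:55): importance 4 on a 5-point scale. Depth (weight 40%): 4. Relevance (weight 30%): 4. Practicality (weight 20%): 3. Novelty (weight 10%): 4.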

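Key points

1. Overcoming the fragility of long-horizon planning

Identifies the failure modes of planning with modern world models (ill-conditioned optimization, bad local minima, and high-dimensional latent spaces) and analyzes their root causes.

2. Lifting the trajectory into virtual states

Lifts the trajectory into virtual states so that optimization is parallel across time, improving both computational efficiency and convergence stability.

3. Stochastic state iterates and gradient reshaping

Adds stochasticity directly to the state iterates for exploration, and avoids brittle "state-input" gradients through high-dimensional vision models while keeping the action signal clean.

4. Ill-conditioned computation graphs from long rollouts

Differentiating through a model applied to itself repeatedly makes the Jacobian's conditioning deteriorate exponentially in the horizon $T$, causing exploding or vanishing gradients.

5. Non-greedy behavior and a growing optimization space

Long-horizon plans need non-greedy behavior such as going around obstacles, and the dimension of the optimization space grows by a factor of $T$, which increases the number of local minima and traps.

6. Fragility of state optimization and adversarial examples

The state-data manifold usually has far lower dimension than the state space, so directly optimizing a state $s_t$ easily produces adversarial examples and makes the dynamics optimization "sticky" and deceptive.

7. GRASP: optimizing through action gradients

GRASP optimizes through the action Jacobian $D_a F_\theta$, which comes from a low-dimensional, densely trained action input, rather than the unreliable state Jacobian $D_s F_\theta$. This sidesteps the adversarial robustness problem and gives more stable planning.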


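Impact analysis

This work systematically addresses planning, which is essential for moving from pure prediction models to autonomous control systems, from the standpoint of gradient-based optimization. It makes long-horizon tasks in complex visual environments achievable with realistic compute and lowers the barrier to experiments in robotics and reinforcement learning. World model research is expected to shift its focus from prediction accuracy alone toward planning stability.

Editorial comment

As predictive models keep improving, the work deserves credit for clarifying how to optimize the "planning" stage that connects their outputs to control. With real-world robotics in view, this approach to balancing computational cost and stability could become a standard method.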


BallNav demo
Push-T demo

GRASP is a new gradient-based planner for learned dynamics (a “world model”) that makes long-horizon planning practical by (1) lifting the trajectory into virtual states so optimization is parallel across time, (2) adding stochasticity directly to the state iterates for exploration, and (3) reshaping gradients so actions get clean signals while we avoid brittle “state-input” gradients through high-dimensional vision models.

Large, learned world models are becoming increasingly capable. They can predict long sequences of future observations in high-dimensional visual spaces and generalize across tasks in ways that were difficult to imagine a few years ago. As these models scale, they start to look less like task-specific predictors and more like general-purpose simulators.

But having a powerful predictive model is not the same as being able to use it effectively for control/learning/planning. In practice, long-horizon planning with modern world models remains fragile: optimization becomes ill-conditioned, non-greedy structure creates bad local minima, and high-dimensional latent spaces introduce subtle failure modes.

In this blog post, I describe the problems that motivated this project and our approach to address them: why planning with modern world models can be surprisingly fragile, why long horizons are the real stress test, and what we changed to make gradient-based planning much more robust.

This blog post discusses work done with Mike Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar (* denotes equal advisorship), where we propose GRASP.

What is a world model?

These days, the term “world model” is quite overloaded, and depending on the context can either mean an explicit dynamics model or some implicit, reliable internal state that a generative model relies on (e.g. when an LLM generates chess moves, whether there is some internal representation of the board). We give our loose working definition below.

Suppose you take actions $a_t \in \mathcal{A}$ and observe states $s_t \in \mathcal{S}$ (images, latent vectors, proprioception). A world model is a learned model that, given the current state and a sequence of future actions, predicts what will happen next. Formally, it defines a predictive distribution over the next state, conditioned on a sequence of observed states $s_{t-h:t}$ and the current action $a_t$:

\[P_\theta(s_{t+1} \mid s_{t-h:t},\; a_t)\]

that approximates the environment’s true conditional $P(s_{t+1} \mid s_{t-h:t},\; a_t)$. For this blog post, we’ll assume a Markovian model $P(s_{t+1} \mid s_t,\; a_t)$ for simplicity (all results here can be extended to the more general case), and when the model is deterministic it reduces to a map over states:

\[s_{t+1} = F_\theta(s_t, a_t).\]

In practice the state $s_t$ is often a learned latent representation (e.g., encoded from pixels), so the model operates in a (theoretically) compact, differentiable space. The key point is that a world model gives you a *differentiable simulator*; you can roll it forward under hypothetical action sequences and backpropagate through the predictions.

Planning: choosing actions by optimizing through the model

Given a start $s_0$ and a goal $g$, the simplest planner chooses an action sequence $\mathbf{a}=(a_0,\dots,a_{T-1})$ by rolling out the model and minimizing terminal error:

\[\min_{\mathbf{a}} \; \| s_T(\mathbf{a}) - g \|_2^2, \quad \text{where } s_T(\mathbf{a}) = \mathcal{F}_{\theta}^{T}(s_0,\mathbf{a}).\]

Here we use $\mathcal{F}^T$ as shorthand for the full rollout through the world model (dependence on model parameters $\theta$ is implicit):

\[\mathcal{F}_{\theta}^{T}(s_0, \mathbf{a}) = F_\theta(F_\theta(\cdots F_\theta(s_0, a_0), \cdots, a_{T-2}), a_{T-1}).\]

In short horizons and low-dimensional systems, this can work reasonably well. But as horizons grow and models become larger and more expressive, its weaknesses become amplified.
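As a rough illustration of the rollout planner above, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the linear map `F(s, a) = A s + B a`, the matrices `A` and `B`, and the function names are all assumptions standing in for a learned $F_\theta$, and the gradient is written by hand as backpropagation through time instead of using an autodiff library.

```python
import numpy as np

# Hypothetical stand-in for a learned world model F_theta: a fixed linear map.
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # assumed state transition
B = np.array([[0.0], [0.1]])             # assumed effect of the action

def step(s, a):
    """One model step s_{t+1} = F(s_t, a_t)."""
    return A @ s + B @ a

def rollout(s0, actions):
    """Apply the model T times; returns [s_0, ..., s_T]."""
    states = [s0]
    for a in actions:
        states.append(step(states[-1], a))
    return states

def terminal_loss_and_grad(s0, actions, goal):
    """Loss ||s_T - g||^2 and its gradient w.r.t. every action (manual BPTT)."""
    err = rollout(s0, actions)[-1] - goal
    lam = 2.0 * err                       # dL/ds_T
    grads = np.zeros_like(actions)
    for t in reversed(range(len(actions))):
        grads[t] = B.T @ lam              # dL/da_t = (D_a F)^T dL/ds_{t+1}
        lam = A.T @ lam                   # dL/ds_t = (D_s F)^T dL/ds_{t+1}
    return float(err @ err), grads

def plan(s0, goal, T=20, lr=1.0, iters=500):
    """Gradient descent on the action sequence through the full rollout."""
    actions = np.zeros((T, 1))
    for _ in range(iters):
        _, g = terminal_loss_and_grad(s0, actions, goal)
        actions -= lr * g
    return actions
```

On this toy problem the objective is a convex least-squares problem, so plain gradient descent is enough; the difficulties discussed next only appear with longer horizons and nonlinear, learned models.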

So why doesn’t this just work at scale?

Why long-horizon planning is hard (even when everything is differentiable)

There are two separate pain points for the more general world model, plus a third that is specific to learned, deep learning-based models.

1) Long-horizon rollouts create deep, ill-conditioned computation graphs

Those familiar with backprop through time (BPTT) may notice that we’re differentiating through a model applied to itself repeatedly, which will lead to the exploding/vanishing gradients problem. Namely, if we take derivatives (note we’re differentiating vector-valued functions, resulting in Jacobians that we denote with $D_x (\cdots)$) with respect to earlier actions (e.g. $a_0$):

\[D_{a_0} \mathcal{F}_{\theta}^{T}(s_0, \mathbf{a}) = \Bigl(\prod_{t=1}^{T-1} D_s F_\theta(s_t, a_t)\Bigr) D_{a_0}F_\theta(s_0, a_0).\]

We see that the Jacobian’s conditioning scales exponentially with time $T$:

\[\sigma_{\text{max/min}}(D_{a_0}\mathcal{F}_{\theta}^{T}) \sim \sigma_{\text{max/min}}(D_s F_\theta)^{T-1},\]

leading to exploding or vanishing gradients.
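To make the exponential scaling concrete, the following sketch multiplies a fixed state Jacobian with itself $T-1$ times and reports the extreme singular values. The constant Jacobian `J_s` (singular values 1.2 and 0.8) and the identity action Jacobian `J_a` are assumptions chosen for illustration; a learned model's Jacobians vary along the trajectory.

```python
import numpy as np

def rotation(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

R = rotation(0.3)
J_s = R @ np.diag([1.2, 0.8]) @ R.T   # assumed constant state Jacobian D_s F
J_a = np.eye(2)                        # assumed action Jacobian D_a F

def rollout_jacobian_singular_values(T):
    """Singular values of (prod of T-1 state Jacobians) @ D_a F, largest first."""
    prod = J_a.copy()
    for _ in range(T - 1):
        prod = J_s @ prod
    return np.linalg.svd(prod, compute_uv=False)

for T in (5, 20, 50):
    smax, smin = rollout_jacobian_singular_values(T)
    print(f"T={T:3d}  sigma_max={smax:.3e}  sigma_min={smin:.3e}  cond={smax / smin:.3e}")
```

Because `J_s` is symmetric here, the largest and smallest singular values are exactly $1.2^{T-1}$ and $0.8^{T-1}$, so one direction explodes while the other vanishes as $T$ grows.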

2) The landscape is non-greedy and full of traps

At short horizons, the greedy solution, where we move straight toward the goal at every step, is often good enough. If you only need to plan a few steps ahead, the optimal trajectory usually doesn’t deviate much from “head toward $g$” at each step.

As horizons grow, two things happen. First, longer tasks are more likely to require *non-greedy* behavior: going around a wall, repositioning before pushing, backing up to take a better path. And as horizons grow, more of these non-greedy steps are typically needed. Second, the optimization space itself scales with horizon: $\mathrm{dim}(\mathcal{A} \times \cdots \times \mathcal{A}) = T\mathrm{dim}(\mathcal{A})$, further expanding the space of local minima for the optimization problem.

Distance to goal along the optimal path is non-monotonic, and the resulting loss landscape can be rough.

A long-horizon fix: lifting the dynamics constraint

Suppose we treat the dynamics constraint $s_{t+1} = F_{\theta}(s_t, a_t)$ as a soft constraint, and we instead optimize the following penalty function over both actions $(a_0,\ldots,a_{T-1})$ and states $(s_0,\ldots,s_T)$:

\[\min_{\mathbf{s},\mathbf{a}} \mathcal{L}(\mathbf{s}, \mathbf{a}) = \sum_{t=0}^{T-1} \big\|F_\theta(s_t,a_t) - s_{t+1}\big\|_2^2, \quad \text{with } s_0 \text{ fixed and } s_T=g.\]

This is also sometimes called *collocation* in the planning/robotics literature. Note that the lifted formulation shares the same *global* minimizers as the original rollout objective (both are zero exactly when the actions produce a dynamically feasible trajectory from $s_0$ to $g$). But the optimization landscapes are very different, and we get two immediate benefits:

  • Each world model evaluation $F_{\theta}(s_t,a_t)$ depends only on local variables, so all $T$ terms can be computed in parallel across time, resulting in a huge speed-up for longer horizons, and
  • You no longer backpropagate through a single deep $T$-step composition to get a learning signal, since the previous product of Jacobians now splits into a sum, e.g.:

\[D_{a_0} \mathcal{L} = 2\,\bigl(F_\theta(s_0, a_0) - s_1\bigr)^{\top} D_{a_0} F_\theta(s_0, a_0).\]
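The two benefits can be seen in a small sketch of the lifted objective. As before, the linear model `F(s, a) = A s + B a` and all names are assumptions made for illustration, not the authors' code: all $T$ model evaluations happen in one batched operation, and each gradient involves a single Jacobian instead of a product over time.

```python
import numpy as np

# Hypothetical linear stand-in for a learned model: F(s, a) = A s + B a.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

def residuals(states, actions):
    """r_t = F(s_t, a_t) - s_{t+1} for every t at once (one batched model call)."""
    return states[:-1] @ A.T + actions @ B.T - states[1:]

def lifted_loss(states, actions):
    """Sum over t of ||F(s_t, a_t) - s_{t+1}||^2."""
    return float(np.sum(residuals(states, actions) ** 2))

def lifted_grads(states, actions):
    """Gradients of the lifted loss; states[0] = s_0 and states[-1] = g stay fixed."""
    r = residuals(states, actions)
    grad_a = 2.0 * r @ B            # row t is 2 (D_a F)^T r_t: a single Jacobian
    grad_s = np.zeros_like(states)
    grad_s[:-1] += 2.0 * r @ A      # s_t as the state input of F
    grad_s[1:] -= 2.0 * r           # s_{t+1} as the target
    grad_s[0] = 0.0                 # s_0 is fixed
    grad_s[-1] = 0.0                # s_T = g is fixed
    return grad_a, grad_s
```

Here `states` has shape `(T + 1, 2)` and `actions` has shape `(T, 1)`. Note that `grad_s` still contains the term through the state input of the model, which is the pathway the later sections identify as brittle for deep models.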

Being able to optimize states directly also helps with exploration, as we can temporarily navigate through unphysical domains to find the optimal plan:

Collocation-based planning allows us to directly perturb states and explore midpoints more effectively.

However, lunch is never free. And indeed, especially for deep learning-based world models, there is a critical issue that makes the above optimization quite difficult in practice.

An issue for deep learning-based world models: sensitivity of state-input gradients

The tl;dr of this section is: directly optimizing states through a deep learning-based $F_{\theta}$ is incredibly brittle, à la *adversarial robustness*. Even if you train your world model in a lower-dimensional state space, the training process for the world model makes unseen state landscapes very sharp, whether it be an unseen state itself or simply a normal/orthogonal direction to the data manifold.

Adversarial robustness and the “dimpled manifold” model

Adversarial robustness originally looked at classification models $f_\theta : \mathbb{R}^{w\times h \times c} \to \mathbb{R}^K$, and showed that by following the gradient of a particular logit $\nabla f_\theta^k$ from a base image $x$ (not of class $k$), you did not have to move far along $x' = x + \epsilon\nabla f_\theta^k$ to make $f_\theta$ classify $x'$ as $k$ (Szegedy et al., 2014; Goodfellow et al., 2015):

Depiction of the classic example from (Goodfellow et al., 2015).
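The update $x' = x + \epsilon\nabla f_\theta^k$ can be tried on a toy problem. The sketch below uses a random linear classifier `f(x) = W x`, which is an assumption made purely for illustration and not one of the deep networks studied in the cited papers; for a linear model the gradient of logit $k$ is simply the row `W[k]`. The step-size formula also assumes that moving along `W[k]` raises logit $k$ faster than every other logit, which holds for this random high-dimensional `W`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 1000, 10
W = rng.standard_normal((K, d))   # hypothetical linear classifier f(x) = W x
x = rng.standard_normal(d)        # base input

base_class = int(np.argmax(W @ x))
target = (base_class + 1) % K     # any class other than the current prediction

def targeted_perturbation(W, x, k):
    """Smallest step (plus 1% margin) along grad f_k = W[k] that makes k the argmax."""
    logits = W @ x
    gap = logits - logits[k]              # how far each class is ahead of k
    gain = W[k] @ W[k] - W @ W[k]         # growth of (f_k - f_j) per unit step
    others = np.arange(W.shape[0]) != k
    eps = 1.01 * np.max(gap[others] / gain[others])
    return x + eps * W[k], eps

x_adv, eps = targeted_perturbation(W, x, target)
relative_size = np.linalg.norm(x_adv - x) / np.linalg.norm(x)
print("prediction before:", base_class, "after:", int(np.argmax(W @ x_adv)))
print("relative perturbation size:", relative_size)
```

In high dimensions the perturbation needed to flip the prediction is small relative to the input norm, which is the qualitative effect described above.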

Later work has painted a geometric picture for what’s going on: for data near a low-dimensional manifold $\mathcal{M}$, the training process controls behavior in tangential directions, but does not regularize behavior in orthogonal directions, thus leading to sensitive behavior (Stutz et al., 2019). Another way stated: $f_\theta$ has a reasonable Lipschitz constant when considering only tangential directions to the data manifold $\mathcal{M}$, but can have very high Lipschitz constants in normal directions. In fact, it often benefits the model to be sharper in these normal directions, so it can fit more complicated functions more precisely.

Adversarial perturbations leave the data manifold

As a result, such adversarial examples are incredibly common even for a single given model. Further, this is not just a computer vision phenomenon; adversarial examples also appear in LLMs (Wallace et al., 2019) and in RL (Gleave et al., 2019).

While there are methods to train for more adversarially robust models, there is a known trade-off between model performance and adversarial robustness (Tsipras et al., 2019): especially in the presence of many weakly-correlated variables, the model *must* be sharper to achieve higher performance. Indeed, most modern training algorithms, whether in computer vision or LLMs, do not train adversarial robustness out. Thus, at least until deep learning sees a major regime change, this is a problem we’re stuck with.

Why is adversarial robustness an issue for world model planning?

Consider a single component of the dynamics loss we’re optimizing in the lifted state approach:

\[\min_{s_t, a_t, s_{t+1}} \|F_\theta(s_t, a_t) - s_{t+1}\|_2^2\]

Let’s further focus on just the base state:

\[\min_{s_t} \|F_\theta(s_t, a_t) - s_{t+1}\|_2^2.\]

Since world models are typically trained on state/action trajectories $(s_1, a_1, s_2, a_2, \ldots)$, the state-data manifold for $F_{\theta}$ has dimensionality bounded by the action space:

\[\mathrm{dim}(\mathcal{M}_s) \le \mathrm{dim}(\mathcal{A}) + 1 + \mathrm{dim}(\mathcal{R}),\]

where $\mathcal{R}$ is some optional space of augmentations (e.g. translations/rotations). Thus, we can typically expect $\mathrm{dim}(\mathcal{M}_s)$ to be much lower than $\mathrm{dim}(\mathcal{S})$, and thus: it is very easy to find adversarial examples that hack any state to any other desired state.

As a result, the dynamics optimization

\[\sum_{t=0}^{T-1} \big\|F_\theta(s_t,a_t) - s_{t+1}\big\|_2^2\]

feels incredibly “sticky,” as the base points $s_t$ can easily trick $F_{\theta}$ into thinking it’s already made its local goal.[1]

Adversarial world model example

1. This adversarial robustness issue, while particularly bad for lifted-state approaches, is not unique to them. Even for serial optimization methods that optimize through the full rollout map $\mathcal{F}^T$, it is possible to get into unseen states, where it is very easy to have a normal component fed into the sensitive normal components of $D_s F_{\theta}$. The action Jacobian’s chain rule expansion is

\[\Bigl(\prod_{t=1}^{T-1} D_s F_\theta(s_t, a_t)\Bigr) D_{a_0}F_\theta(s_0, a_0).\]

See what happens if any stage of the product has any component normal to the data manifold.

Our fix

This is where our new planner GRASP comes in. The main observation: while $D_s F_{\theta}$ is untrustworthy and adversarial, the action space is usually low-dimensional and exhaustively trained, so $D_a F_{\theta}$ is actually reasonable to optimize through and doesn’t suffer from the adversarial robustness issue!

The action input is usually lower-dimensional and densely trained (the model has seen every action direction), so action gradients are much better behaved.

At its core, GRASP builds a first-order lifted state / collocation-based planner that is only dependent on action Jacobians through the world model. We thus exploit the differentiability of learned world models $F_{\theta}$, while not falling victim to the inherent sensitivity of the state Jacobians $D_s F_{\theta}$.

GRASP: Gradient RelAxed Stochastic Planner

As noted before, we start with the collocation planning objective, where we lift the states and relax dynamics into a penalty:

\[\min_{\mathbf{s},\mathbf{a}} \mathcal{L}(\mathbf{s}, \mathbf{a}) = \sum_{t=0}^{T-1} \big\|F_\theta(s_t,a_t) - s_{t+1}\big\|_2^2, \quad \text{with } s_0 \text{ fixed and } s_T=g.\]

We then make two key additions.

Ingredient 1: Exploration by noising the state iterates

Even with a smoother objective, planning is nonconvex. We introduce exploration by injecting Gaussian noise into the virtual state updates during optimization.

A simple version:

\[s_t \leftarrow s_t - \eta_s \nabla_{s_t}\mathcal{L} + \sigma_{\text{state}} \xi, \qquad \xi\sim\mathcal{N}(0,I).\]

Actions are still updated by non-stochastic descent:

\[a_t \leftarrow a_t - \eta_a \nabla_{a_t}\mathcal{L}.\]

The state noise helps you “hop” between basins in the lifted space, while the actions remain guided by gradients. We found that specifically noising states here (as opposed to actions) finds a good balance of exploration and the ability to find sharper minima.[2]

2. Because we only noise the states (and not the actions), the corresponding dynamics are not truly Langevin dynamics.
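One iteration of the two update rules above might look like the following sketch. It assumes the same hypothetical linear model `F(s, a) = A s + B a` used purely for illustration, and it uses the full lifted gradient, i.e. before the gradient reshaping of Ingredient 2; the step sizes and noise scale are placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical linear stand-in for a learned model: F(s, a) = A s + B a.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

def lifted_grads(states, actions):
    """Gradients of sum_t ||F(s_t, a_t) - s_{t+1}||^2 w.r.t. states and actions."""
    r = states[:-1] @ A.T + actions @ B.T - states[1:]
    grad_a = 2.0 * r @ B
    grad_s = np.zeros_like(states)
    grad_s[:-1] += 2.0 * r @ A
    grad_s[1:] -= 2.0 * r
    return grad_s, grad_a

def noisy_step(states, actions, eta_s, eta_a, sigma_state, rng):
    """Noisy gradient step on the virtual states, plain gradient step on the actions."""
    grad_s, grad_a = lifted_grads(states, actions)
    noise = sigma_state * rng.standard_normal(states.shape)
    new_states = states - eta_s * grad_s + noise
    new_states[0] = states[0]        # s_0 stays fixed
    new_states[-1] = states[-1]      # s_T = g stays fixed
    new_actions = actions - eta_a * grad_a   # no noise on the actions
    return new_states, new_actions

# Usage: straight-line initial guess between start and goal, zero actions.
T = 10
states = np.linspace([0.0, 0.0], [1.0, 0.0], T + 1)
actions = np.zeros((T, 1))
rng = np.random.default_rng(0)
for _ in range(100):
    states, actions = noisy_step(states, actions, 0.05, 0.05, 0.01, rng)
```

Only the interior virtual states are perturbed, so two runs with different random seeds take different paths through state space while the action update itself remains a deterministic function of the current iterate.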

Ingredient 2: Reshape gradients: stop brittle state-input gradients, keep action gradients

As discussed, the fragile pathway is the gradient that flows *into the state input* of the world model, $D_s F_{\theta}$. The most straightforward way to cut this pathway is to stop state gradients into $F_{\theta}$ directly:

  • Let $\bar{s}_t$ be the same value as $s_t$, but with gradients stopped.

Define the stop-gradient dynamics loss:

\[\mathcal{L}_{\text{dyn}}^{\text{sg}}(\mathbf{s},\mathbf{a}) = \sum_{t=0}^{T-1} \big\|F_\theta(\bar{s}_t, a_t) - s_{t+1}\big\|_2^2.\]
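To see which gradient terms survive the stop-gradient, here is a hand-derived sketch on the same hypothetical linear model `F(s, a) = A s + B a` (an assumption for illustration; with an autodiff library one would instead detach the state input). The action gradient is unchanged, while each state only receives gradient through its role as the target $s_{t+1}$.

```python
import numpy as np

# Hypothetical linear stand-in for a learned model: F(s, a) = A s + B a.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

def predictions(states, actions):
    """F(s_bar_t, a_t) for every t; the state input is treated as a constant."""
    return states[:-1] @ A.T + actions @ B.T

def stopgrad_dynamics_grads(states, actions):
    """Gradients of the stop-gradient dynamics loss."""
    r = predictions(states, actions) - states[1:]   # r_t = F(s_bar_t, a_t) - s_{t+1}
    grad_a = 2.0 * r @ B          # action pathway kept: 2 (D_a F)^T r_t
    grad_s = np.zeros_like(states)
    grad_s[1:] = -2.0 * r         # only the target s_{t+1} receives gradient
    # The 2 (D_s F)^T r_t term through the state input is dropped by the stop-gradient.
    return grad_s, grad_a
```

With a state step size of 0.5, the update `states - 0.5 * grad_s` moves every $s_{t+1}$ exactly onto the prediction $F_\theta(s_t, a_t)$ from its predecessor, which makes the one-directional pull visible.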

This alone does not work. Notice now states only follow the previous state’s step, without anything forcing the base states to chase the next ones. As a result, there are trivial minima fo
