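MIT ML News · March 11, 2026, 13:00 · about 6 min read

An improved method for planning complex visual tasks

#VisionLanguageModels (VLM) #AutomatedPlanGeneration #ClassicalPlanningSolvers #AutonomousRobotControl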

TL;DR

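MIT researchers developed a two-stage AI framework that combines vision-language models with a classical planning solver. On complex, long-horizon visual tasks it reached an average plan success rate of about 70 percent, versus about 30 percent for the best baseline methods.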

AI deep analysis · March 11, 2026, 14:41

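Key points

Building a two-stage architecture

A vision-language model (VLM) perceives the scene and simulates actions, and a second model translates those simulations into a planning language and refines the result, forming an automated pipeline.

Higher success rate and handling of unseen problems

The framework reached an average success rate of about 70 percent, compared with about 30 percent for the best baseline methods. It also generated valid plans for more than half of the scenarios it had not seen before, which makes it well-suited to real environments where conditions change quickly.

Closing the gap between visual perception and logical planning

The pipeline compensates automatically for the weakness of VLMs in understanding spatial relationships and for the inability of classical solvers to take visual input, removing the need for experts to encode problems by hand.

Suitability for real-world applications

The design targets long-horizon visual tasks such as robot navigation and autonomous driving, and the researchers say it could be useful in many real-life applications.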

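Impact analysis

This method addresses a long-standing technical gap between perception and logical planning in AI. Its comparatively high success rate, including on previously unseen scenarios, could improve the reliability and practicality of autonomous robots and self-driving systems. Optimizing computational cost and running large-scale validation experiments remain open challenges, but the work is a notable milestone toward applying AI in the real world.

Editor's comment

By linking visual perception with logical reasoning, this framework marks an important step in integrating perception and planning in AI. An average success rate of about 70 percent gives realistic grounds for optimism about the next stage of autonomous systems.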

Title: A better method for planning complex visual tasks

MIT researchers have developed a generative artificial intelligence-driven approach for planning long-term visual tasks, like robot navigation, that is about twice as effective as some existing techniques.

Their method uses a specialized vision-language model to perceive the scenario in an image and simulate actions needed to reach a goal. Then a second model translates those simulations into a standard programming language for planning problems, and refines the solution.

In the end, the system automatically generates a set of files that can be fed into classical planning software, which computes a plan to achieve the goal. This two-step system generated plans with an average success rate of about 70 percent, outperforming the best baseline methods that could only reach about 30 percent.

Importantly, the system can solve new problems it hasn’t encountered before, making it well-suited for real environments where conditions can change at a moment’s notice.

“Our framework combines the advantages of vision-language models, like their ability to understand images, with the strong planning capabilities of a formal solver,” says Yilun Hao, an aeronautics and astronautics (AeroAstro) graduate student at MIT and lead author of an open-access paper on this technique. “It can take a single image and move it through simulation and then to a reliable, long-horizon plan that could be useful in many real-life applications.”

She is joined on the paper by Yongchao Chen, a graduate student in the MIT Laboratory for Information and Decision Systems (LIDS); Chuchu Fan, an associate professor in AeroAstro and a principal investigator in LIDS; and Yang Zhang, a research scientist at the MIT-IBM Watson AI Lab. The paper will be presented at the International Conference on Learning Representations.

Tackling visual tasks

For the past few years, Fan and her colleagues have studied the use of generative AI models to perform complex reasoning and planning, often employing large language models (LLMs) to process text inputs.

Many real-world planning problems, like robotic assembly and autonomous driving, have visual inputs that an LLM can’t handle well on its own. The researchers sought to expand into the visual domain by utilizing vision-language models (VLMs), powerful AI systems that can process images and text.

But VLMs struggle to understand spatial relationships between objects in a scene and often fail to reason correctly over many steps. This makes it difficult to use VLMs for long-range planning.

On the other hand, scientists have developed robust, formal planners that can generate effective long-horizon plans for complex situations. However, these software systems can’t process visual inputs and require expert knowledge to encode a problem into language the solver can understand.

Fan and her team built an automatic planning system that takes the best of both methods. The system, called VLM-guided formal planning (VLMFP), utilizes two specialized VLMs that work together to turn visual planning problems into ready-to-use files for formal planning software.

The researchers first carefully trained a small model they call SimVLM to specialize in describing the scenario in an image using natural language and simulating a sequence of actions in that scenario. Then a much larger model, which they call GenVLM, uses the description from SimVLM to generate a set of initial files in a formal planning language known as the Planning Domain Definition Language (PDDL).

The files are ready to be fed into a classical PDDL solver, which computes a step-by-step plan to solve the task. GenVLM compares the results of the solver with those of the simulator and iteratively refines the PDDL files.
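The generate, solve, simulate, and refine cycle described above can be illustrated with a toy sketch. Everything here is a stand-in chosen for illustration and is not the researchers' code: `solve` is a small breadth-first search playing the role of the classical solver, `simulate` plays the role of the simulator with fixed "true" rules, and an ordered list of candidate action models replaces the refinements a model like GenVLM would propose. The corridor world and all function names are assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Tuple

# A candidate model maps an action name to the change in position it claims to cause.
Model = Dict[str, int]


def solve(model: Model, start: int, goal: int, limit: int = 20) -> Optional[List[str]]:
    """Stand-in for a classical solver: breadth-first search under the candidate model."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        pos, plan = queue.popleft()
        if pos == goal:
            return plan
        if len(plan) >= limit:  # bound the search depth
            continue
        for name, delta in model.items():
            nxt = pos + delta
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [name]))
    return None


def simulate(plan: List[str], start: int, goal: int) -> bool:
    """Stand-in for the simulator: runs the plan under the true rules of the toy world."""
    true_effects = {"left": -1, "right": 1}
    pos = start
    for name in plan:
        pos += true_effects[name]
    return pos == goal


def refine_until_consistent(
    candidates: Iterable[Model], start: int, goal: int
) -> Optional[Tuple[Model, List[str], int]]:
    """Try candidate models in order until the solver's plan also succeeds in simulation."""
    for attempt, model in enumerate(candidates, start=1):
        plan = solve(model, start, goal)
        if plan is not None and simulate(plan, start, goal):
            return model, plan, attempt
    return None


if __name__ == "__main__":
    candidates = [
        {"left": 1, "right": -1},  # wrong: directions swapped
        {"left": -1, "right": 1},  # consistent with the simulator
    ]
    print(refine_until_consistent(candidates, start=0, goal=3))
```

The first candidate yields a plan that the solver accepts but the simulator rejects, so the loop moves on to the second candidate. In the system described in the article, the revision step is performed by a large model editing PDDL files rather than by picking from a fixed list.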

“The generator and simulator work together to be able to reach the exact same result, which is an action simulation that achieves the goal,” Hao says.

Because GenVLM is a large generative AI model, it has seen many examples of PDDL during training and learned how this formal language can solve a wide range of problems. This existing knowledge enables the model to generate accurate PDDL files.

A flexible approach

VLMFP generates two separate PDDL files. The first is a domain file that defines the environment, valid actions, and domain rules. It also produces a problem file that defines the initial states and the goal of a particular problem at hand.

“One advantage of PDDL is the domain file is the same for all instances in that environment. This makes our framework good at generalizing to unseen instances under the same domain,” Hao explains.
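To make the split between the two files concrete, here is a minimal sketch, assuming a hypothetical one-dimensional "corridor" domain that does not appear in the article. The domain text is written once, and a small helper (`make_problem`, also hypothetical) renders a separate problem file for each instance. The predicate and action names are illustrative only.

```python
# Hypothetical domain file: shared by every instance of the corridor environment.
DOMAIN = """(define (domain corridor)
  (:requirements :strips)
  (:predicates (at ?c) (adjacent ?from ?to))
  (:action move
    :parameters (?from ?to)
    :precondition (and (at ?from) (adjacent ?from ?to))
    :effect (and (at ?to) (not (at ?from)))))
"""


def make_problem(name: str, n_cells: int, start: int, goal: int) -> str:
    """Render a problem file: objects, initial state, and goal for one instance."""
    cells = [f"c{i}" for i in range(n_cells)]
    adjacency = []
    for a, b in zip(cells, cells[1:]):
        # Neighbouring cells are connected in both directions.
        adjacency += [f"(adjacent {a} {b})", f"(adjacent {b} {a})"]
    init = " ".join([f"(at {cells[start]})"] + adjacency)
    return (
        f"(define (problem {name})\n"
        f"  (:domain corridor)\n"
        f"  (:objects {' '.join(cells)})\n"
        f"  (:init {init})\n"
        f"  (:goal (at {cells[goal]})))\n"
    )


if __name__ == "__main__":
    # Two different instances reuse the same DOMAIN text unchanged.
    print(DOMAIN)
    print(make_problem("short", n_cells=3, start=0, goal=2))
    print(make_problem("long", n_cells=6, start=5, goal=1))
```

Only the problem file changes between the two instances, which is the property Hao points to when explaining why the framework generalizes within a domain.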

To enable the system to generalize effectively, the researchers needed to carefully design just enough training data for SimVLM so the model learned to understand the problem and goal without memorizing patterns in the scenario. When tested, SimVLM successfully described the scenario, simulated actions, and detected if the goal was reached in about 85 percent of experiments.

Overall, the VLMFP framework achieved a success rate of about 60 percent on six 2D planning tasks and greater than 80 percent on two 3D tasks, including multirobot collaboration and robotic assembly. It also generated valid plans for more than 50 percent of scenarios it hadn’t seen before, far outpacing the baseline methods.

“Our framework can generalize when the rules change in different situations. This gives our system the flexibility to solve many types of visual-based planning problems,” Fan adds.

In the future, the researchers want to enable VLMFP to handle more complex scenarios and explore methods to identify and mitigate hallucinations by the VLMs.

“In the long term, generative AI models could act as agents and make use of the right tools to solve much more complicated problems. But what does it mean to have the right tools, and how do we incorporate those tools? There is still a long way to go, but by bringing visual-based planning into the picture, this work is an important piece of the puzzle,” Fan says.

This work was funded, in part, by the MIT-IBM Watson AI Lab.
