Synced Review · June 24, 2025, 18:17 · about a 13-minute read

ByteDance Introduces Astra, a Dual-Model Architecture for Autonomous Robot Navigation

#MLLM #Multimodal #RobotNavigation #System1-2 #ByteDance

TL;DR

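ByteDance has introduced Astra, a dual-model architecture based on the System 1/System 2 paradigm. It targets the open problems of autonomous robot navigation and offers a new approach toward general-purpose mobile robots.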

AI in-depth analysis · May 3, 2026, 01:12

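Key points

Adoption of a dual-model architecture

Instead of the conventional collection of small, often rule-based modules, Astra uses two models: Astra-Global for low-frequency tasks (localization) and Astra-Local for high-frequency tasks (path planning and odometry).

Application of the System 1/System 2 paradigm

Modeled on human cognition, Astra separates fast, intuitive processing (Local) from deliberate, reasoning-based processing (Global) and integrates the two, with the aim of making navigation in complex indoor environments more efficient.

High-accuracy localization with an MLLM

Astra-Global is a multimodal large language model (MLLM). It takes a hybrid topological-semantic graph as context and answers image or text queries, which enables self-localization without artificial landmarks such as QR codes.

Significance as a foundation for general-purpose robots

By removing a limitation of conventional methods in repetitive environments such as warehouses and factories (reliance on artificial landmarks), Astra is a step toward general-purpose mobile robots that operate autonomously across diverse indoor spaces.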

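Offline map construction

Astra builds a hybrid topological-semantic graph whose nodes are temporally downsampled keyframes, whose edges are based on relative poses, and whose semantic landmarks are extracted from visual data.

Language-based target localization and the two-stage process

The model interprets natural-language instructions to identify relevant landmarks. For visual-language localization, a coarse stage narrows down the candidates and a fine stage compares visual and positional information to predict the pose.

Training with GRPO

Combining SFT with Group Relative Policy Optimization (GRPO) improved zero-shot generalization, reaching 99.9% localization accuracy in unseen home environments.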

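Impact analysis

This announcement goes beyond an incremental algorithmic improvement: it suggests a new architectural paradigm for robot navigation that combines the reasoning ability of large language models with fast, high-frequency processing. In particular, by increasing robot autonomy in complex and dynamic indoor environments, it could substantially widen the practical range of logistics, service, and industrial robots.

Editorial comment

The System 1/System 2 approach departs sharply from existing robot control methods and could become a turning point toward general-purpose robots in the real world. The integration of language understanding with visual localization in particular deserves attention as a candidate standard architecture for next-generation autonomous systems.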

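As robots are integrated into more and more sectors, from industrial manufacturing to daily life, demand for advanced navigation systems is growing. However, contemporary robot navigation systems face significant challenges in diverse and complex indoor environments, which exposes the limitations of traditional approaches. To address the fundamental questions "Where am I?", "Where am I going?", and "How do I get there?", ByteDance developed Astra, a dual-model architecture designed to overcome these navigation bottlenecks and enable general-purpose mobile robots.

Traditional navigation systems typically consist of multiple small, often rule-based modules that handle three core challenges: target localization, self-localization, and path planning. Target localization means understanding natural-language or image cues to pinpoint a destination on a map. Self-localization requires the robot to determine its precise position within a map; this is especially hard in repetitive environments such as warehouses, where traditional methods often rely on artificial landmarks (e.g., QR codes). Path planning further divides into global planning, which generates a rough route, and local planning, which handles real-time obstacle avoidance and reaching intermediate waypoints.

Foundation models have shown promise in consolidating smaller models to tackle broader tasks, but the optimal number of models for comprehensive navigation, and how to integrate them effectively, remains an open question.

ByteDance's Astra, detailed in the paper "Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning" (website: https://astra-mobility.github.io/), addresses these limitations. Following the System 1/System 2 paradigm (a dual-process model of intuitive and analytical thinking), Astra features two primary sub-models: Astra-Global and Astra-Local. Astra-Global handles low-frequency tasks such as target localization and self-localization, while Astra-Local manages high-frequency tasks such as local path planning and odometry estimation. The architecture promises to change how robots navigate complex indoor spaces.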

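Astra-Global: The Intelligent Brain for Global Localization

Astra-Global is the intelligent core of the Astra architecture and is responsible for the critical low-frequency tasks of self-localization and target localization. It functions as a Multimodal Large Language Model (MLLM) that processes both visual and linguistic inputs to achieve precise global positioning within a map. Its strength lies in using a hybrid topological-semantic graph as contextual input, which lets the model locate positions accurately from query images or text prompts.

Building this localization system begins with offline mapping. The research team developed an offline method to construct a hybrid topological-semantic graph G=(V,E,L):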

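V (Nodes): Keyframes, obtained by temporally downsampling the input video and paired with 6-Degrees-of-Freedom (DoF) camera poses estimated by SfM (Structure from Motion), act as nodes that encode camera poses and landmark references.

E (Edges): Undirected edges establish connectivity based on relative node poses and are essential for global path planning.

L (Landmarks): Semantic landmark information is extracted by Astra-Global from the visual data at each node, enriching the map's semantic understanding. These landmarks store semantic attributes and are connected to multiple nodes through co-visibility relationships.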
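The graph G=(V,E,L) described above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class names, the field layout, the use of metres and radians, and the distance-threshold rule for creating edges are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Node:
    """Keyframe node: a 6-DoF camera pose plus references to visible landmarks."""
    node_id: int
    position: tuple[float, float, float]     # x, y, z (metres, assumed unit)
    orientation: tuple[float, float, float]  # roll, pitch, yaw (radians, assumed)
    landmark_ids: set[int] = field(default_factory=set)


@dataclass
class Landmark:
    """Semantic landmark with free-form attributes and co-visibility links."""
    landmark_id: int
    attributes: dict[str, str]
    node_ids: set[int] = field(default_factory=set)


@dataclass
class HybridGraph:
    nodes: dict[int, Node] = field(default_factory=dict)
    edges: set[frozenset[int]] = field(default_factory=set)  # undirected pairs
    landmarks: dict[int, Landmark] = field(default_factory=dict)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def connect_nearby(self, max_dist: float) -> None:
        """Hypothetical edge rule: link nodes whose positions are within max_dist."""
        ids = sorted(self.nodes)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                gap = math.dist(self.nodes[a].position, self.nodes[b].position)
                if gap <= max_dist:
                    self.edges.add(frozenset((a, b)))

    def observe(self, node_id: int, landmark: Landmark) -> None:
        """Record that a landmark is visible from a node (co-visibility)."""
        stored = self.landmarks.setdefault(landmark.landmark_id, landmark)
        stored.node_ids.add(node_id)
        self.nodes[node_id].landmark_ids.add(stored.landmark_id)


graph = HybridGraph()
for node_id, x in [(0, 0.0), (1, 1.0), (2, 10.0)]:
    graph.add_node(Node(node_id, (x, 0.0, 0.0), (0.0, 0.0, 0.0)))
graph.connect_nearby(max_dist=2.0)

sign = Landmark(0, {"type": "door sign", "text": "Room 101"})
graph.observe(0, sign)
graph.observe(1, sign)
```

In this toy map only nodes 0 and 1 are connected, and the door sign links to both of them, which is the kind of landmark-to-node association the localization steps rely on.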

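In practical localization, Astra-Global's self-localization and target localization use a coarse-to-fine, two-stage process for visual-language localization. The coarse stage analyzes the input image and the localization prompt, detects landmarks, establishes correspondences with a pre-built landmark map, and filters candidates by visual consistency. The fine stage then uses the query image and the coarse output to sample reference map nodes from the offline map and compares their visual and positional information to output the predicted pose directly.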
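The control flow of a coarse-to-fine lookup can be sketched as follows. This is a simplified stand-in under stated assumptions: in Astra both stages are carried out by the MLLM, whereas here the coarse stage is replaced by counting shared landmarks and the fine stage by a caller-supplied similarity function. The function names and the toy data are hypothetical.

```python
from collections import Counter


def coarse_candidates(query_landmarks, node_landmarks, top_k=3):
    """Coarse stage (stand-in): rank map nodes by landmarks shared with the query."""
    votes = Counter()
    for node_id, seen in node_landmarks.items():
        shared = len(query_landmarks & seen)
        if shared > 0:
            votes[node_id] = shared
    # Sort by vote count (descending), then node id, so the result is deterministic.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [node_id for node_id, _ in ranked[:top_k]]


def fine_select(candidates, similarity):
    """Fine stage (stand-in): keep the candidate with the highest similarity score."""
    if not candidates:
        return None
    return max(candidates, key=similarity)


# Toy map: which landmarks each node can see.
node_landmarks = {
    0: {"door_101", "extinguisher"},
    1: {"door_101", "sofa"},
    2: {"plant"},
}
query = {"door_101", "sofa"}

candidates = coarse_candidates(query, node_landmarks)
# Hypothetical similarity scores standing in for the model's visual comparison.
scores = {0: 0.4, 1: 0.9}
best = fine_select(candidates, lambda node_id: scores[node_id])
```

The point of the split is that the cheap first pass discards nodes with no landmark in common, so the more expensive comparison only runs on a handful of candidates.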

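For language-based target localization, the model interprets the natural-language instruction, identifies relevant landmarks using their functional descriptions in the map, and then uses the landmark-to-node association mechanism to locate the relevant nodes and retrieve target images and 6-DoF poses.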
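The lookup from an instruction to map nodes can be illustrated with a deliberately naive sketch. It assumes landmarks carry a text description and replaces the MLLM's language understanding with plain word overlap, so it shows only the landmark-to-node association step; the function name and the sample descriptions are hypothetical.

```python
def find_target_nodes(instruction, landmark_descriptions, landmark_to_nodes):
    """Pick the landmark whose description shares the most words with the
    instruction, then return the ids of the nodes associated with it."""
    words = set(instruction.lower().split())
    best_id, best_overlap = None, 0
    for landmark_id, description in landmark_descriptions.items():
        overlap = len(words & set(description.lower().split()))
        if overlap > best_overlap:
            best_id, best_overlap = landmark_id, overlap
    if best_id is None:
        return []
    return sorted(landmark_to_nodes.get(best_id, ()))


landmark_descriptions = {
    "sofa": "a sofa in the resting area for breaks",
    "printer": "office printer for printing documents",
}
landmark_to_nodes = {"sofa": {7, 4}, "printer": {2}}

target_nodes = find_target_nodes(
    "find the resting area", landmark_descriptions, landmark_to_nodes
)
```

Each returned node id would then be used to fetch the stored keyframe image and its 6-DoF pose from the offline map.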

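To give Astra-Global robust localization abilities, the team used a careful training methodology. With Qwen2.5-VL as the backbone, they combined Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). SFT used diverse datasets covering a range of tasks, including coarse and fine localization, co-visibility detection, and motion trend estimation. In the GRPO phase, a rule-based reward function (comprising format, landmark extraction, map matching, and extra landmark rewards) was used to train for visual-language localization. Experiments showed that GRPO significantly improved Astra-Global's zero-shot generalization, achieving 99.9% localization accuracy in unseen home environments and surpassing SFT-only training.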
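The core idea of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that normalization step and a simple sum of rule-based reward terms. The equal weighting of the terms, the component names, and the use of the population standard deviation are assumptions, not details taken from the paper.

```python
import statistics


def rule_based_reward(components):
    """Sum named reward terms; equal weighting is an assumption for illustration."""
    return sum(components.values())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# One hypothetical response scored on the four reward terms named in the text.
reward = rule_based_reward(
    {"format": 1.0, "landmark_extraction": 0.5, "map_matching": 1.0, "extra_landmark": 0.0}
)

# Rewards for three responses sampled for the same query.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Responses that score above the group average receive a positive advantage and are reinforced, while below-average ones are discouraged; if every response in a group scores the same, all advantages are zero and the group contributes no gradient signal.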

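Astra-Local: The Intelligent Assistant for Local Planning

Astra-Local is the intelligent assistant for Astra's high-frequency tasks: a multi-task network that efficiently generates local paths and accurately estimates odometry from sensor data. Its architecture comprises three core components: a 4D spatio-temporal encoder, a planning head, and an odometry head.

The 4D spatio-temporal encoder replaces the perception and prediction modules of a traditional mobile stack. It begins with a 3D spatial encoder that processes N omnidirectional images with a Vision Transformer (ViT) and Lift-Splat-Shoot to convert 2D image features into 3D voxel features. This 3D encoder is trained with self-supervised learning via 3D volumetric differentiable neural rendering. The 4D spatio-temporal encoder then builds on the 3D encoder: it takes past voxel features and future timestamps as input and predicts future voxel features through ResNet and DiT modules, providing current and future environment representations for planning and odometry.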

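The planning head generates executable trajectories with Transformer-based flow matching, conditioned on the pre-trained 4D features, the robot's speed, and task information. To prevent collisions, it incorporates a masked ESDF (Euclidean Signed Distance Field) loss. This loss computes the ESDF of a 3D occupancy map and applies a 2D ground-truth trajectory mask, which significantly reduces collision rates. Experiments show better collision rate and overall score on out-of-distribution (OOD) datasets than other methods.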
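To make the role of a signed distance field concrete, the sketch below computes one by brute force on a small 2D occupancy grid and turns it into a hinge-style collision penalty. It is an illustration only: it works in 2D rather than on a 3D occupancy map, it omits the ground-truth trajectory mask described above, and the function names and the margin parameter are hypothetical.

```python
import math


def signed_distance_field(grid):
    """Brute-force signed distance field on a 2D occupancy grid.
    Free cells hold the distance to the nearest occupied cell (positive);
    occupied cells hold minus the distance to the nearest free cell."""
    rows, cols = len(grid), len(grid[0])
    occupied = [(r, c) for r in range(rows) for c in range(cols) if grid[r][c]]
    free = [(r, c) for r in range(rows) for c in range(cols) if not grid[r][c]]
    field = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            targets = free if grid[r][c] else occupied
            dist = min(
                (math.hypot(r - tr, c - tc) for tr, tc in targets),
                default=math.inf,
            )
            field[r][c] = -dist if grid[r][c] else dist
    return field


def collision_penalty(field, waypoints, margin):
    """Mean hinge penalty: waypoints closer than `margin` to an obstacle cost more."""
    return sum(max(0.0, margin - field[r][c]) for r, c in waypoints) / len(waypoints)


grid = [
    [0, 0, 0],
    [0, 1, 0],  # single obstacle in the centre
    [0, 0, 0],
]
field = signed_distance_field(grid)

through_obstacle = collision_penalty(field, [(1, 1)], margin=1.5)
around_obstacle = collision_penalty(field, [(0, 0), (0, 1), (0, 2)], margin=1.5)
```

Because the field is negative inside obstacles, a waypoint that lands on an occupied cell is penalized more than one that merely passes close by, which gives the planner a graded signal rather than a binary collision flag.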

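The odometry head predicts the robot's relative pose from current and past 4D features and additional sensor data (e.g., IMU, wheel data). It trains a Transformer model to fuse information from the different sensors. Each sensor modality is processed by its own tokenizer and combined with modality embeddings and temporal positional embeddings; the tokens are fed into a Transformer encoder, and a CLS token is finally used to predict the relative pose. Experiments showed strong performance in multi-sensor fusion and pose estimation, with significantly better rotational accuracy and lower overall trajectory error.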
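A relative-pose prediction only becomes a trajectory once successive predictions are chained together. The sketch below shows that accumulation for the planar case. It assumes a 3-DoF pose (x, y, heading) and that each predicted increment is expressed in the robot's current frame; both are simplifications chosen for illustration, and the function names are hypothetical.

```python
import math


def compose(pose, delta):
    """Apply a relative motion (dx, dy, dtheta), given in the robot's own frame,
    to a global planar pose (x, y, theta)."""
    x, y, theta = pose
    dx, dy, dtheta = delta
    new_x = x + dx * math.cos(theta) - dy * math.sin(theta)
    new_y = y + dx * math.sin(theta) + dy * math.cos(theta)
    heading = theta + dtheta
    # Wrap the heading back into (-pi, pi].
    new_theta = math.atan2(math.sin(heading), math.cos(heading))
    return (new_x, new_y, new_theta)


def integrate(start, deltas):
    """Chain relative poses into a list of global poses."""
    trajectory = [start]
    for delta in deltas:
        trajectory.append(compose(trajectory[-1], delta))
    return trajectory


# Drive 1 m forward while turning left 90 degrees, then 1 m forward again.
trajectory = integrate((0.0, 0.0, 0.0), [(1.0, 0.0, math.pi / 2), (1.0, 0.0, 0.0)])
final_x, final_y, final_theta = trajectory[-1]
```

Since every step is rotated by the accumulated heading, a small rotation error bends the whole remaining path, which is consistent with the article's observation that better rotational accuracy lowers the overall trajectory error.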

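Experimental Validation

Extensive experiments were conducted in diverse indoor environments (warehouses, offices, homes) to evaluate Astra's performance comprehensively.

Astra-Global's multimodal localization capabilities were validated through a range of experiments, which showed strong performance on both text and image localization queries. For target localization, it accurately identifies matching images and poses from text commands (e.g., "find the resting area"). Compared with traditional Visual Place Recognition (VPR) methods, Astra-Global shows clear advantages in the following respects: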

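Detail Capture: Whereas VPR relies on global features, Astra-Global captures fine details such as room numbers, which prevents localization errors in similar-looking scenes.

Viewpoint Robustness: Because it is based on semantic landmarks, Astra-Global maintains stable localization even under large changes in camera angle, a situation in which VPR methods typically fail.

Pose Accuracy: Astra-Global uses the spatial relationships between landmarks to select the best-matching pose and shows significantly higher pose accuracy than traditional VPR, with an improvement of over 30% in warehouse environments. Here, a pose counts as accurate when its distance error is within 1 meter and its angular error is within 5 degrees.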
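The accuracy criterion above (distance error within 1 meter and angular error within 5 degrees) can be written as a short check. This is a sketch assuming planar poses given as (x, y, yaw in degrees); the function names and the sample poses are hypothetical, and only the two thresholds come from the text.

```python
import math


def angle_diff_deg(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def is_localized(pred, truth, max_dist=1.0, max_angle_deg=5.0):
    """True when the predicted pose meets both the distance and angle thresholds."""
    dist = math.hypot(pred[0] - truth[0], pred[1] - truth[1])
    return dist <= max_dist and angle_diff_deg(pred[2], truth[2]) <= max_angle_deg


def localization_accuracy(preds, truths):
    """Fraction of predictions that meet the thresholds."""
    hits = sum(is_localized(p, t) for p, t in zip(preds, truths))
    return hits / len(preds)


preds = [(0.5, 0.0, 3.0), (2.0, 0.0, 0.0), (0.0, 0.2, 359.0)]
truths = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
accuracy = localization_accuracy(preds, truths)
```

The wrap-around in `angle_diff_deg` matters: headings of 359 and 1 degrees differ by 2 degrees, not 358, so the third prediction counts as a hit.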

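Astra-Local's planning head and odometry head were evaluated thoroughly. The planning head, which uses Transformer-based flow matching and the masked ESDF (Euclidean Signed Distance Field) loss, outperformed methods such as ACT and diffusion policies on OOD datasets in collision rate, speed, and overall score. This indicates that the masked ESDF loss is effective at reducing collision risk.

The odometry head was assessed on multimodal datasets that include synchronized image sequences, IMU (inertial measurement unit) data, wheel data, and ground-truth poses. Compared with the two-frame BEV-ODOM baseline, Astra-Local's odometry head showed clear advantages in multi-sensor fusion and pose estimation. Integrating IMU data dramatically improved rotation estimation and reduced the overall trajectory error to approximately 2%. Adding wheel data further improved scale stability and estimation accuracy, demonstrating its multi-sensor fusion capability.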

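Astra holds significant promise for future development and applications. Its deployment can be extended to more complex indoor environments such as large shopping malls, hospitals, and libraries, where it could assist with tasks such as locating products precisely, delivering medical supplies efficiently, and organizing books.

However, there is room for improvement. For Astra-Global, the current map representation balances information loss against token length, but it can occasionally lack critical semantic details. Future work will focus on alternative map compression methods that optimize efficiency while retaining as much semantic information as possible. In addition, the current single-frame localization can fail in feature-scarce or highly repetitive environments; future plans include active exploration mechanisms and temporal reasoning for more robust localization.

For Astra-Local, improving robustness to out-of-distribution (OOD) scenarios is crucial and requires stronger model architectures and training methods. A redesign of the fallback system, for tighter integration and seamless switching, is also planned to improve system stability. Furthermore, integrating instruction-following capabilities will let robots understand and execute natural-language commands, expanding their usability in dynamic, human-centric environments and fostering more natural human-robot interaction.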

This article, "ByteDance Introduces Astra: A Dual-Model Architecture for Autonomous Robot Navigation", first appeared on Synced.

Original text

The increasing integration of robots across various sectors, from industrial manufacturing to daily life, highlights a growing need for advanced navigation systems. However, contemporary robot navigation systems face significant challenges in diverse and complex indoor environments, exposing the limitations of traditional approaches. Addressing the fundamental questions of “Where am I?”, “Where am I going?”, and “How do I get there?”, ByteDance has developed Astra, an innovative dual-model architecture designed to overcome these traditional navigation bottlenecks and enable general-purpose mobile robots.

Traditional navigation systems typically consist of multiple, smaller, and often rule-based modules to handle the core challenges of target localization, self-localization, and path planning. Target localization involves understanding natural language or image cues to pinpoint a destination on a map. Self-localization requires a robot to determine its precise position within a map, especially challenging in repetitive environments like warehouses where traditional methods often rely on artificial landmarks (e.g., QR codes). Path planning further divides into global planning for rough route generation and local planning for real-time obstacle avoidance and reaching intermediate waypoints.

While foundation models have shown promise in integrating smaller models to tackle broader tasks, the optimal number of models and their effective integration for comprehensive navigation remained an open question.

ByteDance’s Astra, detailed in their paper “Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning” (website: https://astra-mobility.github.io/), addresses these limitations. Following the System 1/System 2 paradigm, Astra features two primary sub-models: Astra-Global and Astra-Local. Astra-Global handles low-frequency tasks like target and self-localization, while Astra-Local manages high-frequency tasks such as local path planning and odometry estimation. This architecture promises to revolutionize how robots navigate complex indoor spaces.

Astra-Global: The Intelligent Brain for Global Localization

Astra-Global serves as the intelligent core of the Astra architecture, responsible for critical low-frequency tasks: self-localization and target localization. It functions as a Multimodal Large Language Model (MLLM), adept at processing both visual and linguistic inputs to achieve precise global positioning within a map. Its strength lies in utilizing a hybrid topological-semantic graph as contextual input, allowing the model to accurately locate positions based on query images or text prompts.

The construction of this robust localization system begins with offline mapping. The research team developed an offline method to build a hybrid topological-semantic graph G=(V,E,L):

V (Nodes): Keyframes, obtained by temporal downsampling of input video and SfM-estimated 6-Degrees-of-Freedom (DoF) camera poses, act as nodes encoding camera poses and landmark references.

E (Edges): Undirected edges establish connectivity based on relative node poses, crucial for global path planning.

L (Landmarks): Semantic landmark information is extracted by Astra-Global from visual data at each node, enriching the map’s semantic understanding. These landmarks store semantic attributes and are connected to multiple nodes via co-visibility relationships.

In practical localization, Astra-Global’s self-localization and target localization capabilities leverage a coarse-to-fine two-stage process for visual-language localization. The coarse stage analyzes input images and localization prompts, detects landmarks, establishes correspondence with a pre-built landmark map, and filters candidates based on visual consistency. The fine stage then uses the query image and coarse output to sample reference map nodes from the offline map, comparing their visual and positional information to directly output the predicted pose.

For language-based target localization, the model interprets natural language instructions, identifies relevant landmarks using their functional descriptions within the map, and then leverages landmark-to-node association mechanisms to locate relevant nodes, retrieving target images and 6-DoF poses.

To empower Astra-Global with robust localization abilities, the team employed a meticulous training methodology. Using Qwen2.5-VL as the backbone, they combined Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). SFT involved diverse datasets for various tasks, including coarse and fine localization, co-visibility detection, and motion trend estimation. In the GRPO phase, a rule-based reward function (including format, landmark extraction, map matching, and extra landmark rewards) was used to train for visual-language localization. Experiments showed GRPO significantly improved Astra-Global’s zero-shot generalization, achieving 99.9% localization accuracy in unseen home environments, surpassing SFT-only methods.

Astra-Local: The Intelligent Assistant for Local Planning

Astra-Local acts as the intelligent assistant for Astra’s high-frequency tasks, a multi-task network capable of efficiently generating local paths and accurately estimating odometry from sensor data. Its architecture comprises three core components: a 4D spatio-temporal encoder, a planning head, and an odometry head.

The 4D spatio-temporal encoder replaces traditional mobile stack perception and prediction modules. It begins with a 3D spatial encoder that processes N omnidirectional images through a Vision Transformer (ViT) and Lift-Splat-Shoot to convert 2D image features into 3D voxel features. This 3D encoder is trained using self-supervised learning via 3D volumetric differentiable neural rendering. The 4D spatio-temporal encoder then builds upon the 3D encoder, taking past voxel features and future timestamps as input to predict future voxel features through ResNet and DiT modules, providing current and future environmental representations for planning and odometry.

The planning head, based on pre-trained 4D features, robot speed, and task information, generates executable trajectories using Transformer-based flow matching. To prevent collisions, the planning head incorporates a masked ESDF loss (Euclidean Signed Distance Field). This loss calculates the ESDF of a 3D occupancy map and applies a 2D ground truth trajectory mask, significantly reducing collision rates. Experiments demonstrate its superior performance in collision rate and overall score on out-of-distribution (OOD) datasets compared to other methods.

The odometry head predicts the robot’s relative pose using current and past 4D features and additional sensor data (e.g., IMU, wheel data). It trains a Transformer model to fuse information from different sensors. Each sensor modality is processed by a specific tokenizer and combined with modality embeddings and temporal positional embeddings; the resulting tokens are fed into a Transformer encoder, and a CLS token is finally used to predict the relative pose. Experiments showed the odometry head’s excellent performance in multi-sensor fusion and pose estimation, significantly improving rotational accuracy and reducing overall trajectory error.

Experimental Validation

Extensive experiments were conducted in diverse indoor environments (warehouses, offices, homes) to comprehensively evaluate Astra’s performance.

Astra-Global’s multimodal localization capabilities were validated through various experiments, demonstrating superior performance in handling text and image localization queries. For target localization, it accurately identifies matching images and poses based on text commands (e.g., “find the resting area”). Compared to traditional Visual Place Recognition (VPR) methods, Astra-Global exhibits significant advantages in:

Detail Capture: Unlike VPR’s reliance on global features, Astra-Global precisely captures fine details like room numbers, preventing localization errors in similar scenes.

Viewpoint Robustness: Based on semantic landmarks, Astra-Global maintains stable localization even with large camera angle changes, where VPR methods typically fail.

Pose Accuracy: Astra-Global leverages landmark spatial relationships to select the best matching pose, showing significantly higher pose accuracy (within 1-meter distance error and 5-degree angular error) than traditional VPR, with over 30% improvement in warehouse environments.

Astra-Local’s planning and odometry heads were thoroughly evaluated. The planning head, using Transformer-based flow matching and masked ESDF loss, outperformed methods like ACT and diffusion policies in collision rate, speed, and overall score on OOD datasets. This highlights the masked ESDF loss’s effectiveness in mitigating collision risks.

The odometry head’s performance was assessed on multimodal datasets including synchronized image sequences, IMU, wheel data, and ground truth poses. Compared to two-frame BEV-ODOM baselines, Astra-Local’s odometry head showed significant advantages in multi-sensor fusion and pose estimation. Integrating IMU data dramatically improved rotational estimation accuracy, reducing overall trajectory error to approximately 2%. Further inclusion of wheel data enhanced scale stability and estimation accuracy, validating its superior multi-sensor data fusion capabilities.

Astra holds significant promise for future development and applications. Its deployment can be expanded to more complex indoor environments like large shopping malls, hospitals, and libraries, where it can assist in tasks such as precise product location, efficient medical supply delivery, and book organization.

However, areas for improvement exist. For Astra-Global, while current map representations balance information loss and token length, they may occasionally lack critical semantic details. Future work will focus on alternative map compression methods to optimize efficiency while maximizing semantic information retention. Additionally, current single-frame localization can fail in feature-scarce or highly repetitive environments; future plans include active exploration mechanisms and temporal reasoning for more robust localization.

For Astra-Local, improving robustness to out-of-distribution (OOD) scenarios is crucial, requiring enhanced model architectures and training methods. Redesigning the fallback system for tighter integration and seamless switching is also planned to improve system stability. Furthermore, integrating instruction-following capabilities will enable robots to understand and execute natural language commands, expanding their usability in dynamic, human-centric environments and fostering more natural human-robot interaction.

The post ByteDance Introduces Astra: A Dual-Model Architecture for Autonomous Robot Navigation first appeared on Synced.
