NVIDIA Developer Blog · June 1, 2026, 13:49 · About an 11-minute read

How to Post-Train Autonomous Driving Models in Closed Loop with NVIDIA Alpamayo

#Autonomous Vehicles #Vision-Language-Action (VLA) #Reinforcement Learning #Simulation #NVIDIA Alpamayo

TL;DR

NVIDIA describes a closed-loop post-training method that uses AlpaGym, part of Alpamayo, to close the gap between training and real-world deployment of autonomous driving models.

AI in-depth analysis · June 11, 2026, 23:11

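Key points

The need to move from open loop to closed loop

Conventional VLA models are trained in open loop, without accounting for their effect on the environment. Because errors can accumulate in deployment, closed-loop training is essential.

The role of NVIDIA Alpamayo and AlpaGym

Alpamayo is an open portfolio of AI models and simulation frameworks. Within it, AlpaGym provides a closed-loop environment that feeds simulator feedback directly into the training loop.

A concrete implementation process

The article walks through installing and configuring AlpaGym, defining a closed-loop reward, running training, and exporting a checkpoint for downstream tasks.

From simulation as evaluation to simulation as training

AlpaGym does not treat simulation only as a final evaluation stage. It converts rollout data into training experience, which supports the development of more realistic driving policies.

End-to-end post-training workflow with AlpaGym

The figure shows the complete process of post-training a driving model such as NVIDIA Alpamayo with reinforcement learning in AlpaGym, covering data collection, closed-loop simulation, the driving model, policy learning, and orchestration.

Closed-loop simulation for efficient learning

AlpaGym operates in closed loop, where the model's output feeds back into the next environment state, so models can be improved on more realistic driving scenarios.

Why closed-loop learning matters

In closed-loop training the model learns from the consequences of its own actions. Braking and steering decisions propagate into the state of the environment, which exposes failure modes that static datasets tend to miss.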


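Impact analysis

The article points to a shift in autonomous driving development from "evaluating in simulation" to "learning in simulation". It argues that closed-loop training is essential for making models reliable in complex real-world conditions, and it gives developers concrete tools and steps for building more practical driving policies, contributing to development efficiency and safety across the industry.

Editorial comment

The article presents a concrete way to close the gap between simulation and deployment in autonomous driving development, and it is valuable for both its technical depth and its practicality. AlpaGym's closed-loop capability in particular is likely to play an important role in validating model safety.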

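Developing autonomous vehicle (AV) policies requires bridging an important gap between training and deployment. Vision-language-action (VLA) models, which can reason over more complex driving scenes and produce richer intermediate reasoning, are predominantly trained in open loop: model outputs are compared directly to ground-truth behaviors without considering their effect on the environment.

In deployment, however, a driving policy runs in closed loop, where every braking, steering, and navigation decision affects the environment, and small errors can compound over time.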
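To make the compounding effect concrete, here is a minimal Python sketch. It is a toy model and not part of Alpamayo: a hypothetical policy adds a small constant heading bias (`bias_rad`) at every step, and the vehicle speed and time step are arbitrary assumptions. In open loop each prediction starts from the logged state, so the per-step error stays constant. In closed loop the vehicle follows its own output, so the lateral offset keeps growing.

```python
import math


def open_loop_errors(steps: int, bias_rad: float) -> list[float]:
    """Per-step heading error when every prediction starts from the logged state."""
    # The policy is re-anchored to ground truth each step, so the error never grows.
    return [abs(bias_rad) for _ in range(steps)]


def closed_loop_lateral_offsets(
    steps: int, bias_rad: float, speed_mps: float = 10.0, dt: float = 0.1
) -> list[float]:
    """Lateral offset in meters when the biased output is fed back into the state."""
    heading = 0.0
    lateral = 0.0
    offsets = []
    for _ in range(steps):
        heading += bias_rad  # the biased steering output changes the next state
        lateral += speed_mps * dt * math.sin(heading)
        offsets.append(lateral)
    return offsets


if __name__ == "__main__":
    print("open-loop error at last step:", open_loop_errors(50, 0.002)[-1])
    print("closed-loop offset at last step:", closed_loop_lateral_offsets(50, 0.002)[-1])
```

An open-loop metric would report the same small error at every step, while the closed-loop rollout drifts further from the lane center the longer it runs.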

NVIDIA Alpamayo provides a systematic way to address this challenge. It is an open portfolio of AI models, simulation frameworks, and physical AI datasets for AV development. Alpamayo includes the AlpaSim AV simulation platform and the AlpaGym closed-loop training framework (coming soon).

This post explains how to train AV models in closed loop with NVIDIA Alpamayo. Specifically, it walks through how to:

Install and configure AlpaGym
Define closed-loop rewards
Launch closed-loop training
Export the post-trained checkpoint for downstream use

Closed-loop post-training with AlpaGym extends AV training workflows by turning AlpaSim rollouts into training experience. Rather than treating simulation only as a final evaluation stage, AlpaGym connects simulator feedback directly to the policy training loop.

Figure 1. End-to-end workflow for post-training a driving model such as Alpamayo using AlpaGym

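How to use AlpaGym for closed-loop reinforcement learning

Reinforcement learning (RL) can be used to improve a policy that was initially trained in open loop. Instead of optimizing only against logged expert trajectories, the model learns from the consequences of its own actions in simulation.

This shift is critical for AV development, where small prediction or planning errors can compound over time. In closed-loop training, each braking, steering, and navigation decision affects the next state of the environment, revealing failure modes that static datasets or open-loop evaluation may miss.

However, enabling closed-loop RL comes with its own challenges. Running model inference, stepping the simulation, training the model, syncing weight updates, communicating across instances, and moving data, all in parallel, is complex. It requires robust yet flexible orchestration and efficient use of compute resources.

Figure 2. AlpaGym enables large-scale closed-loop training, where driving models learn from the consequences of their own actions across a wide variety of simulated scenarios, greatly reducing the difference between training and deployment

To address these challenges, AlpaGym connects policy training to AlpaSim closed-loop rollouts and provides an open source, high-throughput framework for closed-loop RL. The system combines AlpaSim simulator microservices, NVIDIA Physical AI Open Datasets, and the distributed NVIDIA Cosmos-RL training framework into a scalable post-training pipeline.

Built to scale from a single GPU to multi-node GPU clusters, AlpaGym supports efficient large-scale training through an asynchronous and stable distributed RL pipeline, without requiring changes to user code. It integrates AlpaSim and Cosmos-RL as its runtime and orchestration layer, uses GRPO as the default algorithm, and includes reference reward functions tested with Alpamayo models and the Physical AI AV NuRec dataset.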
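As a rough illustration of the idea behind GRPO (group relative policy optimization), the sketch below computes group-relative advantages: the rewards of several rollouts of the same scene are normalized by that group's mean and standard deviation. This is the generic textbook formulation, not AlpaGym or Cosmos-RL code. The function name, the `eps` stabilizer, and the sample rewards are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against the other rollouts in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps keeps the division defined when every rollout earned the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical closed-loop rollouts of one scene; the third one collided.
    for adv in group_relative_advantages([1.2, 0.4, -9.0, 0.9]):
        print(round(adv, 3))
```

Rollouts that did better than their group get a positive advantage and the collided rollout gets a strongly negative one, so no separate value network is needed to provide a baseline.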

To get started with AlpaGym post-training, follow the steps below.

Step 1: Install and configure AlpaGym

To install AlpaGym from the Alpamayo checkout, install the native CUDA dependencies and Redis on the host, then sync the uv workspace:

    sudo apt-get update
    sudo apt-get install -y libcudnn9-dev-cuda-12 \
        libnccl-dev=2.26.2-1+cuda12.8 libnccl2=2.26.2-1+cuda12.8 \
        redis-server git-lfs
    git lfs install
    git lfs pull
    huggingface-cli login
    # Or: export HF_TOKEN=...
    uv sync --all-packages

The Python environment is managed by uv, but cuDNN, NCCL, and the redis-server binary are host dependencies used by the CUDA model stack and Cosmos-RL. Alternatively, a suitable Dockerfile is provided. Hugging Face authentication is required to download the scene artifacts.

An AlpaGym run is defined by a Hydra configuration. It specifies the policy checkpoint, the AlpaSim scene set, rollout parallelism, the reward function, and the Cosmos-RL training parameters. In this workflow, the starting checkpoint is an Alpamayo model.

Figure 3. In AlpaGym closed-loop post-training, the host process starts AlpaSim, rollout workers expose policy drivers, AlpaSim executes simulator sessions, and AlpaGym returns rollout artifacts and rewards to the trainer

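Step 2: Define the closed-loop reward

The reward should match the behavior you want to improve in closed loop. For trajectory-quality post-training, common reward terms include progress, lane keeping, collision avoidance, offroad rate, comfort, and distance to a reference trajectory.

A practical first reward is intentionally simple: combine progress with penalties for safety-critical failures. In AlpaGym, this can be expressed as a small sum of terms, using AlpaSim metrics where possible:

    # reward/progress_safety.yaml
    terms:
      - kind: metric
        metric_name: progress
        scale: 1.0
      - kind: metric
        metric_name: collision_any
        scale: -10.0
      - kind: metric
        metric_name: offroad
        scale: -5.0

Once the pipeline is stable, add more targeted terms for the failure modes observed in AlpaSim videos and metrics.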
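The following Python sketch shows one plausible reading of how such a sum of scaled metric terms turns per-episode metrics into a scalar reward. It is not AlpaGym's implementation. The `MetricTerm` class, the `episode_reward` function, the treatment of a missing metric as 0, and the value ranges (progress as a fraction, `collision_any` and `offroad` as 0 or 1 flags) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricTerm:
    metric_name: str
    scale: float


# Mirrors the three terms in reward/progress_safety.yaml.
PROGRESS_SAFETY = [
    MetricTerm("progress", 1.0),
    MetricTerm("collision_any", -10.0),
    MetricTerm("offroad", -5.0),
]


def episode_reward(metrics: dict[str, float], terms: list[MetricTerm]) -> float:
    """Weighted sum of per-episode metrics; a missing metric contributes 0 (assumption)."""
    return sum(term.scale * metrics.get(term.metric_name, 0.0) for term in terms)


if __name__ == "__main__":
    clean = {"progress": 0.8, "collision_any": 0.0, "offroad": 0.0}
    crashed = {"progress": 0.8, "collision_any": 1.0, "offroad": 0.0}
    print(episode_reward(clean, PROGRESS_SAFETY))
    print(episode_reward(crashed, PROGRESS_SAFETY))
```

With these scales, a single collision outweighs any amount of progress in the episode, which is the intent of a safety-first starting reward.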

Step 3: Launch closed-loop post-training

Start AlpaGym training from your model checkpoint. Alpamayo serves as the example model here.

    uv run -m alpagym_host.cli \
        policy=alpamayo \
        policy.model.kind=alpamayo_r1 \
        policy.model.path=/path/to/checkpoint \
        reward=progress_safety

This brings up AlpaGym with AlpaSim on a single GPU. Detailed instructions for using your own AV model are to follow.
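The `key=value` arguments in the launch command follow Hydra's override style, where a dotted key addresses a nested field in the configuration. The Python sketch below only illustrates that mapping. It is not Hydra's parser and not part of AlpaGym, and the rule "a key without a dot selects a config group" is a simplification that happens to hold for this particular command.

```python
def split_overrides(args: list[str]) -> tuple[dict[str, str], dict]:
    """Separate group selections (no dot in the key) from nested value overrides."""
    groups: dict[str, str] = {}
    values: dict = {}
    for arg in args:
        key, _, value = arg.partition("=")
        if "." not in key:
            groups[key] = value  # e.g. policy=alpamayo picks a config group option
            continue
        node = values
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})  # walk or create the nested dicts
        node[leaf] = value
    return groups, values


if __name__ == "__main__":
    groups, values = split_overrides([
        "policy=alpamayo",
        "policy.model.kind=alpamayo_r1",
        "policy.model.path=/path/to/checkpoint",
        "reward=progress_safety",
    ])
    print(groups)
    print(values)
```

Read this way, `policy=alpamayo` and `reward=progress_safety` choose which configuration files to load, while the two `policy.model.*` arguments overwrite individual fields inside the selected policy configuration.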

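During training, AlpaGym requests scene rollouts from AlpaSim, collects per-episode artifacts, computes rewards, and updates the policy. Useful training signals include mean reward, reward variance, failure rates, policy loss, rollout throughput, and how far the generated rollouts lag behind the latest policy weights.

In this recipe, these rollout artifacts and training signals are the primary outputs of the post-training run. They help you confirm that closed-loop learning is running correctly and select checkpoints for downstream evaluation on your own held-out AlpaSim scenario suites.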
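Several of these signals can be summarized from per-episode records with a few lines of Python. The sketch below is illustrative only: the `Episode` fields, the definition of failure, and the notion of a policy version counter are assumptions, not AlpaGym's actual artifact schema.

```python
from dataclasses import dataclass
from statistics import mean, pvariance


@dataclass(frozen=True)
class Episode:
    reward: float
    failed: bool         # e.g. collision or off-road, however the run defines failure
    policy_version: int  # version of the weights that generated this rollout


def summarize(episodes: list[Episode], latest_version: int) -> dict[str, float]:
    """Aggregate a batch of episodes into a few monitoring signals."""
    rewards = [e.reward for e in episodes]
    staleness = [latest_version - e.policy_version for e in episodes]
    return {
        "mean_reward": mean(rewards),
        "reward_variance": pvariance(rewards),
        "failure_rate": sum(e.failed for e in episodes) / len(episodes),
        # Average number of weight updates by which the rollouts lag the trainer.
        "mean_staleness": mean(staleness),
    }


if __name__ == "__main__":
    batch = [Episode(1.0, False, 3), Episode(-9.0, True, 2)]
    print(summarize(batch, latest_version=3))
```

A reward variance close to zero or a steadily rising staleness would be early hints that the group-based updates carry little signal or that rollout generation is falling behind training.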

Step 4: Export the post-trained checkpoint

After training, place the AlpaGym-produced checkpoint and config files in a folder that the AlpaSim driver can access (your Hugging Face model cache, for example). Then create a new driver config that points to that folder (called alpamayo1_CLRL here). The following excerpt shows what to edit in a driver YAML config to specify a custom path. This makes the AlpaGym post-trained policy runnable inside AlpaSim for closed-loop rollouts.

    ...
    model:
      model_type: alpamayo1
      checkpoint_path: "/root/.cache/huggingface/alpasim_models/alpamayo1_CLRL/step_NNNNNN"
      device: "cuda"
    ...
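The `step_NNNNNN` placeholder in the checkpoint path suggests that checkpoints are stored in one directory per training step. Assuming that layout, which may differ from what AlpaGym actually writes, a small hypothetical helper like the one below could pick the most recent step directory to paste into `checkpoint_path`. The function name and the regular expression are assumptions.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: "step_" followed by digits, as in the path placeholder.
STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_step_dir(run_dir: Path) -> Optional[Path]:
    """Return the step_<digits> subdirectory with the highest step number, or None."""
    candidates = []
    for child in run_dir.iterdir():
        match = STEP_DIR.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    print(latest_step_dir(Path(".")))
```

Comparing the parsed step numbers rather than the directory names keeps the ordering correct even if the step counter is not zero-padded to a fixed width.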

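Next, run the exported model on a representative scenario to verify that the policy, driver, and simulation loop are connected correctly. At this stage, you can inspect how the policy behaves when its own actions affect the next state of the environment.

    uv run alpasim_wizard deploy=local topology=1gpu \
        driver=alpamayo1_CLRL wizard.log_dir=$PWD/tutorial_alpamayo_CLRL \
        'scenes.scene_ids=[clipgt-9ea70552-6dcb-4ee8-a368-9a906a333f6e]'

A closed-loop rollout provides useful qualitative signals: whether the model produces stable trajectories and remains within the drivable area, how it reacts to nearby traffic agents, and which failure modes should be targeted during post-training.

Video 1. AlpaSim closed-loop rollout of an AV model, including rendered camera views, the predicted trajectory, and rollout-level diagnostics

With this checkpoint, teams can inspect rollout videos, per-episode metrics, reward traces, and failure cases collected during training. These artifacts are useful for debugging reward design, checking rollout stability, and selecting checkpoints for later held-out evaluation in AlpaSim.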

Get started post-training AV models

Closed-loop post-training provides a practical path for iterating on end-to-end driving policies. Here, AlpaGym uses closed-loop rollouts to post-train AV policies in simulation, enabling them to learn from the consequences of their actions.

You can use these tools together with the other components of the NVIDIA Alpamayo Open Platform to develop reasoning models that can be run, inspected, and post-trained in a closed-loop simulation workflow. Extend the same recipe with your own rewards, scenarios, and evaluation suites.

Ready to get started? Check out the NVlabs/alpamayo-recipes GitHub repo to adapt the recipe in this post for your own use cases.

To evaluate your model on a public leaderboard, see the two open AV challenges NVIDIA launched at CVPR 2026:

AlpaSim Closed-Loop E2E Driving Challenge

Physical AI AV Reasoning Challenge

To learn more, see Expanding the Alpamayo Open Platform for Developing Reasoning AVs Across Models, Data, and Simulation.

Join NVIDIA founder and CEO Jensen Huang for the NVIDIA GTC Taipei 2026 keynote and dive deeper with related sessions.

