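Hugging Face Blog · February 24, 2026, 09:00 · about 6 min read

Deploying Open Source Vision Language Models (VLM) on Jetson

#Vision-Language Model #Edge AI #Open Source #Robotics

TL;DR

How to deploy open source vision-language models efficiently on the Jetson platform.

AI deep analysis · February 24, 2026, 10:40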

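Key points

Provides a concrete tutorial for running an open source Vision-Language Model (VLM) on NVIDIA Jetson edge devices

Explains how to deploy the NVIDIA Cosmos Reasoning 2B model with the vLLM framework and Docker containers

Connects the model to Live VLM WebUI to build a real-time, interactive AI application driven by a webcam

Covers Jetson AGX Thor, AGX Orin, and Orin Nano Super, using an FP8-quantized checkpoint to fit within each device's memory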

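Impact analysis

The article offers a practical roadmap for running advanced multimodal AI on edge devices and could accelerate development of robotics and physical AI applications. Because the method is built on official NVIDIA resources, it lowers the barrier to industrial adoption and helps move VLMs into real-world deployments.

Editor's comment

A ready-to-use implementation guide for edge AI developers. As an official tutorial built on the NVIDIA ecosystem, it is more trustworthy and reproducible, and it lays out a clear path to industrial use.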

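By Mitesh Patel, Johnny Nuñez Cano, and Raymond Lo (NVIDIA)

Vision-Language Models (VLMs) mark a significant leap in AI by blending visual perception with semantic reasoning. Moving beyond traditional models constrained by fixed labels, VLMs use a joint embedding space to interpret and discuss complex, open-ended environments in natural language.

Rapid gains in reasoning accuracy and efficiency have made these models practical on edge devices. The NVIDIA Jetson family, ranging from the high-performance AGX Thor and AGX Orin to the compact Orin Nano Super, is purpose-built to drive accelerated applications for physical AI and robotics, and provides the optimized runtime that leading open source models need.

In this tutorial, we demonstrate how to deploy the NVIDIA Cosmos Reasoning 2B model across the Jetson lineup using the vLLM framework. We also walk through connecting the model to Live VLM WebUI, which provides a real-time, webcam-based interface for interactive physical AI.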

Supported devices:

Jetson AGX Thor Developer Kit

Jetson AGX Orin (64GB / 32GB)

Jetson Orin Nano Super

JetPack version:

JetPack 6 (L4T r36.x) for Orin devices

JetPack 7 (L4T r38.x) for Thor

Storage: NVMe SSD required

~5 GB for the FP8 model weights

~8 GB for the vLLM container image

Account: create a free NVIDIA NGC account to download both the model and the vLLM container

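Configuration by device:

Jetson AGX Thor: vLLM container nvcr.io/nvidia/vllm:26.01-py3; model FP8 via NGC (volume mount); max model length 8192 tokens; GPU memory utilization 0.8

Jetson AGX Orin: vLLM container ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04; model FP8 via NGC (volume mount); max model length 8192 tokens; GPU memory utilization 0.8

Jetson Orin Nano Super: vLLM container ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04; model FP8 via NGC (volume mount); max model length 256 tokens (memory-constrained); GPU memory utilization 0.65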

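The workflow is the same on every device:

Download the FP8 model checkpoint via the NGC CLI

Pull the vLLM Docker image for your device

Launch the container with the model mounted as a volume

Connect Live VLM WebUI to the vLLM endpoint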

Step 1: Install the NGC CLI

The NGC CLI lets you download model checkpoints from the NVIDIA NGC Catalog.

Download and install

    mkdir -p ~/Projects/CosmosReasoning
    cd ~/Projects/CosmosReasoning

    # Download the NGC CLI for ARM64
    # Get the latest installer URL from: https://org.ngc.nvidia.com/setup/installers/cli
    wget -O ngccli_arm64.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.13.0/files/ngccli_arm64.zip
    unzip ngccli_arm64.zip
    chmod u+x ngc-cli/ngc

    # Add to PATH
    export PATH="$PATH:$(pwd)/ngc-cli"

Configure the CLI

Run ngc config set. You will be prompted for:

API Key: generate one at NGC API Key setup

CLI output format: choose json

org: press Enter to accept the default

Step 2: Download the Model

Download the FP8 quantized checkpoint. The same checkpoint is used on all Jetson devices:

    cd ~/Projects/CosmosReasoning
    ngc registry model download-version "nim/nvidia/cosmos-reason2-2b:1208-fp8-static-kv8"

This creates a directory called cosmos-reason2-2b_v1208-fp8-static-kv8/.

Step 3: Pull the vLLM Docker Image

For Jetson AGX Thor

    docker pull nvcr.io/nvidia/vllm:26.01-py3

For Jetson AGX Orin / Orin Nano Super

    docker pull ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04

Step 4: Serve Cosmos Reasoning 2B with vLLM

Option A: Jetson AGX Thor

Thor has ample GPU memory and can run the model with a generous context length.

Set the path to your downloaded model and free cached memory on the host:

    MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
    sudo sysctl -w vm.drop_caches=3

Launch the container with the model mounted:

    docker run --rm -it \
      --runtime nvidia \
      --network host \
      --ipc host \
      -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
      -e NVIDIA_VISIBLE_DEVICES=all \
      -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
      nvcr.io/nvidia/vllm:26.01-py3 \
      bash

Inside the container, serve the model:

    vllm serve /models/cosmos-reason2-2b \
      --max-model-len 8192 \
      --media-io-kwargs '{"video": {"num_frames": -1}}' \
      --reasoning-parser qwen3 \
      --gpu-memory-utilization 0.8

Note: --reasoning-parser qwen3 selects the parser vLLM uses to separate the model's reasoning from its final answer. --media-io-kwargs passes options to vLLM's media loading; here it sets how many video frames are read.

Wait until you see:

    INFO: Uvicorn running on http://0.0.0.0:8000
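With a reasoning parser enabled, the server can return the model's reasoning separately from its answer. The Python sketch below (standard library only) shows one way a client might pull the two apart from a chat completion response. The field names reasoning_content and reasoning are assumptions about the response shape, which may differ between vLLM releases, and split_reasoning is a hypothetical helper, not part of vLLM.

```python
def split_reasoning(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from an OpenAI-style chat completion dict.

    Assumption: the reasoning text, when present, sits on the message under
    "reasoning_content" or "reasoning". If neither key exists, reasoning is "".
    """
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    answer = message.get("content") or ""
    return reasoning.strip(), answer.strip()


# A hand-written sample response for illustration, not real model output.
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "The frame shows a desk with a laptop.",
                "content": "I see a laptop on a desk.",
            }
        }
    ]
}
```

Checking which of the two keys your server actually emits (for example by printing one raw response) is worth doing before relying on either.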

Option B: Jetson AGX Orin

AGX Orin has enough memory to run the model with the same generous parameters as Thor.

Set the path to your downloaded model and free cached memory on the host:

    MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
    sudo sysctl -w vm.drop_caches=3

Launch the container:

    docker run --rm -it \
      --runtime nvidia \
      --network host \
      -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
      -e NVIDIA_VISIBLE_DEVICES=all \
      -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
      ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 \
      bash

Inside the container, activate the environment and serve:

    cd /opt/
    source venv/bin/activate
    vllm serve /models/cosmos-reason2-2b \
      --max-model-len 8192 \
      --media-io-kwargs '{"video": {"num_frames": -1}}' \
      --reasoning-parser qwen3 \
      --gpu-memory-utilization 0.8

Wait until you see:

    INFO: Uvicorn running on http://0.0.0.0:8000

Option C: Jetson Orin Nano Super (memory-constrained)

The Orin Nano Super has significantly less RAM, so it needs aggressive memory optimization flags.

Set the path to your downloaded model and free cached memory on the host:

    MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
    sudo sysctl -w vm.drop_caches=3

Launch the container from the Orin image pulled in Step 3, with the model mounted at /models/cosmos-reason2-2b as in Option B.

Inside the container, activate the environment and serve:

    cd /opt/
    source venv/bin/activate
    vllm serve /models/cosmos-reason2-2b \
      --host 0.0.0.0 \
      --port 8000 \
      --trust-remote-code \
      --enforce-eager \
      --max-model-len 256 \
      --max-num-batched-tokens 256 \
      --gpu-memory-utilization 0.65 \
      --max-num-seqs 1 \
      --enable-chunked-prefill \
      --limit-mm-per-prompt '{"image":1,"video":1}' \
      --mm-processor-kwargs '{"num_frames":2,"max_pixels":150528}'

Key flags explained (Orin Nano Super only):

--enforce-eager: disables CUDA graphs to save memory

--max-model-len 256: limits the context so it fits in available memory

--max-num-batched-tokens 256: matches the model length limit

--gpu-memory-utilization 0.65: reserves headroom for system processes

--max-num-seqs 1: serves a single request at a time to minimize memory

--enable-chunked-prefill: processes prefill in chunks for memory efficiency

--limit-mm-per-prompt: limits each prompt to 1 image and 1 video

--mm-processor-kwargs: reduces video frames and image resolution

--VLLM_SKIP_WARMUP=true: skips warmup to save time and memory

Wait until you see that the server is ready:

    INFO: Uvicorn running on http://0.0.0.0:8000
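The three options differ mainly in a handful of vllm serve arguments. As a compact summary, the Python sketch below assembles the command line for each device from the values used in this tutorial. The PROFILES table, the profile names, and the two helper functions are hypothetical names introduced here for illustration; the flag values are copied from Options A, B, and C.

```python
import shlex

MODEL_DIR = "/models/cosmos-reason2-2b"  # mount point used in the docker run commands

# Per-device arguments, copied from the serve commands in this tutorial.
PROFILES = {
    "agx-thor": [
        "--max-model-len", "8192",
        "--media-io-kwargs", '{"video": {"num_frames": -1}}',
        "--reasoning-parser", "qwen3",
        "--gpu-memory-utilization", "0.8",
    ],
    "orin-nano-super": [
        "--host", "0.0.0.0",
        "--port", "8000",
        "--trust-remote-code",
        "--enforce-eager",
        "--max-model-len", "256",
        "--max-num-batched-tokens", "256",
        "--gpu-memory-utilization", "0.65",
        "--max-num-seqs", "1",
        "--enable-chunked-prefill",
        "--limit-mm-per-prompt", '{"image":1,"video":1}',
        "--mm-processor-kwargs", '{"num_frames":2,"max_pixels":150528}',
    ],
}
# AGX Orin uses the same serve arguments as Thor.
PROFILES["agx-orin"] = PROFILES["agx-thor"]


def build_serve_command(device: str) -> list[str]:
    """Return the vllm serve argument list for a device profile."""
    if device not in PROFILES:
        raise ValueError(f"unknown device {device!r}; expected one of {sorted(PROFILES)}")
    return ["vllm", "serve", MODEL_DIR, *PROFILES[device]]


def as_shell(device: str) -> str:
    """Render the command as a single shell-quoted line."""
    return shlex.join(build_serve_command(device))
```

Keeping the arguments as a list rather than one string avoids quoting mistakes around the JSON-valued flags when the command is passed to a process launcher.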

Verify the server is running

From another terminal on the Jetson:

    curl http://localhost:8000/v1/models

You should see the model listed in the response.
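When the deployment is scripted, the same readiness check can be automated by polling the endpoint. The Python sketch below uses only the standard library. The function wait_for_models, its timeout defaults, and the injectable fetch and sleep parameters are illustrative choices made here, not part of vLLM, and the code assumes the endpoint returns the OpenAI-style list shape {"data": [{"id": ...}]}.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

MODELS_URL = "http://localhost:8000/v1/models"


def http_get_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_for_models(
    url: str = MODELS_URL,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    fetch: Callable[[str], dict] = http_get_json,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[list[str]]:
    """Poll until the server lists at least one model; return the model ids.

    Returns None if the deadline passes first. Connection errors are treated
    as "not ready yet", since the port is closed while vLLM is still loading.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            ids = [m["id"] for m in fetch(url).get("data", [])]
            if ids:
                return ids
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass
        if time.monotonic() >= deadline:
            return None
        sleep(interval_s)
```

Calling wait_for_models() with no arguments polls the local server for up to ten minutes; the returned ids are the names to use in later API requests.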

Step 5: Test with a Quick API Call

Before connecting the WebUI, verify that the model responds correctly:

    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/models/cosmos-reason2-2b",
        "messages": [
          {
            "role": "user",
            "content": "What capabilities do you have?"
          }
        ],
        "max_tokens": 128
      }' | python3 -m json.tool

Tip: The model name used in the API request must match what vLLM reports. Verify with curl http://localhost:8000/v1/models
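The text-only request does not exercise the vision path. A minimal Python sketch of an image request follows, assuming the server accepts OpenAI-style image_url content parts carrying a base64 data URL. The helper names are hypothetical, and the file name frame.jpg in the usage note is a placeholder.

```python
import base64
import json
import urllib.request

CHAT_URL = "http://localhost:8000/v1/chat/completions"


def first_model_id(models_response: dict) -> str:
    """Pick the model id that the server reports at /v1/models."""
    return models_response["data"][0]["id"]


def build_image_request(model: str, image_bytes: bytes, question: str,
                        max_tokens: int = 128, mime: str = "image/jpeg") -> dict:
    """Build a chat completion payload with one image part and one text part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": max_tokens,
    }


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```

A typical call reads the image bytes from frame.jpg, builds the payload with build_image_request using the id reported by /v1/models, and sends it with post_json(CHAT_URL, payload). On the 256-token Orin Nano Super profile, keep the question short, since the image and the reply share the same context.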

Step 6: Connect to Live VLM WebUI

Live VLM WebUI provides a real-time webcam-to-VLM interface. With vLLM serving Cosmos Reasoning 2B, you can stream your webcam and get live AI analysis with reasoning.

Install Live VLM WebUI

The easiest method is pip (open another terminal):

    curl -LsSf https://astral.sh/uv/install.sh | sh
    source $HOME/.local/bin/env
    cd ~/Projects/CosmosReasoning
    uv venv .live-vlm --python 3.12
    source .live-vlm/bin/activate
    uv pip install live-vlm-webui
    live-vlm-webui

Alternatively, run it from the repository's container script:

    git clone https://github.com/nvidia-ai-iot/live-vlm-webui.git
    cd live-vlm-webui
    ./scripts/start_container.sh

Configure the WebUI

Open https://localhost:8090

Accept the self-signed certificate (click Advanced → Proceed)

In the VLM API Configuration section on the left sidebar: Set API Base URL to http://localhost:8000/v1

Click the Refresh button to detect the model

Select the Cosmos Reasoning 2B model from the dropdown

Select your camera and click Start

The WebUI now streams your webcam frames to Cosmos Reasoning 2B and displays the model’s analysis in real time.

Recommended WebUI settings for Orin Nano Super

Because the Orin Nano Super configuration runs with a much shorter context length (256 tokens), adjust these settings in the WebUI:

Max Tokens: Set to 100–150 (shorter responses complete faster)

Frame Processing Interval: Set to 60+ (gives the model time between frames)
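The Max Tokens range follows from simple arithmetic: as far as we understand vLLM's request validation, a request's prompt tokens plus its max_tokens must fit within --max-model-len. The Python sketch below works through the budget for the 256-token configuration. The helper prompt_budget is hypothetical, and the figures say nothing about how many tokens a given image consumes, which depends on the model's processor.

```python
def prompt_budget(max_model_len: int, max_tokens: int) -> int:
    """Tokens left for the prompt (text plus image) after reserving the reply."""
    if max_tokens >= max_model_len:
        raise ValueError("max_tokens must be smaller than max_model_len")
    return max_model_len - max_tokens


# Budget for the memory-constrained profile (--max-model-len 256).
for max_tokens in (100, 128, 150):
    print(max_tokens, prompt_budget(256, max_tokens))
```

Raising Max Tokens toward 256 leaves almost nothing for the prompt and the image, so requests are likely to be rejected.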

Troubleshooting

Out of memory on Orin

Problem: vLLM crashes with CUDA out-of-memory errors.

Free system memory before starting:

sudo sysctl -w vm.drop_caches=3

Lower --gpu-memory-utilization

Reduce --max-model-len

Make sure no other GPU-intensive processes are running
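Jetson modules share one pool of memory between the CPU and the GPU, which is why freeing the page cache helps. Before launching vLLM it can be useful to check how much memory is actually available. The Python sketch below parses /proc/meminfo; the helper names are hypothetical and introduced only for this example.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {field: value}, in kB for sized fields."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_gib(meminfo_text: str) -> float:
    """Return MemAvailable converted from kB to GiB."""
    return parse_meminfo(meminfo_text)["MemAvailable"] / (1024 * 1024)


def read_available_gib(path: str = "/proc/meminfo") -> float:
    """Read the live value on a Linux host such as a Jetson."""
    with open(path, encoding="ascii") as f:
        return available_gib(f.read())
```

Calling read_available_gib() before and after sudo sysctl -w vm.drop_caches=3 shows how much memory the cache drop recovered.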

Model not found in WebUI

Problem: The model doesn’t appear in the Live VLM WebUI dropdown.

Verify vLLM is running: curl http://localhost:8000/v1/models

Make sure the WebUI API Base URL is set to http://localhost:8000/v1

If vLLM and WebUI are in separate containers, use http://<jetson-ip>:8000/v1

Slow inference on Orin

Problem: Each response takes a very long time.

This is expected with the memory-constrained configuration: Cosmos Reasoning 2B FP8 on Orin prioritizes fitting in memory over speed.

Reduce max_tokens

Increase the frame interval so the model isn’t constantly processing new frames

vLLM fails to load model

Problem: vLLM reports that the model path doesn’t exist or can’t be loaded.

Verify the NGC download completed successfully: ls ~/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8/

Make sure the volume mount path is correct in your docker run

Check that the model directory is mounted as read-only (:ro)
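The download check can also be scripted. The Python sketch below summarizes a checkpoint directory so that a missing, empty, or partial download stands out. The helper summarize_checkpoint is hypothetical, and it makes no assumption about which files the checkpoint should contain.

```python
from pathlib import Path


def summarize_checkpoint(model_dir: str) -> dict:
    """Report whether the directory exists, how many files it holds, and their total size."""
    root = Path(model_dir).expanduser()
    if not root.is_dir():
        return {"exists": False, "files": 0, "gib": 0.0}
    files = [p for p in root.rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return {"exists": True, "files": len(files), "gib": total_bytes / 1024**3}
```

Running it on ~/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8 should report a total roughly in line with the ~5 GB quoted for the FP8 weights; a much smaller figure suggests an interrupted download.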

In this tutorial, we showed how to deploy the NVIDIA Cosmos Reasoning 2B model on the Jetson family of devices using vLLM.

Combining Cosmos Reasoning 2B’s chain-of-thought capabilities with Live VLM WebUI’s real-time streaming makes this setup well suited to prototyping and evaluating vision AI applications at the edge.

Additional Resources

Cosmos Reasoning 2B on NVIDIA Build: https://build.nvidia.com/nvidia/cosmos-reason2-2b

NGC Model Catalog: https://catalog.ngc.nvidia.com/

Live VLM WebUI: https://github.com/NVIDIA-AI-IOT/live-vlm-webui

vLLM container for Jetson AGX Thor: https://nvcr.io/nvidia/vllm:26.01-py3

vLLM container for Jetson AGX Orin and Orin Nano Super: https://ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04

NGC CLI Installers: https://org.ngc.nvidia.com/setup/installers/cli

Open Models supported on Jetson: https://www.jetson-ai-lab.com/models/

Getting started with Jetson: https://www.jetson-ai-lab.com/tutorials/
