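Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine-Tuning, and On-Device Optimizations
A hands-on guide on the Hugging Face Blog shows that asynchronous inference and hardware optimization are essential for real-time control when deploying VLA models on embedded platforms such as NXP i.MX95.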
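Key points
The need for asynchronous inference
In a synchronous pipeline the arm sits idle while it waits for inference, which leads to oscillatory behavior. Asynchronous inference decouples action generation from execution, and the system must be designed so that end-to-end latency stays shorter than the action execution time.
Recording criteria for high-quality datasets
For tasks such as "put the tea bag in the mug", data quality depends on fixed cameras, controlled lighting, maximized contrast, and rigorous calibration backups that avoid re-recording episodes.
A systems engineering approach
Embedded robotics requires more than model compression: it is a complex systems engineering problem that involves architectural decomposition, latency-aware scheduling, and hardware-aligned execution.
Demonstrated optimization results
The guide fine-tunes ACT and SmolVLA policies and reports the real-time performance that NXP i.MX95 achieves after optimization.
Using a gripper camera
A gripper-mounted camera improves success rates on precise grasping tasks and forces the operator to collect data using only the robot's perception instead of looking at the scene directly.
Hardware improvements for grasping
Simple hardware tweaks, such as wrapping heat-shrink tubing around the gripper claws, increase friction and reduce slippage, which improves task success rate and policy learning stability.
Dataset diversity and splits
Divide the workspace into clusters and record with varied object positions and rotations, hold out a cluster that is absent from the training set for validation to detect overfitting, and record motions that cover a wide range of degrees of freedom.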
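Impact analysis
The article lays out concrete obstacles and solutions for running large multimodal models on resource-constrained edge devices, offering useful guidance for putting AI into practice in the robotics industry. In particular, the concept of asynchronous inference and the concrete examples of data quality management help developers with the practical problems they face when moving from theory to implementation.
Editor's comment
Technical articles written from a practitioner's perspective, covering not only model performance but also how to design for system-wide latency and hardware constraints, are highly valuable to developers at the implementation stage.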
Recent advances in Large Language Models have enabled the transition from text-only reasoning to multimodal systems: first with the integration of visual perception in Vision–Language Models (VLMs), and more recently with the generation of robot actions in Vision–Language–Action (VLA) models. Deploying these models on embedded robotic platforms remains a challenge because of tight compute, memory, and power constraints, as well as real-time control requirements.
In synchronous control pipelines, the arm sits idle awaiting commands while the VLA runs inference, which leads to oscillatory behavior and delayed corrections. Asynchronous inference addresses this by decoupling generation from execution, which enables smooth and continuous motion. To be effective, however, the end-to-end inference latency must remain shorter than the action execution duration. This temporal constraint therefore bounds the latency, and thus the minimum throughput, that the model must achieve.
Bringing VLA models to embedded platforms is not a matter of model compression, but a complex systems engineering problem requiring architectural decomposition, latency-aware scheduling, and hardware-aligned execution. Addressing these challenges is essential to translate recent advances in multimodal foundation models into practical and deployable embedded robotic systems.
This guide presents NXP's hands-on best practices for recording reliable robotic datasets and fine-tuning VLA policies (ACT and SmolVLA), and highlights the real-time performance that NXP i.MX95 achieves after optimization.
🎥 Dataset Recording: What Actually Matters
High‑quality, consistent data beats “more but messy” data. This section turns hard‑earned lessons into concrete checklists and schemas.
In our case, we recorded a dataset for the task: "Put the tea bag in the mug."
1) Consistency First
Fixed cameras: Use rigid mounts to avoid pose drift. If one or more cameras shift during recording or evaluation, because of the robot's vibrations or the operator resetting the environment, accuracy can drop severely.
Controlled lighting: Set up your environment so that you control the lighting as much as possible (fixed light sources, away from sunlight, which varies during the day).
Strong contrast: Avoid training with “white on white” unless that’s your deployment domain. Maximize contrast between the arm, the object and the environment.
Fixed calibration: Make sure to have backups of your robot and teleoperator calibrations so you don't have to re-record your previous episodes if the code crashes.
Do not cheat: Do not use information the model will not have access to at inference time. During data recording, it is tempting for the operator to rely on direct visual observation of the scene. However, this introduces information that is absent from the dataset. Dataset collection must be restricted to the same camera inputs that will be available to the policy at runtime.
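The calibration-backup advice above can be automated with a few lines of Python. This is a minimal sketch, not part of the original workflow: the function name `backup_calibration` and both directory arguments are hypothetical, and where your tooling actually stores calibration files is an assumption you need to check.

```python
import shutil
import time
from pathlib import Path


def backup_calibration(calibration_dir, backup_root):
    """Copy the calibration directory into a new timestamped folder.

    Both paths are placeholders: point `calibration_dir` at wherever your
    robot and teleoperator calibration files actually live.
    """
    source = Path(calibration_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"No calibration directory at {source}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    destination = Path(backup_root) / f"calibration-{stamp}"
    # copytree creates missing parent folders and refuses to overwrite
    # an existing backup, so earlier copies are never clobbered.
    shutil.copytree(source, destination)
    return destination
```

Calling it once before each recording session keeps a dated copy to restore from if a crash corrupts the live calibration.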
2) Use a Gripper Camera (Highly Recommended)
Moving from scene-only views to mixed viewpoints increases overall accuracy, but each additional camera adds latency, so you must choose the right compromise. In our case, that balance was reached with 3 cameras:
Top: the global view of the whole scene.
Gripper: the closest view, for precise grasps and alignment.
Left: complements the top view for height and depth.
We strongly recommend using a gripper-mounted camera. It consistently improves success rates on fine manipulation tasks by providing a close, task-relevant viewpoint. Importantly, it is also the camera that most effectively enforces correct data collection practices, allowing the operator to rely exclusively on the robot’s perception rather than observing the scene directly.
When installing a gripper camera, we recommend securing the cable with Velcro or a strain-relief guide to prevent it from obstructing the field of view or becoming disconnected during motion.
3) Improve Prehension

Simple hardware tweaks like heat‑shrink tubing over gripper claws increase friction, reduce roughness, reduce slippage during episodes, and increase task success rate (fewer “almost success” episodes), improving policy learning stability.
4) Diversity & Splits

When recording a dataset, you should:
Vary the episode distribution: Divide your workspace into starting-position clusters, and record at least 10 episodes per cluster. Add diversity by changing the object position and rotation.
e.g. we partitioned the robot arm's reachable workspace into 11 clusters, each measuring 10 × 10 cm.
Differentiate training & validation sets: Policies can easily overfit on the training set, so make sure that the validation set is unseen by the model.
e.g. we removed cluster 6 from the training set.
Record as many movements as you can: Small VLA models exhibit limited generalization to unseen motion. Therefore, record episodes that cover the widest possible range of degrees of freedom.
e.g. we grasped the tea bag either in horizontal or vertical position.
Anticipate failure: Sometimes the policy will not reach the object on the first attempt and will have to "go back to it". We found that dedicating about 20% of all episodes to this go-back case helps the model improve its overall success rate.
e.g. around 20% of our training set corresponds to recovery episodes.
This mirrors best practices across VLA papers and community guides. Here are 3 examples of data diversity within the same cluster:
Starting position 1
Starting position 2
Recovery episode



Starting positions 1 and 2 correspond to different positions within the same cluster. In contrast, during the recovery episode, the robot does not begin in "starting mode"; it is already near the mug and should proceed directly to retrieve the tea bag from that location.
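The cluster-based hold-out described in this section can be sketched as a small index over episode metadata. The layout below is an assumption made for illustration: the dictionary keys (`cluster`, `kind`, `position`) and both helper functions are hypothetical, and the toy index also gives the held-out cluster recovery entries, which may not match how the real dataset was recorded.

```python
def build_episode_index(num_clusters=11, positions_per_cluster=10,
                        recoveries_per_cluster=2):
    """Build a toy metadata index: one dict per recorded episode."""
    episodes = []
    for cluster in range(1, num_clusters + 1):
        for position in range(positions_per_cluster):
            episodes.append(
                {"cluster": cluster, "kind": "standard", "position": position}
            )
        for _ in range(recoveries_per_cluster):
            episodes.append(
                {"cluster": cluster, "kind": "recovery", "position": None}
            )
    return episodes


def split_by_cluster(episodes, held_out_cluster):
    """Keep every episode of one cluster out of the training set."""
    train = [ep for ep in episodes if ep["cluster"] != held_out_cluster]
    held_out = [ep for ep in episodes if ep["cluster"] == held_out_cluster]
    return train, held_out


episodes = build_episode_index()
train, held_out = split_by_cluster(episodes, held_out_cluster=6)

# Guard against leakage: no training episode may come from the held-out cluster.
assert all(ep["cluster"] != 6 for ep in train)

recovery_share = sum(ep["kind"] == "recovery" for ep in train) / len(train)
print(len(train), len(held_out), round(recovery_share, 3))
```

Splitting by cluster instead of by random episode means validation measures generalization to unseen object positions and not memorization of nearby ones.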
🎛️ Fine‑Tuning VLAs

What we did in practice:
Tasks: "Grab the tea bag and place it in the mug."
Dataset: 120 episodes: 10 clusters x (10 different tea bag starting positions + 2 recovery episodes)
3 cameras (640x480px, 30fps): Top, Gripper, Left
Cluster n°6 was removed for validation
Training: The model checkpoint with the lowest validation loss after 200k steps was chosen
For ACT (100 actions per chunk), the best trade-off between accuracy, generalization, and motion smoothness across both the training and validation sets was found within 100k to 160k training steps. For SmolVLA (50 actions per chunk), the trade-off appears after many more training steps. We found that continuing training slightly past the point where the model begins to overfit tends to improve overall accuracy.
Rule of thumb: choose the final checkpoint by evaluating success on both the training and validation sets, not by training loss.
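One way to apply this rule of thumb is to log a measured success rate per checkpoint and pick the best by a combined score. The sketch below is only an illustration: the scoring rule (mean of both success rates, ties broken by validation success) is an assumption, not the authors' criterion, and every number in `results` is made up.

```python
def pick_checkpoint(results):
    """Return the training step whose checkpoint scores best.

    `results` maps a training step to measured success rates in [0, 1].
    The score is the mean of training and validation success, with
    validation success as the tie-breaker (an assumed criterion).
    """
    def score(step):
        entry = results[step]
        mean_success = (entry["train_success"] + entry["val_success"]) / 2
        return (mean_success, entry["val_success"])

    return max(results, key=score)


# Hypothetical evaluation log, for illustration only.
results = {
    100_000: {"train_success": 0.80, "val_success": 0.60},
    120_000: {"train_success": 0.90, "val_success": 0.70},
    140_000: {"train_success": 0.95, "val_success": 0.60},
}
best_step = pick_checkpoint(results)
print(best_step)
```

The third entry shows why training loss alone misleads: training success keeps rising while success on the held-out cluster has already dropped.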
⚡ Optimizing for NXP i.MX95
The i.MX95 integrates 6× Arm Cortex‑A55, Cortex‑M7/M33, a Mali GPU, new ISP, and the eIQ® Neutron NPU, targeting efficient, secure edge inference with multi‑camera support and strong I/O. [nxp.com]
1) Divide And Conquer
Instead of running the models as one monolithic graph, we decompose the VLA graph into logical stages (encoders, decoders, and action experts), which allows each component to be optimized, scheduled, and deployed independently.
In practice, SmolVLA is partitioned into the following sub-blocks:
Vision: processes RGB camera frames and produces visual embeddings.
LLM backbone: generates action tokens from visual and textual embeddings.
Action expert: applies flow matching to iteratively denoise action samples and outputs final control commands.
This separation allows per-block optimizations. The impact of quantizing each block can be measured to choose the best trade-off between latency and accuracy. Isolating the action expert from the VLM was also ideal for running it at a lower frequency.
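The staged decomposition above can be sketched as a chain of independent callables with per-stage timing, which is what makes a per-block latency measurement possible. Everything here is a stand-in: the three stage functions are toy arithmetic, not the real SmolVLA blocks, and `run_pipeline` is a hypothetical helper.

```python
import time


def run_pipeline(stages, inputs):
    """Run named stages in order and record each stage's latency in seconds."""
    timings = {}
    data = inputs
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    return data, timings


# Toy stand-ins for the three sub-blocks (not real model code).
def vision_encoder(frames):
    return [sum(frame) / len(frame) for frame in frames]


def llm_backbone(embeddings):
    return [value * 0.5 for value in embeddings]


def action_expert(features):
    return [round(value, 3) for value in features]


stages = [
    ("vision", vision_encoder),
    ("llm_backbone", llm_backbone),
    ("action_expert", action_expert),
]
actions, timings = run_pipeline(stages, [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
```

Because each stage is a separate callable, one stage can be swapped for a quantized variant and re-timed without touching the others.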
2) Quantization
To optimize inference for i.MX95, we explored several quantization techniques on different blocks. We found that quantizing the vision encoder and LLM prefill had limited impact on accuracy, whereas quantizing the denoising flow in the action expert significantly degrades performance. This behavior is expected, because quantization errors accumulate across iterative denoising steps.
That is why we decided to keep this block at higher precision to preserve stability, while on the other blocks we explored various quantization configurations, from 8-bit mixed precision to 4-bit quantization, depending on the layers.
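The sensitivity of iterative denoising to quantization can be explored with a toy experiment. The sketch below is a deliberately simplified assumption, not the SmolVLA action expert: it fake-quantizes one small weight matrix with symmetric uniform quantization and integrates a made-up velocity field `tanh(W x)` with Euler steps, printing how far the quantized trajectory drifts from the full-precision one at each step. How quickly the drift grows depends on the dynamics, so treat the printed numbers as illustrative only.

```python
import math
import random


def fake_quantize(weights, bits):
    """Symmetric uniform quantization to `bits` bits, returned as floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for row in weights for w in row) / qmax
    return [
        [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]
        for row in weights
    ]


def denoise(weights, x0, steps=10):
    """Euler integration of a toy velocity field v(x) = tanh(W x)."""
    x = list(x0)
    dt = 1.0 / steps
    trajectory = [list(x)]
    for _ in range(steps):
        velocity = [
            math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights
        ]
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
        trajectory.append(list(x))
    return trajectory


def drift(trajectory_a, trajectory_b):
    """Euclidean distance between two trajectories at every step."""
    return [math.dist(a, b) for a, b in zip(trajectory_a, trajectory_b)]


rng = random.Random(0)
W = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)]
x0 = [rng.gauss(0.0, 1.0) for _ in range(8)]

reference = denoise(W, x0)
for bits in (8, 4):
    quantized = denoise(fake_quantize(W, bits), x0)
    print(bits, [round(d, 4) for d in drift(reference, quantized)])
```

Each step feeds its slightly wrong state into the next, which is the mechanism behind keeping the action expert at higher precision while the single-pass blocks tolerate lower precision.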
In addition, we applied in-house optimizations to the different blocks. The results are shown in the table below, where these variants are referred to as optimized models.
3) Asynchronous Inference: Control-Aware Scheduling
In a synchronous control loop, the pipeline operates as:
(1) Capture observation
(2) Run full model inference
(3) Execute generated action
During step (2), the robot remains idle. If inference latency is non-negligible, this produces:
Idle gaps in motion
Oscillatory corrections due to stale observations
Reduced effective control frequency
Poor recovery behavior
With asynchronous inference, action generation runs in parallel with execution:
The robot executes the current action chunk
The next chunk is computed simultaneously
This increases effective control frequency, reduces observation staleness, and improves recovery behavior.
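The overlap between execution and generation described above can be sketched with a worker thread and a one-slot queue. This is a minimal illustration under stated assumptions, not the scheduler used in the article: `run_async`, `infer`, `execute`, and `get_observation` are hypothetical names, and the sketch omits the chunk size threshold and the aggregation of overlapping chunks.

```python
import queue
import threading
import time


def run_async(infer, execute, get_observation, num_chunks):
    """Execute the current action chunk while the next one is computed."""
    chunks = queue.Queue(maxsize=1)  # at most one precomputed chunk waits

    def producer():
        for _ in range(num_chunks):
            chunks.put(infer(get_observation()))
        chunks.put(None)  # sentinel: no more chunks

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    executed = []
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        for action in chunk:
            execute(action)
            executed.append(action)
    worker.join()
    return executed


if __name__ == "__main__":
    def slow_infer(observation):
        time.sleep(0.02)  # stand-in for model latency
        return [(observation, index) for index in range(5)]

    def slow_execute(action):
        time.sleep(0.005)  # stand-in for one control step

    counter = iter(range(1000))
    start = time.perf_counter()
    done = run_async(slow_infer, slow_execute, lambda: next(counter), 4)
    print(len(done), "actions in", round(time.perf_counter() - start, 3), "s")
```

The robot only stalls when `chunks.get()` has to wait, which happens exactly when inference takes longer than executing a chunk.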
On embedded platforms such as i.MX95, asynchronous inference is essential — but only effective if inference latency is kept under the action horizon budget: $T_{\text{inference}} < T_{\text{execution}}$
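The budget condition can be checked with simple arithmetic once the chunk size and control rate are known. In the sketch below the execution time is modeled as actions per chunk divided by the control rate, which is an assumption, as is the 30 Hz rate (the article only states a 30 fps camera rate). The 2.0 s latency is a made-up figure; the chunk sizes of 100 and 50 are the ones quoted for ACT and SmolVLA.

```python
def execution_budget_s(actions_per_chunk, control_hz):
    """Time it takes the robot to play back one action chunk."""
    return actions_per_chunk / control_hz


def fits_budget(inference_latency_s, actions_per_chunk, control_hz):
    """True when T_inference < T_execution."""
    return inference_latency_s < execution_budget_s(actions_per_chunk, control_hz)


ASSUMED_CONTROL_HZ = 30   # assumption, not a figure from the article
EXAMPLE_LATENCY_S = 2.0   # made-up latency for illustration

for actions_per_chunk in (100, 50):
    budget = execution_budget_s(actions_per_chunk, ASSUMED_CONTROL_HZ)
    ok = fits_budget(EXAMPLE_LATENCY_S, actions_per_chunk, ASSUMED_CONTROL_HZ)
    print(f"{actions_per_chunk} actions: budget {budget:.2f} s, fits: {ok}")
```

Under this model a longer chunk buys more time for inference, at the cost of acting on older observations for longer.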
[Table: synchronous inference vs. asynchronous inference. Rows: actions per chunk; chunk size threshold; aggregate function (weighted_average); action queue evolution.]


📊 What We Achieve on i.MX95

Tasks: "Grab the tea bag and place it in the mug."
Test set (20 episodes): 2 random positions for each cluster.
Validation set (10 episodes): all 10 positions in cluster n°6
[Table: inference latency, test-set accuracy (20 episodes), validation-set accuracy (10 episodes), and global accuracy (30 episodes).]
Our immediate objective is to improve task accuracy with SmolVLA (ONNX FP32). We have already established a baseline and measured an optimized on-board inference latency of 6.15 s.
The next phase will focus on deeper optimizations on our NPUs. In parallel, we aim to move from single-task setup toward longer-horizon and more complex scenarios. To do that, we will introduce:
Simulation environments for scalable data generation and benchmarking
Reinforcement Learning (RL) for policy refinement
Sim-to-Real transfer to bridge domain gaps and improve real-world performance
The goal is to move from a single validated manipulation task toward a reproducible methodology for deploying VLA policies on embedded robotic systems.
✅ Checklists You Can Reuse
Fixed mounts verified
Good camera focus and illumination
Good gripper claw prehension
Calibration file backups saved
Contrast validated
Save/eval checkpoints every 20k steps
Also save your training parameters so that you can resume training if needed
Prepare your validation set and your tracking method for accuracy and latency in advance
Deployment on i.MX95
You are satisfied with your accuracy
Contact us to have your model optimized
📚 Resources & Inspiration
ACT documentation & paper (core idea, action chunking, low‑demo success). [huggingface.co], [arxiv.org]
SmolVLM/SmolVLA family & repos (compact multimodal + VLA design). [huggingface.co], [github.com], [smolvla.net]
Sherry Chen’s HF blog on training ACT on SO‑101 (practical lessons, pitfalls, fixes). [huggingface.co]

