TLDR AI · June 29, 2026, 09:00 · about 31 min

Moneyball for Physical AI (26-minute read)

#Physical AI #Data Strategy #Robotics #Quantitative Analysis

TL;DR

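This article argues that data is being mis-valued in Physical AI and that, as with baseball's "Moneyball" strategy, the field should rely on statistics that demonstrate real effectiveness rather than on intuitive metrics.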

AI deep analysis · June 30, 2026, 02:05


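Key points

Market mispricing and the intuition trap

The subjective aesthetics and traditional metrics favored by legacy scouts and industry insiders (e.g., stolen bases, batting average) correlate only weakly with actual outcomes (runs scored, tasks completed), and data assets are undervalued as a result.

The true signal in Physical AI

Success in Physical AI depends on quantitative statistics such as task completion rate and environmental adaptability, the equivalents of on-base percentage (OBP). Identifying them is the key to competitive advantage.

Shifting to a data-driven strategy

The Physical AI industry needs "Moneyball" thinking: moving away from intuition and rules of thumb, and allocating resources and development effort according to mathematically analyzed metrics.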


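Impact analysis

The article warns against the "intuition bias" that the industry tends to fall into at the current stage of Physical AI (robotics, autonomous systems, and similar), and urges a shift to data-driven evaluation criteria. If investors and developers move away from evaluation by gut feeling and appearance and concentrate resources on metrics that demonstrate real effectiveness, the industry as a whole can be expected to become more efficient and to move faster.

Editorial comment

Applying baseball's "Moneyball" strategy to Physical AI is a metaphor that points sharply at the confusion over evaluation criteria the industry faces, and it offers a useful perspective for developers and investors.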

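In 2002, the Oakland Athletics won 103 games despite maintaining the third-lowest payroll in Major League Baseball. This advantage emerged because the market for player assets was mispriced: legacy scouts favored subjective aesthetics, stolen bases, and batting averages, whereas forward-looking management mathematically isolated on-base percentage, the statistic that actually correlated with runs.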

*直感に頼る評論家であふれる分野の中で、正しい統計を用いてシグナルを見つけること:* マネーボール！

物理的 AI（Physical AI）におけるデータは誤解され、不当に評価されています。

Data for Physical AI does not exist; it has to be generated, at an intrinsic cost. We need to move beyond naive scaling by hours or tokens.

スケール重視の考え方は往々にして「データを信じる」という意味合いを帯びますが、テキストとは異なり、ロボット用のデータは採掘して得られるものではありません。有用な 1 時間はすべて対価を支払って得る必要があるため、収集量は線形に増加する一方、コストは低下しません。最近、ケン・ゴールドバーグは、最先端のロボティクスモデルには約 10 万年分のデータが必要かもしれないと推定しています。

AGI（人工一般知能）革命は、劣悪な労働環境での遠隔操作（Sweatshop Teleop）によって成し遂げられるものではありません。

このボトルネックを回避するために、業界は手動の遠隔操作インフラを拡張してきました。しかし、累積運用時間を最適化することは、初期の野球における「打率」の誤謬を再現するものであり、実際の下流モデルのパフォーマンスと相関が弱い、目に見えやすく資金調達が容易な指標を優先することになります。代替戦略として、ロボットを生産に投入してテレメトリデータを収集し、運用収益の副産物としてゼロコストで活用するという提案もあります。このモデルは、同じ統計的誤謬のより微妙なバージョンを導入するものです。今日、展開が可能であるニッチとは、変動が最も少なく、最小限の限界効用しか生み出さない低エントロピーで相関のあるデータストリームを持つ領域です。

本稿では、データの限界効用に関する枠組みを構築し、それを物理的 AI における価値蓄積について議論するために使用します。 私たちは、データ量に伴う損失の振る舞いを導くスケーリング法則と、1 ドルのデータがどの程度の価値を持つかを支配する単位経済学の視点からアプローチします。これらを組み合わせることで、物理的 AI の出塁率に相当する、ドルあたりの概算限界効用を得ることができます。

資本効率を高めるのは、データの量を最大化することではなく、データの新奇性を正確に計算し価格設定することです。 結論だけを読みたい場合は、推奨事項へジャンプしてください。

多様な利害関係者はデータに対して異なる見解を持っています。都合よくも、*それぞれの世界観が自らの部分を最も価値あるものだと主張します。*

ファウンデーションモデルラボは一般化されたモデルの規模を販売するため、大規模事前学習の役割を過大評価し、生計算資源のスケーリングが最終的にエッジケースエラーを排除すると仮定して運営されています。テレオペレーションベンダーはインフラストラクチャユーティリティとして位置づけられ、生操業時間を優先・収益化しており、その収益はデータ量に比例して拡大しますが、有用性や新規性には比例しません。ハードウェアの既存企業は環境の定常性を前提としており、分布外（out-of-distribution）の状況ではソリューションが機能しないためです。そして、多くの学術ロボット研究者はこれがデータの問題ではないと否定し、物理学・モデル・制御によってギャップを埋められると信じていますが、データの洪水なしに実現できると考えています。

分析すべき主要なアーキタイプは「ネオ・インテグレーター」です。このモデルは、人間の監視を伴うオペレーション障害の管理を通じて、商業生産に特化したロボットユニットを展開することで、データ収集のボトルネックを回避しようと試みます。その核心となる仮説は、証明されていない経済的なフラインホイール（増幅装置）に依存しています：生産テレメトリが、マルチタスク能力を訓練するために必要な新規性を生成するというものです。Standard Bots の Evan Beard はこの点を詳細に論じています。Kyle Vedder は「まず展開」という立場に反発し、初期段階の展開に対して支払いを行う環境は自然に変動が小さいため、「新規性ポンプ」の制約が生じると主張しています。

私たちは、この議論を、実証的なスケーリング法則とデータ収集の単位経済学を組み合わせた中立的な枠組みを通じて分析し、1 ドルあたりのモデル能力が最も高くなる配分戦略が何かを正確に特定します。

物理 AI におけるデータ運用は、コストと情報密度の間のトレードオフによって定義される 3 つのモダリティにまたがります:

観測データ：低コストで広範な範囲をカバーするが行動情報が不足したコーパス（例：主観視点および客観視点の動画）。このモダリティは表現のサポートを広げますが、直接的な行動監督には欠けます。
介入データ：高コストで範囲は狭いが行動密度の高いデモンストレーション（例：遠隔操作）。このモダリティは明示的な状態・行動軌跡をマッピングしますが、人的労働に対して線形にスケーリングします。
展開データ：生産システムによって生成される内生テレメトリであり、多くの場合損失を出しながら稼働しています。このモダリティは未整理であり、アルゴリズム設計ではなく商業運営によって決定された環境分布をサンプリングします。

データの最大化は、トレーニング効率を低下させる低エントロピーのノイズを導入することがあります。言語モデリングにおける C4 データセットで示されたように、*サブセットの削減がモデルの改善をもたらすことが実証されています*。具体的には、固定予算内で固有のトークンカバレッジを最大化するために、定型文やニアコピーをフィルタリングすることが重要です。

ステークホルダーとして、私たちが問わなければならないのはこれらです。各タイプのデータにおいて、1 ドルは何をもたらすのか？新しい情報はどこから来るのか？そして、収集するために支払われるデータであるデプロイメントは、実行可能なタスクの範囲を広げるのか、それともすぐに枯渇してしまうのか。

データパイプラインの評価は資本配分の問題である: データの限界コストと、モデルの一般化能力を高めるための新規情報とのバランスを取ること。

スケーリング則に関する文献は、言語モデルについてこれらの問いに答えています。データセットにとって重要なのはサイズだけではありません：それが保持する固有の例の数、混合の多様性、各例が繰り返される頻度、そして新しいデータが既存データに対してどの程度近いかです。

*はい、限界効用逓減のべき乗則として、下限まで低下します。* テスト損失は、データ、モデルサイズ、計算リソースに対して対数-対数プロット上で直線状に減少します (Kaplan 2020)。サイズを *N*、トークンを *D* とし、結合スケーリング定式化 (Hoffmann 2022) の下では、損失は以下のようにモデル化されます：

関数形は一定ですが、数値は近似値に留まります (Besiroglu 2024)。計算最適配分において、削減可能な 2 つの項はデータ率で減衰し、一次元の包絡線へと収束します。

定数 E は、モデルの不可避的な予測不確実性を表しています。

*はい、データセットの量とは独立した軸を横断して運用することです。多様なデータの混合は、2 つの同時効果をもたらします：ドメイン間の転移と多様体のカバレッジ拡大を通じて漸近的な誤差の下限を引き下げると同時に、データセットの本質次元（*dint*）を増大させます。滑らかなターゲットの場合、解像度制限された領域では *β ≈ 4/dint* が成り立ちます。ここで*dint*はデータの多様体の本質次元です (Sharma & Kaplan 2020; Bahri 2021)。

**β が次元の逆数として現れるため、タスクの本質次元を半分にするだけで、スケーリング指数はおおよそ倍になります：損失曲線がより速く低下します。しかしこれは、一般化をもたらさない劣った最適解への収束という代償を伴います。一般化を最大化するためには、事前学習の分布は意図的に本質次元を人為的に低くしないようにする必要があります。

データ混合則 (Ye et al. 2024) は、混合されたデータの損失を直交する各ドメインごとのべき乗則と相互結合項に分解し、これらが正の転移をもたらすか負の干渉をもたらすかを決定します。

反復は約 4 エポックまで限界効用を提供し、新鮮なトークンの効率と同等となります。この閾値を超えると効用は急速に減衰し、最終的には能力を低下させます。Muennighoff et al. (2023) は、半減期 *R* ≈ 15 の指数関数的飽和をフィットさせています 1: 4 回のパスでは無視できるペナルティしか生じませんが、16 回のパスは限界効用逓減の厳格な領域を定義し、追加の計算資源は情報獲得をもたらしません。さらに、限られたデータ断片に過度に依存することは、テスト損失において局所的なダブル・デセント（二重下降）異常を引き起こし、コンテキスト内学習を支配する回路メカニズム、特に誘導およびコピーヘッドを根本的に劣化させます (Hernandez et al. (2022))。コーパスのわずか 0.1% を 100 回反復するだけで、800M パラメータモデルのダウンストリーム性能は 400M パラメータベースラインレベルまで崩壊し、わずかな分布上の冗長性でさえ巨額の資本流出を引き起こすことが示されます。

Figure: loss of an LLM (4.2B parameters) decays more slowly when scaled on repeated data, yielding worse-than-expected performance (Muennighoff et al. 2023).

重複に近いサンプルは、完全な反復と全く新しいサンプルの間に位置するユーティリティ連続体上に存在します。これらの冗長性を除去することは、モデルの一般化能力を向上させると同時に、固有の多様体に対するトークン予算を最適化します。Lee et al. (2021) は、C4 コーパス内で 60,000 回以上出現する個々の文が存在することを発見しました。大規模コーパスにおける冗長性は、逐語的な暗記を緩和し収束速度を加速させるために体系的な重複除去を必要とします。メカニズム的には、小さな摂動がモデルに有界な近傍（*x* と *x + ε*）全体で同一のターゲットをマッピングさせるよう強制し、これは暗黙的な一貫性正則化として機能します。その結果、重複に近いサンプルは非常に低いユーティリティを持ちます。中程度の *ε* において、正則化は有用であり、*ε* が拡大するとそれは独立したデータポイントとなります。狭い近傍内で密にサンプリングすることは局所的な容量を急速に飽和させ、モデルのパフォーマンスを低下させます。

稀な外れ値（OOD）事象は、スケーリング限界におけるモデル性能が失敗の裾野によって制約されるため、極めて大きな限界効用をもたらします。現実世界の物理的分布は重尾性を示し、マクロ能力のスケーリングは、頻度に基づいて順次獲得されたサブスキルの Zipf 分布を習得することから生じます（Michaud et al., 2023）。最先端の精度を達成するには、これらが総体的に運用密度の大部分を構成するこれらの稀なサブ集団に適合させる必要があります（Feldman, 2020）。したがって、高難易度かつ低頻度のサンプルに対してフィルタリングを行うことで、標準的なべき乗則スケーリングの制約を完全に回避することができます（Sorscher et al., 2022）。これらのエッジケースは現実世界の確率過程に根ざしているため、合成生成や構造化された段階的展開によって再現することは困難です。しかし、モデルの既知分布が拡大するにつれて、残る新規変異は指数関数的に希少になり、発見にかかる限界費用が急激に上昇します。

要約:

データ量を増やすことは、一定の下限までべき乗則を達成できる。
多様性は、取得速度を犠牲にしてその下限を下げる。
反復はほとんど効果をもたらず、最終的には性能を低下させる。
ほぼ重複するデータは、意図的な微小摂動を除けば最も弱いものである。

長期の裾野にある稀な事象は非常に情報量が多いが、発見にかかるコストは次第に高騰している。

言語モデルにおいて、計算資源は制約要因であり、データは豊富かつ低コストである。一方、ロボット工学においては、有用なデータはデータ収集コストによって厳しく制限されている。その結果、目的関数は計算効率の最大化から、1 ドルあたりの限界損失削減の最大化へとシフトする。

グローバルな能力目標は、事前重み（πj）が割り当てられた離散的なタスククラスター j に対する凸結合としてモデル化される。各独立したクラスターは、環境パラメータに条件付けられた独自のスケーリング包絡線に従う：

データ削減可能な項に対する下限 Aj(φ) は、指数 βj ≈ 4/dj によって決定され、これはクラスターの内在次元（dj）に依存する。

有限の資本配分を最適化するためには、利用可能な収集およびキュレーションチャネルすべてにおいて、1 ドルあたりの限界価値が等しくなるように資源支出を行う必要がある。

インターベンションチャネル：アクティブなデモンストレーションデータは、直接行動の監督に対してプレミアム価格を持ちます。これは急速なボリューム飽和を引き起こし、主にクロスタスクスキル転移を通じて経済的効用を生み出します。ある行動チャネル i は、方向性カバレッジ Dj = Σi wij gi(ni) を価格 ci で購入します。ここで gi は飽和し、wij はチャネル i からタスククラスター j へのドメイン間転送投影を表します。このリターンを資本支出に対して微分することで、1 ドルあたりの限界効用がマッピングされ、資源配分の最適化が行われます。
オブザーベーションチャネル：パッシブな観測データは独自の価格を持ちます。行動ラベルがない場合、これは基盤となる表現空間の最適化を通じて経済的効用を生み出します。同時に、アレルトリック（事象的）下限を抑制し、スケーリング係数を正則化します。

この下限はセンサーに依存します。 アレルトリック項は、特定のロボット構成によって観測される情報状態に対してのみ不可減です。私たちは、センシング構成 φ におけるタスククラスター j の下限を条件付きエントロピーを通じて形式化します：Aj(φ) = E [H[a | sφ]]。この総リスクは、絶対的な物理的限界とセンサーで対応可能なマージンに分解されます：

運用上の示唆。低解像度センサーでは分解できない環境変動は、そのモデルに対して確率的なアロエトリックノイズとして現れますが、より忠実度の高いセンサーを使用すれば、それは予測可能なエピステミックエラーに変換されます。アクションデータはデータ削減項を *Aj(φ)* へと押し下げ、一方、より優れたセンシングは *Aj(φ)* そのものを低下させます。

タスクが実現可能なのは、損益分岐点の損失閾値が *Aj(φ) < Lneutral* を満たす場合に限りです。最適なセンシングによって *Ajmin* が *Lneutral* 以上になる場合、データ量の拡大は数学的に無意味となります。このシステムでは、ハードウェアの再構成か、全く異なる運用タスクへの切り替えが必要です。

生産テレメトリは急激な減衰曲線を描く石油井戸のような振る舞いをします：初期の運用では高エントロピーの故障モードが生じますが、異常が解決されるにつれて急速に低効率でほぼ重複した状態へと衰退します。この局所的な分布サンプリングは指数関数的飽和を示し、*Ueff (n) = U0 + ΔU(1 - e-n/n_c)* となります。被覆数（nc）を超えると、生産ストリームは低効率的な純粋な反復へと崩壊します。

高価値のデータは厳密に故障の裾野に集中しており、通常の運用成功には限界効用はゼロです。価値があるのは失敗の中にあり、通常の成功の中ではありません。重要なのは、展開時における収益相殺後のデータ 1 件あたりのネットコスト（*cdep*）が、モデルの瞬間的な損失（*L*）によって内生的に決定され、非線形にスケールし、かつその損失値によって上限が定められる点です。

すべての用語は、ログされたタスク試行あたりのドル額を意味します。*v* はタスク完了あたりの価値、*ρ(L)* はエラー率、κ_int と κ_prod はそれぞれ試行あたりの介入コストと生産性損失コストです。運用上の損益分岐点（*cdep*~0）に達する前、データ収集は赤字状態で運営されており、これは初期展開フェーズを運用収益ではなく研究開発資産として資本化しなければならないことを意味します。**

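Figure: an example of deploying a model that is not yet ready. On a logarithmic x-axis (capturing data/cost), the improvement needed to reach cost neutrality can be large, and the data collected along the way is not necessarily high-value.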

ギャップを越えるには資本が必要です。 一般的な通説として、「モデルが少なくとも 95% のパフォーマンスを発揮し、エラーハンドリングに介入を行うことから始め、99.5% に達した時点で展開が利益を生むようになる」というものがあります。この直観を定量化すると、*Lstart*（展開開始における最大の損失）と*Lneutral*（運用上の損益分岐点*）となります。このフェーズでは、展開は赤字状態で運営され、外部からの資本化が必要です。2

特筆すべきは、データ要件がべき乗則に従ってスケールするため、この差は桁違いの規模となり、結果として非常に大きな総コストとなることです。さらに重要なのは、損益分岐率がそのアレトリック（事象固有）下限に近いタスク（*Lneutral* ~ *Aj(φ)）は資本の吸い込み源であるということです。これは、展開をスケールする前に広範な投資を行うべきという定量的根拠となります。**

最適化されていない基盤モデルの下で実現可能な商業的展開を達成するためには、統合業者は環境のばらつきを人工的に制限し、実質的にタスクの本次元性を縮小する必要があります。§3.2 に示されるように、タスクの本次元性をより小さな *dj* に減らすことは ⇒ より大きな *βj* を意味します：収束が急峻になりますが、それは狭く転用不可能な多様体上でのみ起こります。*したがって、これらの到達可能な商業的ニッチ内で収集されたデータはエントロピーが低く、一般化された基盤モデルを進展させるために必要な情報密度がほとんど含まれていません。

これは自己強化型の欠陥を生み出します。構造化された運用セルは低エントロピーで相関のあるデータを生成しますが、このデータはモデルのより広範な汎化境界を広げることに失敗し、システムを永久的に初期のニッチに制限してしまいます。断片化さればらつきが小さいニッチ内で動作することは、タスクごとに多額の非反復的エンジニアリングオーバーヘッドを伴います。ソフトウェアのような利益率を実現するためには、逐次的なタスクの統合コストの限界値は漸近的にゼロに近づかなければなりません。狭くばらつきが少ない展開テレメトリは、この統合オーバーヘッドを削減する能力を持ちません。

この定式化は、エコシステム内の分岐した視点を統一します：

ステージ化バイアス（介入データ）：アクション密度は高いが、人工的に構造化されている。このデータタイプはモデルプロバイダーに好まれている（Vedder の投稿）。有界でクリーンなシミュレーションまたは実験室環境からサンプリングされる。明示的な軌道マッピングには優れるが、現実世界の物理におけるカオスなアレルティック失敗の尾部をサンプリングできない。
分布バイアス（展開データ）：現実世界での存在度は高いが、人工的に狭められている。これはニュー・インテグレーターが頼りにする柱である（Beard の投稿）。商業的持続可能性を維持するため、システムは低分散の運用ニッチに制限される。誤った分布混合体からサンプリングされ、エントロピーが低く相関したデータを生成し、一般化表現を駆動できない。

狭い応用から広い応用へと順次展開するのは、展開可能なタスクの拡張速度が複合する NRE 統合欠損（NRE integration deficit）を上回る場合にのみ経済的に実現可能である。商業ニッチからの展開データはこの拡張を駆動できないため、モデルには外部入力が必要となる：アレルティック下限（Aleatoric floor, Aj）を低下させるための「観測的広さ」と、モデルの一般化境界を拡張するための「介入的多様性」である。このベースラインの広さが確立されて初めて、生産のフライングホイールは回転し始める。

データエンジニアリングパイプラインは、累積運用時間を主要指標として廃止すべきである。代わりにチームは、計算的推定可能性に基づいて順序付けられた、定量化可能なタスク固有のテレメトリを追跡する必要がある。

マージナル統合コスト：新規タスクあたりの非反復的エンジニアリングコストをプロジェクト会計を通じて追跡する。この指標がタスクリストの拡大に伴って減少しない場合、基盤となるモデル層はタスク間表現を複利化しておらず、ビジネスモデルはスケーラブルなソフトウェアから線形システム統合へと後退している。
タスク別飽和点 (nc)：タスク固有または環境固有の学習曲線が横ばいになる転換点を特定する。この境界でデータ収集を停止することで、手動遠隔操作予算における資本浪費の主要な要因を抑制できる。
分布ドリフト (vj)：分布外 (OOD: out-of-distribution) 入力の速度と、必要なモデル再トレーニングの頻度を監視する。非定常的な目標分布は継続的に情報豊富な失敗モードを生成するため、継続的デプロイメントテレメトリが持続的なデータ優位性を生み出す唯一の運用レジームとなる。
クラスター被覆率：生のエピソード数を追跡するのではなく、標準的なデータ埋め込み内における直交するタスク、オブジェクト、および環境クラスターの量を定量化する。クラスター拡大の縦断的傾向は、ドメイン横断的な一般化のための代理指標として機能する。
データ新規性密度：アンサンブル不一致や記録された状態における予測分散などの能動学習ヒューリスティックを用いて、流入ストリームの情報密度を推定する。これにより、低エントロピーの日常的な運用成功事例がフィルタリングされ、高有用性の失敗テールに優先度が割り当てられる。

測定不能な不確実性の境界。 実現可能性を支配する主要変数である、事象的下限 *Aj(φ)* は直接マッピングできない。*L(D) = E + B D-β* の適合により漸近値 *E* を分離することで近似することは可能だが、近似上の課題のため、直接的な使用は不可能である。

物理的 AI サプライチェーン内における組織の位置は、そのデータ可視性、運用上の焦点、そして盲点を決定づけます。

モデルファースト型ラボ：多様な身体表現を持つ観測データコーパスの大規模なキュレーションとクリーニングを通じた事前学習に焦点を当てる。この広範さが複合的な一般化能力を駆動する。世界モデル型ラボは、学習済みモデルを用いて介入データを安価に製造するためのデータ作成に賭ける。しかし、段階的に設計された介入データの過度な使用により依然として大きなボトルネックが存在する。静的な事前学習や合成シミュレーションでは正確に再現できないエッジケース展開におけるアレルティック（確率的）失敗のテール部分である。
垂直統合型プレイヤー：展開テレメトリに焦点を当て、独自ハードウェア上でデータ収集とクリーニングを直接行う。ハードウェアに最適化されたデータは効率的だが、この戦略にはボトルネックがある。自律走行のような自然な変動が大きいドメインを除き、展開済みシステムは循環性の罠に陥る：商業的に存続するためには低変動環境への運用制限を余儀なくされ、それが広範なモデル一般化を駆動する新規性の低いデータを生み出す。
ネオ・インテグレーター：多様な産業環境に浅い運用足跡を維持する。タスクの多様性（複合的なスケーリング項）を収穫できる立場にある。しかし、彼らのビジネスモデルは通常、この足跡を課金の対象として扱うだけで、能動的なデータキュレーションの場とはみなさず、これは戦略的誤りである。
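Teleoperation vendors: Monetize data creation by selling operational hours. Because their economic incentive is to maximize raw volume rather than distinct-sample coverage, they operate past the per-task saturation threshold (nc). They sell an infrastructure utility ("shovels") that generates local revenue but provides no scaling advantage.
Hardware incumbents: Defend high-margin, low-variance market segments designed for deterministic motion replay. They collect little data for learning and have no path up the scaling curve.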

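The scarcest capability in Physical AI is identifying and capturing data novelty. Value accrues systematically to operational teams that can isolate out-of-distribution variation, regardless of the traditional organizational split between research and hardware engineering.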

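The success of the software application layer that rents foundation models by the token (e.g., Cursor, Harvey) suggests that value can be captured without prioritizing model pre-training or data novelty. However, economic value capture and model capability are independent variables. Downstream applications monetize successfully through workflow integration and proprietary distribution, but their technical constraints differ from those of Physical AI along three axes:

Task dimension and saturation: The utility of data is determined by the intrinsic dimension of the target domain. Software development has high intrinsic dimension, so continuous workflow feedback keeps yielding marginal utility. Many physical tasks (e.g., structured warehouse picking) have low intrinsic dimension, so task-specific data streams saturate rapidly and move quickly into the regime of diminishing returns.
Foundation-model asymmetry: Software applications operate downstream of generalized, heavily subsidized foundation models. Physical AI has no comparable rentable foundation layer, so current robot deployment strategies must artificially reduce environmental variance to remain operationally viable. This constraint confines data collection to specialized subsets that cannot drive broader generalization.
Telemetry and margin constraints: Software environments allow complete, low-cost observation of the entire operational loop (source code, user edits, compilation results, and so on). Physical telemetry is expensive to collect and inherently under-observed because of sensor-resolution thresholds (the aleatoric floor). Moreover, if the foundational observational data for Physical AI remains rivalrous and proprietary, leverage concentrates in the upstream model layer: infrastructure providers retain monopoly pricing power and compress downstream application margins.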

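Optimizing a data budget by tracking raw data volume (e.g., cumulative operational hours) is not an effective proxy for model performance. Neither staged interventional data nor narrow deployment data can scale a foundation model on its own. High-volume interventional collection in staged settings yields diminishing returns because per-task sample coverage saturates rapidly. Relying on commercial deployment as a low-cost data collection strategy introduces an economic trap: the commercial niches that generate revenue generally lack the novelty needed to improve model generalization. Instead, every new environment incurs non-recurring engineering cost to integrate a narrow model with error handling.

Data engineering pipelines should deprecate cumulative operational hours as a primary metric. Engineering efficiency and model scalability should be evaluated with quantifiable parameters: marginal engineering integration cost per task, per-task saturation threshold, cluster coverage within the data embedding, and distributional drift ($v_j$).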

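The optimal capital allocation strategy balances the data types against their respective utility metrics:

Observational breadth: Prioritize low-cost, diverse observational data to lower the aleatoric error floor and widen the boundary of baseline capability.
Interventional staging: Cap high-cost interventional demonstration data strictly at the task saturation threshold (nc), and reallocate the remaining budget to task diversity rather than repeated trials.
Deployment telemetry: Filter production streams to identify out-of-distribution edge cases and failure modes, and discard high-volume routine successes that carry little information.
Cost drain in early deployment: Early deployment can provide useful signal, but sustained deployment before reaching break-even drains capital.

Moneyball for Physical AI: Ultimately, capital efficiency scales not by maximizing data volume, but by accurately valuing data novelty.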

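If you find these frameworks useful for your research or strategic planning, please consider citing this article:

¹ With $U$ unique examples and $r = T/U$ passes, the effective dataset size scales as $D_{\mathrm{eff}} = U \cdot f(r)$, where $f(r)$ saturates exponentially.

² The divergence grows as the margin $\Delta_{\mathrm{safe}}(\phi) = L_{\mathrm{neutral}} - A_j(\phi)$ approaches 0: the data required to reach loss $L$ scales as $(L - A_j)^{-1/\beta_j}$. As the break-even target approaches the aleatoric floor, the required data, and hence the cost, therefore grows superlinearly even though the exponent $\beta_j$ is fixed.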

原文を表示

In 2002, the Oakland Athletics won 103 games despite maintaining the third-lowest payroll in Major League Baseball. This advantage emerged because the market for player assets was mispriced: legacy scouts favored subjective aesthetics, stolen bases, and batting averages, whereas forward-looking management mathematically isolated on-base percentage, the statistic that actually correlated with runs.

*Finding the signal with the correct statistic in a field full of intuitive pundits:* Moneyball!

Data for Physical AI is misunderstood, and mis-priced.

Being scale-pilled often amounts to “believe in data”. However, unlike text, robot data isn't available to be mined. Every useful hour is paid for, so collection scales linearly while costs don't fall. Recently, Ken Goldberg estimated that frontier robotics models might require approximately 100,000 years of data.

AGI revolution will not be supervised with Sweatshop Teleop.

To bypass this bottleneck, the industry has scaled manual teleoperation infrastructure. However, optimizing for cumulative operational hours replicates the “batting average” fallacy of early baseball: it prioritizes a visible, easily fundable metric that correlates weakly with actual downstream model performance. An alternative strategy proposes deploying robots into production to harvest telemetry as a zero-cost byproduct of operational revenue. This model introduces a subtler version of the same statistical error. The niches where deployment is possible today are the ones with least variance and yield low-entropy, correlated data streams with minimal marginal utility.

This essay builds a framework for the marginal utility of data, and uses it to discuss value accrual in Physical AI. We take the perspective of the scaling laws that guide how loss behaves with data, and the unit economics that govern what a dollar of data is worth. Together they give an approximate marginal utility per dollar, the on-base percentage of physical AI.

Capital efficiency scales not by maximizing data volume, but by accurately computing and pricing data novelty. If you’d rather skip to conclusions, jump to recommendations.

Stakeholders hold differing views on data. Conveniently, *each worldview happens to make its own slice the most valuable.*

Foundation-model labs sell generalized model scale and, as a result, overweight the role of large-scale pretraining, operating under the assumption that raw compute scaling will eventually eliminate edge-case errors. Teleoperation vendors are infrastructure utilities that prioritize and monetize raw operational hours, since their revenue scales with data volume rather than utility or novelty. Hardware incumbents operate on the assumption of environmental stationarity, since their solutions fail out-of-distribution. And a large camp of academic roboticists denies it is a data problem at all, expecting physics, models, and control to close the gap without the deluge.

The key archetype to analyze is the *neo-integrator*. This model attempts to bypass data-collection bottlenecks by deploying specialized robotic units into commercial production, utilizing human-in-the-loop oversight to manage operational failures. The core thesis relies on an unproven economic flywheel: production telemetry will generate the novelty required to train multi-task capabilities. Evan Beard of Standard Bots makes the case at length. Kyle Vedder pushes back on deployment first, arguing that the environments willing to pay for early-stage deployment are naturally low-variance, creating a "novelty pump" constraint.

We analyze this debate through a neutral framework combining empirical scaling laws and the unit economics of data capture, isolating exactly which allocation strategy yields the highest model capability per dollar.

Data operations in physical AI map across three modalities, each defined by trade-offs between cost and information density:

Observational Data: Low-cost, high-breadth, action-deficient corpora (e.g., egocentric and exocentric video). This modality expands support of the representation, but lacks direct action supervision.
Interventional Data: High-cost, low-breadth, action-dense demonstrations (e.g., teleoperation). This modality maps explicit state-action trajectories but scales linearly with human labor.
Deployment Data: Endogenous telemetry generated by production systems, often running at a loss. This modality is un-curated and samples an environmental distribution dictated by commercial operations rather than algorithmic design.

Data maximization often introduces low-entropy noise that degrades training efficiency. As demonstrated on the C4 dataset in language modeling, *subset subtraction results in model improvements*: filtering boilerplate and near-duplicates maximizes distinct token coverage within a fixed budget.

As stakeholders, the questions we have to ask are these. What does a dollar buy in each type of data? Where does new information come from? And can deployment, the data we are paid to collect, widen the set of tasks we can deploy, or does it run dry quickly?

Evaluating a data pipeline is a capital-allocation problem: balancing the marginal cost of data against the novel information it adds and its ability to advance the model’s generalizability.

The scaling-law literature answers these questions on language models. What matters about a dataset goes beyond its size: how many distinct examples it holds, how diverse the mixture is, how often each example repeats, and how close new data is relative to existing data.

*Yes, as a power law with diminishing returns, down to a floor.* Test loss falls as a straight line in log-log against data, model size, and compute (Kaplan 2020). With model size $N$ and tokens $D$, the joint scaling formulation (Hoffmann 2022) models loss as the sum of an irreducible constant $E$ and one power-law term each in $N$ and $D$.

The functional form is consistent, while the numerical values remain approximations (Besiroglu 2024). At the compute-optimal allocation the two reducible terms decay at the data rate and collapse to a one-dimensional envelope in $D$ alone.

The constant E represents the model's irreducible predictive uncertainty.
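For reference, the parametric form fitted by Hoffmann et al. (2022) is usually written as below. The constants $A$, $B$, $\alpha$ are the standard fitted coefficients of that paper and are not given values in this essay; $\tilde{B}$ is a label assumed here for the rescaled constant that results once $N$ is tied to $D$ at the compute-optimal allocation, which appears to be the $B$ used later in $L(D) = E + B\,D^{-\beta}$.

```latex
\begin{aligned}
L(N, D) &= E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \\
L(D) &\approx E + \tilde{B}\, D^{-\beta}
  \qquad \text{(compute-optimal } N \text{ chosen for each } D\text{)}
\end{aligned}
```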

*Yes, along an axis independent of dataset volume.* A diverse data mixture has two simultaneous effects: it drives down the asymptotic error floor via cross-domain transfer and expanded manifold coverage, and it increases the intrinsic dimension of the dataset ($d_{\mathrm{int}}$). In the resolution-limited regime, $\beta \approx 4/d_{\mathrm{int}}$ for a smooth target, where $d_{\mathrm{int}}$ is the intrinsic dimension of the data manifold (Sharma & Kaplan 2020; Bahri 2021).

Because β enters as an inverse of dimension, halving a task's intrinsic dimension roughly doubles the scaling exponent: the loss curve falls faster. But this comes at the cost of converging to an inferior optimum that does not generalize. To maximize generalization, pre-training distributions must deliberately avoid artificially low intrinsic dimensionality.
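A small numerical sketch of the exponent argument, assuming only the rule of thumb $\beta \approx 4/d_{\mathrm{int}}$ quoted above. The function names and the example dimensions (16 and 8) are illustrative and not taken from the essay.

```python
def scaling_exponent(intrinsic_dim: float) -> float:
    """Resolution-limited rule of thumb from the text: beta ~ 4 / d_int."""
    return 4.0 / intrinsic_dim


def reducible_loss_ratio(data_multiplier: float, intrinsic_dim: float) -> float:
    """Factor by which the reducible term B * D**(-beta) shrinks when D is multiplied."""
    return data_multiplier ** (-scaling_exponent(intrinsic_dim))


# Illustrative dimensions: halving d_int from 16 to 8 doubles beta.
for d_int in (16, 8):
    beta = scaling_exponent(d_int)
    ratio = reducible_loss_ratio(10.0, d_int)
    print(f"d_int={d_int:2d}  beta={beta:.3f}  reducible loss after 10x data: {ratio:.3f}x")
```

The faster drop for the lower-dimensional task is a statement about the reducible term only; it says nothing about how far that narrower manifold transfers.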

The data-mixing law (Ye et al. 2024) decomposes a mixture's loss into orthogonal per-domain power laws and cross-coupling terms, which dictate either positive transfer or negative interference.

Repetition provides marginal utility up to approximately four epochs, matching the efficiency of fresh tokens; beyond this threshold, utility decays rapidly, eventually degrading capability. Muennighoff et al. (2023) fit an exponential saturation with half-life $R \approx 15$¹: four passes incur negligible penalty, while sixteen passes sit firmly in the regime of diminishing returns, where additional compute yields little information gain. Furthermore, over-indexing on a narrow data fraction drives a localized double-descent anomaly in test loss and fundamentally degrades the circuit mechanisms, specifically induction and copying heads, that govern in-context learning (Hernandez et al., 2022). Repeating just 0.1% of a corpus 100 times collapses the downstream performance of an 800M-parameter model to that of a 400M-parameter baseline, demonstrating that even minor distributional redundancies act as massive capital drains.
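The saturation can be made concrete with a short sketch. The functional form below, unique data plus an exponentially saturating contribution from repeats with decay constant 15, is a recollection of the data-constrained fit in Muennighoff et al. (2023) and should be read as an assumption; the function names are hypothetical.

```python
import math


def effective_data(unique_tokens: float, epochs: float, r_star: float = 15.0) -> float:
    """Effective unique-token equivalent of `epochs` passes over `unique_tokens`.

    Assumed form: U * (1 + R* * (1 - exp(-(epochs - 1) / R*))).
    """
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    repeats = epochs - 1.0
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


def repetition_efficiency(epochs: float, r_star: float = 15.0) -> float:
    """Value of the repeated tokens relative to the same number of fresh tokens."""
    return effective_data(1.0, epochs, r_star) / epochs


for epochs in (1, 4, 16, 100):
    print(f"{epochs:3d} epochs -> {repetition_efficiency(epochs):.0%} of fresh-token value")
```

Under this assumed form, four passes retain most of the fresh-token value, sixteen passes lose roughly a third of it, and the effective data can never exceed 16 times the unique set no matter how many passes are paid for.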

Near-duplicates exist on a utility continuum bounded by exact repetition and entirely novel samples. Removing these redundancies improves model generalization while optimizing the token budget for distinct manifolds. Lee et al. (2021) found individual sentences appearing over 60,000 times within the C4 corpus. Redundancy in large-scale corpora necessitates systematic deduplication to mitigate verbatim memorization and accelerate convergence. Mechanistically, a small perturbation forces a model to map a bounded neighborhood ($x$ and $x + \varepsilon$) to identical targets, serving as an implicit consistency regularization. At very small $\varepsilon$, near-duplicates therefore have very low utility; at moderate $\varepsilon$, the regularization is useful; and as $\varepsilon$ expands, the sample becomes a distinct data point. *Densely sampling within a narrow neighborhood rapidly saturates local capacity and hurts model performance.*

Rare, out-of-distribution (OOD) events yield outsized marginal utility because model performance at the scaling limit is constrained by the failure tail. Real-world physical distributions are heavy-tailed; scaling macro-capabilities emerges from mastering a Zipfian distribution of subskills acquired sequentially based on frequency (Michaud et al., 2023). Achieving frontier accuracy requires fitting these rare subpopulations, which collectively constitute a large volume of total operational density (Feldman, 2020). Consequently, optimizing a corpus by filtering for high-difficulty, low-frequency samples can bypass standard power-law scaling constraints entirely (Sorscher et al., 2022). Because these edge cases are rooted in real-world stochasticity, they are intractable to replicate via synthetic generation or structured staging. However, as the model’s known distribution expands, remaining novel variations become exponentially rarer, driving a steep increase in the marginal cost of discovery.

Summary:

More data buys a power law down to a floor.
Diversity lowers the floor at the cost of rate.
Repetition buys little and eventually hurts performance.
Near-duplicate data is the weakest of all, short of a deliberate small perturbation.
Long-tail rare events are very informative, yet increasingly costly to discover.

In language modeling, compute is the binding constraint and data is abundant and low-cost. Conversely in robotics, useful data is strictly constrained by data acquisition costs. Consequently, the objective function shifts from maximizing compute efficiency to maximizing marginal loss reduction per dollar.

The global capability target is modeled as a convex combination over discrete task clusters $j$ with assigned prior weights ($\pi_j$). Each cluster obeys its own scaling envelope conditioned on environmental parameters: a floor $A_j(\phi)$ plus a data-reducible term, with exponent $\beta_j \approx 4/d_j$ set by the cluster’s intrinsic dimension ($d_j$).
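Written out, one form consistent with this description is the following. The prefactor $B_j$ and the per-cluster data argument $D_j$ are names assumed here by analogy with the one-dimensional envelope $L(D) = E + B\,D^{-\beta}$; the essay's exact notation may differ.

```latex
\begin{aligned}
L(\phi) &= \sum_j \pi_j \, L_j(D_j; \phi),
  \qquad \pi_j \ge 0, \quad \sum_j \pi_j = 1 \\
L_j(D_j; \phi) &= A_j(\phi) + B_j \, D_j^{-\beta_j},
  \qquad \beta_j \approx \frac{4}{d_j}
\end{aligned}
```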

To optimize a finite capital allocation, resource expenditure must equalize the marginal value per dollar across all available collection and curation channels.

Interventional Channel: Active demonstration data carries a premium for direct action supervision. It triggers rapid volume saturation, yielding economic utility primarily through cross-task skill transfer. An action channel $i$ purchases directional coverage $D_j = \sum_i w_{ij}\, g_i(n_i)$ at price $c_i$, where $g_i$ saturates and $w_{ij}$ denotes the cross-domain transfer projection from channel $i$ to task cluster $j$. Differentiating this return with respect to capital expenditure gives the marginal utility per dollar, which is the quantity to equalize when allocating resources.

Observational Channel: Passive observational data carries a distinct price. Without action labels, it yields economic utility by optimizing the underlying representation space. It simultaneously suppresses the aleatoric floor and regularizes the scaling coefficient.
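The equal-marginal-value rule can be sketched with a greedy allocator. Everything numeric here is hypothetical: the three channels, their transfer weights, saturation scales, and prices are invented, the saturating return $g_i(n) = 1 - e^{-n/n_{\mathrm{sat}}}$ is an assumed shape, and a single target cluster is used so that $w_{ij}$ reduces to one weight per channel.

```python
import math

# Hypothetical channels: weight (transfer to the target cluster), n_sat (sample
# scale at which the channel saturates) and price (dollars per sample) are made up.
CHANNELS = {
    "teleop":     {"weight": 1.0, "n_sat": 500.0,   "price": 20.0},
    "video":      {"weight": 0.4, "n_sat": 50000.0, "price": 0.05},
    "deployment": {"weight": 0.3, "n_sat": 200.0,   "price": 5.0},
}


def marginal_value_per_dollar(channel, n):
    """Derivative of weight * (1 - exp(-n / n_sat)) with respect to dollars spent."""
    return channel["weight"] * math.exp(-n / channel["n_sat"]) / (channel["n_sat"] * channel["price"])


def allocate(budget, channels, step=10.0):
    """Spend the budget in `step`-dollar increments on the best marginal channel."""
    samples = {name: 0.0 for name in channels}
    spent = 0.0
    while spent + step <= budget:
        best = max(channels, key=lambda name: marginal_value_per_dollar(channels[name], samples[name]))
        samples[best] += step / channels[best]["price"]
        spent += step
    return samples


samples = allocate(10_000.0, CHANNELS)
for name, n in samples.items():
    dollars = n * CHANNELS[name]["price"]
    marginal = marginal_value_per_dollar(CHANNELS[name], n)
    print(f"{name:10s} samples={n:10.1f} dollars={dollars:8.1f} marginal={marginal:.2e}")
```

Because each return curve is concave, spending the next increment wherever the marginal value is highest drives the funded channels toward the same marginal value per dollar, which is the optimality condition stated in the text.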

The floor depends on the sensors. The aleatoric term is irreducible only relative to the information state observed by the specific robot configuration. We formalize the floor on task cluster $j$ under sensing configuration $\phi$ via conditional entropy: $A_j(\phi) = \mathbb{E}\big[H[a \mid s_\phi]\big]$. This total risk decomposes into an absolute physical limit and a sensor-addressable margin.
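One way to write that decomposition, using the $A_j^{\min}$ notation that appears in the viability condition and taking it to be the floor under the best available sensing configuration (an assumption about its definition), is:

```latex
A_j(\phi)
  = \underbrace{A_j^{\min}}_{\text{absolute physical limit}}
  + \underbrace{\bigl(A_j(\phi) - A_j^{\min}\bigr)}_{\text{sensor-addressable margin}},
\qquad
A_j^{\min} = \min_{\phi'} A_j(\phi')
```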

Operational Implication. An environmental variance that a low-resolution sensor cannot resolve manifests as stochastic aleatoric noise to that model, whereas a higher-fidelity sensor converts it into predictable epistemic error. Action data drives the data-reducible term down toward $A_j(\phi)$, while better sensing lowers $A_j(\phi)$ itself.

A task is viable only if the break-even loss threshold is reachable, that is, $A_j(\phi) \ll L_{\mathrm{neutral}}$. If even optimal sensing yields $A_j^{\min} \ge L_{\mathrm{neutral}}$, scaling data volume is mathematically futile: the system requires either hardware reconfiguration or an entirely different operational task.

Production telemetry behaves like an oil well following a steep decline curve: initial operations yield high-entropy failure modes, which rapidly decay into a low-utility, near-duplicate regime as anomalies are resolved. This localized distribution sampling undergoes exponential saturation: $U_{\mathrm{eff}}(n) = U_0 + \Delta U\,(1 - e^{-n/n_c})$. Past the covering number ($n_c$), the production stream collapses into low-utility pure repetition.
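A minimal sketch of the decline curve, implementing the saturation formula above directly. The parameter values in the example (a covering number of 1,000 logged attempts) are illustrative only.

```python
import math


def effective_utility(n: float, u0: float, delta_u: float, n_c: float) -> float:
    """U_eff(n) = U0 + dU * (1 - exp(-n / n_c))."""
    return u0 + delta_u * (1.0 - math.exp(-n / n_c))


def marginal_utility(n: float, delta_u: float, n_c: float) -> float:
    """Utility added by the next logged sample after n samples."""
    return (delta_u / n_c) * math.exp(-n / n_c)


def samples_until_fraction(fraction: float, n_c: float) -> float:
    """Samples needed to capture `fraction` of the reachable gain dU."""
    return -n_c * math.log(1.0 - fraction)


n_c = 1_000.0  # illustrative covering number
for multiple in (0, 1, 3, 5):
    share = marginal_utility(multiple * n_c, 1.0, n_c) / marginal_utility(0.0, 1.0, n_c)
    print(f"after {multiple} x n_c: next sample is worth {share:.1%} of the first one")
print(f"95% of the reachable gain is captured after {samples_until_fraction(0.95, n_c):.0f} samples")
```

The practical reading is that a stream logged far past a few multiples of $n_c$ is being paid for at full price while each additional sample contributes a vanishing share of the original value.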

High-value data is concentrated strictly within the failure tail; routine operational successes contain zero marginal utility. The value is in the failures, not the routine successes. Crucially, in deployment the net cost per datum after revenue offset ($c_{\mathrm{dep}}$) is endogenous, scales non-linearly, and is bounded by the model's instantaneous loss ($L$).

All terms are dollars per logged task-attempt; $v$ is per-task completion value, $\rho(L)$ the error rate, $\kappa_{\mathrm{int}}$ and $\kappa_{\mathrm{prod}}$ the per-attempt intervention and lost-throughput costs. Prior to reaching the operational break-even threshold ($c_{\mathrm{dep}} \approx 0$), data collection operates at a deficit, meaning the early deployment phase must be capitalized as an R&D asset rather than funded by operational revenue.
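To see how these terms interact, the sketch below uses one simple accounting that is consistent with the definitions above but is assumed here for illustration rather than taken from the essay: each failed attempt costs $\kappa_{\mathrm{int}} + \kappa_{\mathrm{prod}}$ and each successful attempt earns $v$. The dollar figures are hypothetical and chosen so that break-even lands at a 0.5% error rate.

```python
def net_cost_per_attempt(error_rate: float, value: float, k_int: float, k_prod: float) -> float:
    """Assumed accounting: failures cost k_int + k_prod, successes earn `value`.

    Positive result = deficit per logged attempt, negative = profit.
    """
    return error_rate * (k_int + k_prod) - (1.0 - error_rate) * value


def break_even_error_rate(value: float, k_int: float, k_prod: float) -> float:
    """Error rate at which the assumed net cost is exactly zero."""
    return value / (value + k_int + k_prod)


# Hypothetical dollars per attempt: a failure costs about 199x what a success earns.
V, K_INT, K_PROD = 1.0, 150.0, 49.0

print(f"break-even error rate: {break_even_error_rate(V, K_INT, K_PROD):.3%}")
for rho in (0.05, 0.02, 0.005):
    cost = net_cost_per_attempt(rho, V, K_INT, K_PROD)
    print(f"error rate {rho:.1%}: net cost per attempt = {cost:+.2f} dollars")
```

Under this assumed accounting, the cost of each datum falls as the model improves, so the price of deployment data is set by the model's own loss rather than by a fixed vendor rate.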

Crossing the gap costs capital. The usual trope is to *start deployment once the model performs at 95% or better, handling errors through interventions, and expect the deployment to become profitable at 99.5%*. We can quantify this intuition as $L_{\mathrm{start}}$ (the largest loss at which deployment can start) and $L_{\mathrm{neutral}}$ (the operational break-even threshold). Between the two, deployment operates at a deficit, requiring external capitalization.²

Notably, the data requirement scales as a power law, so closing this gap takes orders of magnitude more data, hence a very large total cost. Even more importantly, a task whose break-even loss is near its aleatoric floor ($L_{\mathrm{neutral}} \sim A_j(\phi)$) is a capital sink, which is the quantitative case for spending on breadth before scaling deployment.
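The size of the gap follows from inverting the per-cluster envelope. The sketch assumes the form $L = A_j + B_j D^{-\beta_j}$; the floor, prefactor, exponent, and the two loss thresholds are illustrative numbers, not values from the essay.

```python
def data_required(target_loss: float, floor: float, scale: float, beta: float) -> float:
    """Invert L = floor + scale * D**(-beta) for D; infinite at or below the floor."""
    margin = target_loss - floor
    if margin <= 0:
        return float("inf")
    return (scale / margin) ** (1.0 / beta)


FLOOR, SCALE, BETA = 0.01, 1.0, 0.25   # illustrative A_j, B_j, beta_j
L_START, L_NEUTRAL = 0.06, 0.015       # illustrative deployment thresholds

d_start = data_required(L_START, FLOOR, SCALE, BETA)
d_neutral = data_required(L_NEUTRAL, FLOOR, SCALE, BETA)
print(f"data to start deploying: {d_start:.3g}")
print(f"data to break even:      {d_neutral:.3g}")
print(f"ratio:                   {d_neutral / d_start:.3g}x")
```

With these numbers the margin above the floor shrinks tenfold between the two thresholds and the data requirement grows by about four orders of magnitude, since the ratio scales as the margin ratio raised to the power $1/\beta_j$.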

To achieve viable commercial deployment under a sub-optimal foundation model, integrators must artificially constrain environmental variance, effectively collapsing the task's intrinsic dimensionality. As shown in §3.2, reducing the task’s intrinsic dimension to a smaller *dj* ⇒ larger *βj*: convergence steepens but onto a narrow, non-transferable manifold. *Consequently, the data collected within these reachable commercial niches yields low entropy and contains negligible information density to advance a generalized foundation model.*

This creates a self-reinforcing deficit. Structured operational cells yield low-entropy, correlated data. This data fails to expand the model's broader generalization boundary, permanently restricting the system to its initial niche. Operating within fragmented, low-variance niches incurs heavy non-recurring engineering overhead per task. To unlock software-like margins, the marginal integration cost of sequential tasks must asymptotically approach zero. Narrow, low-variance deployment telemetry is incapable of reducing this integration overhead.

This formulation unifies the divergent perspectives within the ecosystem:

Staged Bias (Interventional Data): High action density, but artificially structured. This data type is preferred by model providers (Vedder’s post). It samples from a bounded, clean simulation or laboratory environment. While it maps explicit trajectories well, it fails to sample the chaotic aleatoric failure tail of real-world physics.
Distributional Bias (Deployment Data): High real-world presence, but artificially narrowed. This is the pillar neo-integrators lean on (Beard’s post). To maintain commercial viability, systems are restricted to low-variance operational niches. This data samples from the wrong distributional mixture, yielding low-entropy, correlated streams that fail to drive generalized representation.

Working sequentially from narrow to broad applications is economically viable only if the expansion velocity of deployable tasks outpaces the compounding NRE integration deficit. Because deployment data from commercial niches cannot drive this expansion, the model requires exogenous inputs: *observational breadth* to depress the aleatoric floor (Aj), and *interventional diversity* to extend the model’s generalization boundary. The production flywheel turns only after this baseline breadth is established.

Data engineering pipelines should deprecate cumulative operational hours as a primary metric. Teams should instead track quantifiable task-specific telemetry, ordered by their computational estimability.

Marginal Integration Cost: Track the non-recurring engineering cost per novel task via project accounting. If this metric fails to decay as the task portfolio scales, the underlying model layer is not compounding cross-task representations, shifting the business model from scalable software to linear system integration.
Per-Task Saturation Point ($n_c$): Identify the inflection point where a task-specific or environment-specific learning curve flatlines. Ceasing data collection at this boundary mitigates the main driver of capital waste in manual teleoperation budgets.
Distributional Drift ($v_j$): Monitor the velocity of out-of-distribution (OOD) inputs and the frequency of required model retraining. A non-stationary target distribution continually generates informative failure modes, making it the sole operational regime where continuous deployment telemetry yields a sustained data edge.
Cluster Coverage: Quantify the volume of orthogonal task, object, and environmental clusters within a standard data embedding, rather than tracking raw episodic counts. The longitudinal trend of cluster expansion serves as a proxy for cross-domain generalization.
Data Novelty Density: Proxy the information density of incoming streams using active learning heuristics, such as ensemble disagreement or predictive variance at logged states. This filters out low-entropy, routine operational successes to prioritize the high-utility failure tail.
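As a rough illustration of the Data Novelty Density item, the sketch below scores each logged state by ensemble disagreement and reports the fraction of states above a threshold. The function names, the toy predictions, and the threshold value are all hypothetical; a real pipeline would take predictions from its own model ensemble and tune the threshold empirically.

```python
from statistics import pvariance

def disagreement(predictions):
    """Mean per-dimension variance across ensemble members for one logged state.

    predictions: list of action vectors, one per ensemble member.
    """
    dims = list(zip(*predictions))  # regroup values by action dimension
    return sum(pvariance(d) for d in dims) / len(dims)

def novelty_density(stream, threshold):
    """Return (fraction of states above threshold, indices of those states)."""
    scores = [disagreement(p) for p in stream]
    flagged = [i for i, s in enumerate(scores) if s > threshold]
    return len(flagged) / len(scores), flagged

# Toy stream (made-up numbers): three logged states, 3-member ensemble, 2-D actions.
stream = [
    [[0.10, 0.20], [0.10, 0.20], [0.10, 0.20]],    # routine success: members agree
    [[0.10, 0.20], [0.11, 0.19], [0.09, 0.21]],    # near-routine: tiny spread
    [[0.90, -0.50], [0.10, 0.20], [-0.70, 0.80]],  # strong disagreement: tail candidate
]
density, flagged = novelty_density(stream, threshold=0.05)
print(density, flagged)
```

Only the flagged indices would be kept for curation; the routine states are the low-entropy successes the item suggests filtering out.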

The Boundary of Unmeasurable Uncertainty. The primary variable governing feasibility, the aleatoric floor $A_j(\varphi)$, cannot be directly mapped. It can be approximated by fitting $L(D) = E + B D^{-\beta}$ and isolating the asymptote $E$, but because that approximation is unreliable, the estimate cannot be used directly.
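To make the fitting step concrete, here is a minimal sketch that estimates the asymptote by grid-searching a candidate floor and fitting a straight line in log-log space for each candidate. The function name, the grid resolution, and the synthetic curve are assumptions for illustration, not the post's procedure. The data here is noise-free; with noisy or short learning curves the asymptote is much more weakly constrained, which is consistent with the caution above.

```python
import math

def fit_power_law_floor(D, L, steps=2000):
    """Fit L(D) = E + B * D**(-beta) by grid search over the floor E.

    For each candidate E, log(L - E) = log(B) - beta * log(D) is a line,
    so B and beta come from ordinary least squares in log-log space.
    """
    x = [math.log(d) for d in D]
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    best = None
    for k in range(steps):
        E = min(L) * k / steps  # candidate floor, strictly below the lowest loss
        y = [math.log(l - E) for l in L]
        my = sum(y) / n
        slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
        intercept = my - slope * mx
        sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
        if best is None or sse < best[0]:
            best = (sse, E, math.exp(intercept), -slope)
    _, E, B, beta = best
    return E, B, beta

# Synthetic curve with known parameters (E=0.2, B=3.0, beta=0.5), made up for the demo.
D = [10, 30, 100, 300, 1000, 3000, 10000]
L = [0.2 + 3.0 * d ** -0.5 for d in D]
E_hat, B_hat, beta_hat = fit_power_law_floor(D, L)
print(E_hat, B_hat, beta_hat)
```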

An organization's position within the physical AI supply chain dictates its data visibility, operational focus, and blind spots.

Model-first labs: Focus on pre-training via massive curation and cleaning of cross-embodiment observational corpora. This breadth drives compounding generalization. World-model labs make a bet on data creation to manufacture interventional data cheaply with a learned model. However, heavy reliance on staged interventional data leaves a major bottleneck: the aleatoric failure tail of edge-case deployments, which neither static pre-training nor synthetic simulation can accurately replicate.
Vertically Integrated Players: Focus on deployment telemetry, owning data collection and cleaning directly on proprietary hardware. While hardware-aligned data is efficient, this strategy is bottlenecked. Outside of naturally high-variance domains like autonomous driving, deployed systems face a circularity trap: to survive commercially, they must restrict operations to low-variance environments, which yields low-novelty data that fails to drive broader model generalization.
Neo-integrators: Maintain shallow operational footprints across diverse industrial environments. They are positioned to harvest task diversity (the compounding scaling term). However, their business models typically treat this footprint as a billing surface rather than an active data curation landscape, which is a strategic error.
Teleoperation vendors: Monetize data creation by selling operational hours. Because their economic incentive is to maximize raw volume rather than unique sample coverage, they operate past the per-task saturation threshold ($n_c$). They sell infrastructural utilities (“shovels”) that generate localized revenue but offer no scaling edge.
Hardware Incumbents: Defend profitable, low-variability market segments designed for deterministic motion replay. They collect little data for learning and lack a path up the scaling curve.

The scarcest capability in physical AI is the identification and capture of data novelty. Value will systematically accrue to the operations teams capable of isolating out-of-distribution variations, independent of traditional organizational divisions between research and hardware engineering.

The success of software application layers (e.g., Cursor, Harvey) that rent foundation models by the token suggests value can be captured without prioritizing model pretraining or data novelty. However, economic value capture and model capability are independent variables. While downstream applications successfully monetize workflow integration and proprietary distribution, their technical constraints differ from physical AI across three axes:

Task Dimensionality and Saturation: Data utility is determined by the intrinsic dimensionality of the target domain. Software development exhibits high intrinsic dimensionality, meaning continuous workflow feedback yields ongoing marginal utility. Conversely, many physical tasks (e.g., structured warehouse picking) possess low intrinsic dimensionality; consequently, task-specific data streams saturate rapidly, transitioning quickly into a regime of diminishing returns.
Foundation Asymmetries: Software applications operate downstream of generalized, heavily subsidized foundation models. Because physical AI lacks a comparable rentable foundation layer, current robotics deployment strategies must artificially reduce environmental variation to maintain operational viability. This constraint restricts data collection to specialized subsets that fail to drive broader generalization.
Telemetry and Margin Constraints: Software environments permit complete, low-cost observation of the entire operational loop (e.g., source code, user modifications, and compilation outcomes). Physical telemetry is costly to capture and inherently under-observed due to sensor resolution thresholds (the aleatoric floor). Furthermore, if the foundational observational data for physical AI remains rivalrous and proprietary, leverage will concentrate at the upstream model layer. Infrastructure providers will retain monopoly pricing power, compressing downstream application margins.
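The first axis can be made concrete with a small sketch. It assumes the relation β ≈ 4/d that the post gives for the scaling exponent, together with a pure power law for the reducible loss; the function name is hypothetical. Under those assumptions, halving the reducible loss requires multiplying the data by 2^(1/β) = 2^(d/4), so low-dimensional tasks exhaust their useful data quickly while high-dimensional ones keep paying for more.

```python
def data_multiplier_to_halve_gap(d):
    """Factor by which data must grow to halve the reducible loss.

    Assumes reducible loss ~ D**(-beta) with beta = 4 / d, where d is the
    task's intrinsic dimension (both taken as illustrative assumptions).
    """
    beta = 4.0 / d
    return 2.0 ** (1.0 / beta)

for d in (4, 8, 16):
    print(d, data_multiplier_to_halve_gap(d))
```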

Tracking raw volume (e.g., cumulative operational hours) is an ineffective way to optimize a data budget for model performance. Neither staged interventional data nor narrow deployment data can scale a foundation model in isolation. High-volume interventional data collection in staged settings results in diminishing returns because sample coverage saturates rapidly on a per-task basis. In contrast, relying on commercial deployments as a low-cost data collection strategy introduces an economic trap: commercial niches that yield revenue generally lack the novelty required to improve model generalization. Instead, each new environment incurs non-recurring engineering costs to integrate the narrow model with error handling.

Data engineering pipelines should deprecate cumulative operational hours as a primary metric. Engineering efficiency and model scaling should be evaluated using quantifiable parameters: marginal engineering integration cost per task, per-task saturation thresholds, cluster coverage within data embeddings, and distribution drift ($v_j$).

An optimal capital allocation strategy balances data types against their specific utility metrics:

Observational Breadth: Prioritize low-cost, diverse observational data to lower the aleatoric error floor and expand the baseline capability boundary.
Interventional Staging: Limit high-cost interventional demonstration data strictly to the task’s saturation threshold ($n_c$), reallocating the remaining budget to task diversity rather than repetitive iterations.
Deployment Telemetry: Filter production streams to isolate out-of-distribution edge cases and failure modes, discarding high-volume routine successes that lack information density.
Cost hemorrhage in early deployment: While early deployment may provide some useful signal, continued deployment before breakeven is wasted capital.
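For the Interventional Staging item, one simple way to operationalize the saturation threshold is to stop at the first checkpoint where the next increment of demonstrations improves loss by less than a chosen relative tolerance. This is a sketch under assumptions: the stopping rule, the tolerance, and the learning-curve numbers are hypothetical, and a production version would need to account for evaluation noise.

```python
def saturation_point(ns, losses, min_rel_gain=0.02):
    """Return the first sample count after which the next checkpoint improves
    loss by less than min_rel_gain (relative), or None if the curve never flattens.

    ns: demonstration counts at each checkpoint, ascending.
    losses: held-out loss measured at each checkpoint.
    """
    for i in range(len(ns) - 1):
        rel_gain = (losses[i] - losses[i + 1]) / losses[i]
        if rel_gain < min_rel_gain:
            return ns[i]
    return None

# Made-up learning curve for a single task.
ns = [100, 200, 400, 800, 1600, 3200]
losses = [0.90, 0.60, 0.45, 0.40, 0.395, 0.394]
n_c = saturation_point(ns, losses)
print(n_c)
```

Demonstrations collected beyond the returned count are the budget the list above suggests redirecting toward new tasks.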

Moneyball for Physical AI: Ultimately, capital efficiency scales not by maximizing data volume, but by accurately pricing data novelty.

If you find these frameworks useful for your research or strategic planning, please consider citing this post:

Given $U$ unique examples and $r = T/U$ passes, effective dataset size scales as $D_{\text{eff}} = U \cdot f(r)$, where $f(r)$ saturates exponentially.
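The note above does not specify the form of f(r). The sketch below uses one illustrative choice, f(r) = 1 + R(1 − exp(−(r − 1)/R)), which equals 1 on the first pass and saturates at 1 + R. Both this functional form and the constant R = 15 are assumptions made for the example, not values taken from the post.

```python
import math

def effective_data(U, T, R=15.0):
    """Effective dataset size D_eff = U * f(r) with r = T / U passes.

    f(r) = 1 + R * (1 - exp(-(r - 1) / R)) is an assumed saturating form:
    the first pass counts fully, later passes add exponentially less.
    """
    r = T / U
    if r <= 1.0:
        return float(T)  # less than one full pass: every example seen is unique
    return U * (1.0 + R * (1.0 - math.exp(-(r - 1.0) / R)))

for T in (1_000, 4_000, 40_000, 400_000):
    print(T, round(effective_data(1_000, T)))
```

Under this assumed form, a few repeated passes are worth nearly as much as fresh data, while very heavy repetition adds almost nothing.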

The gap diverges as the margin $\Delta_{\text{safe}}(\varphi) = L_{\text{neutral}} - A_j(\varphi) \to 0$: the data required to reach loss $L$ scales as $(L - A_j)^{-1/\beta_j}$, so as the break-even target approaches the aleatoric floor, the required data, and thus cost, blows up super-linearly even though the exponent $\beta_j$ is fixed.
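The scaling in this note follows from inverting the power law. A short derivation is sketched below, assuming the per-cluster loss takes the same form as the fitted curve, with the floor written as A_j and a coefficient B_j (the subscript on B is notation introduced here). The numeric case β_j = 0.5 is a hypothetical value chosen only to show the effect.

```latex
% Assumed per-cluster loss curve (same shape as L(D) = E + B D^{-\beta}, with E = A_j):
L = A_j + B_j D^{-\beta_j}
\;\Longrightarrow\;
D(L) = \left( \frac{B_j}{L - A_j} \right)^{1/\beta_j}

% Evaluated at the break-even target L = L_{neutral}, with margin \Delta_{safe} = L_{neutral} - A_j:
D^{*} = \left( \frac{B_j}{\Delta_{\text{safe}}} \right)^{1/\beta_j}

% Halving the margin multiplies the required data by a fixed factor:
\frac{D^{*}(\Delta_{\text{safe}}/2)}{D^{*}(\Delta_{\text{safe}})} = 2^{1/\beta_j}
\qquad \text{e.g. } \beta_j = 0.5 \;\Rightarrow\; 2^{2} = 4
```

For any exponent below 1, each halving of the margin costs more than a doubling of data, which is the super-linear growth the note describes.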
