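AI Snake Oil · December 19, 2024, 01:47 · approx. 24 min read

Is AI progress slowing down?

#LLM #InferenceScaling #ModelSize #OpenAI #Anthropic

TL;DR

The authors of AI Snake Oil argue that neither the earlier hype about endless model scaling nor the new claim that model scaling is dead is backed by evidence, and they examine the limits of inference scaling and the weak link between capability gains and societal impact.

AI in-depth analysis · May 3, 2026, 04:06

Importance: / 5

Depth: 40%

Key points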

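Skepticism about the end of scaling

Data exhaustion and other headwinds have been known since GPT-4, but unless there is clear evidence that many new ideas have been tried and have failed, declaring the "death of model scaling" is premature.

Unreliable forecasts from industry leaders

Industry leaders' abrupt reversal on scaling suggests that their forecasts are shaped more by their companies' interests and narratives than by objective evidence.

The reality and limits of inference scaling

Test-time compute (inference scaling) may deliver capability gains in the short term, but those gains are likely to be unpredictable and unevenly distributed across domains.

The gap between capability gains and societal impact

Improvements in AI's technical capabilities do not immediately translate into social or economic impact; the bottlenecks are the pace of product development and the rate of adoption.

Limits of model scaling and the alternative

Labs have reportedly carried out larger training runs without releasing the resulting models, possibly because of technical difficulties or underwhelming performance, and the arrival of o1 lets companies save face by saying they are switching to "inference scaling".

Rethinking deference to industry leaders

The optimistic narrative persisted for a long time even though the potential limits of scaling were obvious, largely because journalists and the public deferred to industry leaders' assurances; whether that deference is justified is open to question.

The potential of multimodal training

Adding YouTube videos to the training mix might unlock new capabilities, but no one will know until someone tries, and since Google is unlikely to license the data to competitors, it would probably have to be Google.

Key quotes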

"The 2010s were the age of scaling, now we're back in the age of wonder and discovery once again."

Declaring the death of model scaling is premature.

The connection between capability improvements and AI's social or economic impacts is extremely weak.

Once there is a leak in the dam, it quickly bursts.

But is this deference justified?

given how many AI companies are close to the state of the art... we're talking about an advantage of at most a few months, which is minor in the context of, say, 3-year forecasts.

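Impact analysis

The article urges a calm, evidence-based reading of the "limits of scaling" debate rather than swings between optimism and pessimism. By pointing to the gap between technical progress and real-world deployment, it cautions investors and developers against getting carried away by short-term technology booms and directs attention to long-term product strategy and adoption.

Editorial comment

A solid article that analyzes the technical-limits debate and industry trends with a cool head. Its questioning of the simple equation "capability gains = societal impact" is especially relevant to current AI investment and strategy.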

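Technical terms:

Model scaling: improving a machine learning model by increasing its size, training data, and training compute.
Inference scaling: improving performance by spending more computation at inference time, for example by having the model "think" before responding.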

By Arvind Narayanan, Benedikt Ströbl, and Sayash Kapoor.

After the release of GPT-4 in March 2023, the dominant narrative in the tech world was that continued scaling of models would lead to artificial general intelligence and then superintelligence. Those extreme predictions gradually receded, but up until a month ago, the prevailing belief in the AI industry was that model scaling would continue for the foreseeable future.

Then came three back-to-back news reports from The Information, Reuters, and Bloomberg revealing that three leading AI developers — OpenAI, Anthropic, and Google Gemini — had all run into problems with their next-gen models. Many industry insiders, including Ilya Sutskever, probably the most notable proponent of scaling, are now singing a very different tune:

“The 2010s were the age of scaling, now we're back in the age of wonder and discovery once again. Everyone is looking for the next thing,” Sutskever said. “Scaling the right thing matters more now than ever.” (Reuters)

The new dominant narrative seems to be that model scaling is dead, and that “inference scaling”, also known as “test-time compute scaling”, is the way forward for improving AI capabilities. The idea is to spend more and more computation when using models to perform a task, such as by having them “think” before responding.

This has left AI observers confused about whether or not progress in AI capabilities is slowing down. In this essay, we look at the evidence on this question, and make four main points:

1. Declaring the death of model scaling is premature.

2. Regardless of whether model scaling will continue, industry leaders’ flip flopping on this issue shows the folly of trusting their forecasts. They are not significantly better informed than the rest of us, and their narratives are heavily influenced by their vested interests.

3. Inference scaling is real, and there is a lot of low-hanging fruit, which could lead to rapid capability increases in the short term. But in general, capability improvements from inference scaling will likely be both unpredictable and unevenly distributed among domains.

4. The connection between capability improvements and AI’s social or economic impacts is extremely weak. The bottlenecks for impact are the pace of product development and the rate of adoption, not AI capabilities.

Is model scaling dead?

There is very little new information that has led to the sudden vibe shift. We’ve long been saying on this newsletter that there are important headwinds to model scaling. Just as we cautioned back then about scaling hype, we must now caution against excessive pessimism about model scaling.

“Scaling as usual” ended with GPT-4 class models, because these models are trained on most of the readily available data sources. We already knew that new ideas would be needed to keep model scaling going. So unless we have evidence that many such ideas have been tried and failed, we can’t conclude that there isn’t more mileage to model scaling.

As just one example, it is possible that including YouTube videos (the actual videos, not transcribed text) in the training mix for multimodal models will unlock new capabilities. Or it might not help; we just won’t know until someone tries it, and we don’t know if it has been tried or not. Note that it would probably have to be Google, because the company is unlikely to license YouTube training data to competitors.

If things are still so uncertain regarding model scaling, why did the narrative flip? Well, it’s been over two years since GPT-4 finished training, so the idea that next-gen models are simply taking a bit longer than expected was becoming less and less credible. And once one company admits that there are problems, it becomes a lot easier for others to do so. Once there is a leak in the dam, it quickly bursts. Finally, now that OpenAI’s reasoning model o1 is out, it has given companies an out when admitting that they have run into problems with model scaling, because they can save face by claiming that they will simply switch to inference scaling.

To be clear, there is no reason to doubt the reports saying that many AI labs have conducted larger training runs and yet not released the resulting models. But it is less clear what to conclude from it. Some possible reasons why bigger models haven’t been released include:

Technical difficulties, such as convergence failures or complications in achieving fault tolerance in multi-datacenter training runs.

The model was not much better than GPT-4 class models, and so would be too underwhelming to release.

The model was not much better than GPT-4 class models, and so the developer has been spending a long time trying to eke out better performance through fine tuning.

To summarize, it’s possible that model scaling has indeed reached its limit, but it’s also possible that these hiccups are temporary and eventually one of the companies will find ways to overcome them, such as by fixing any technical difficulties and/or finding new data sources.

Let’s stop deferring to insiders

Not only is it strange that the new narrative emerged so quickly, it’s also interesting that the old one persisted for so long, despite the potential limitations of model scaling being obvious. The main reason for its persistence is the assurances of industry leaders that scaling would continue for a few more years. In general, journalists (and most others) tend to defer to industry insiders over outsiders. But is this deference justified?

Industry leaders don’t have a good track record of predicting AI developments. A good example is the overoptimism about self-driving cars for most of the last decade. (Autonomous driving is finally real, though Level 5 — full automation — doesn’t exist yet.) As an aside, in order to better understand the track record of insider predictions, it would be interesting to conduct a systematic analysis of all predictions about AI made in the last 10 years by prominent industry insiders.

There are some reasons why we might want to give more weight to insiders’ claims, but also important reasons to give less weight to them. Let’s analyze these one by one. It is true that industry insiders have proprietary information (such as the performance of as-yet-unreleased models) that might make their claims about the future more accurate. But given how many AI companies are close to the state of the art, including some that openly release model weights and share scientific insights, datasets, and other artifacts, we’re talking about an advantage of at most a few months, which is minor in the context of, say, 3-year forecasts.

Besides, we tend to overestimate how much additional information companies have on the inside, whether in terms of capability or (especially) in terms of safety. Insiders warned for a long time that “if only you knew what we know...”, but when whistleblowers finally came forward, it turned out that they were mostly relying on the same kind of speculation that everyone else does.

Another potential reason to give more weight to insiders is their technical expertise. We don’t think this is a strong reason: there is just as much AI expertise in academia as in industry. More importantly, deep technical expertise isn’t that important to support the kind of crude trend extrapolation that goes into AI forecasts. Nor is technical expertise enough — business and social factors play at least as big a role in determining the course of AI. In the case of self-driving cars, one such factor is the extent to which societies tolerate public roads being used for experimentation. In the case of large AI models, we’ve argued before that the most important factor is whether scaling will make business sense, not whether it is technically feasible. So not only do techies not have much of an advantage, their tendency to overemphasize the technical dimensions tends to result in overconfident predictions.

In short, the reasons why one might give more weight to insiders’ views aren’t very important. On the other hand, there’s a huge and obvious reason why we should probably give less weight to their views, which is that they have an incentive to say things that are in their commercial interests, and have a track record of doing so.

As an example, Sutskever had an incentive to talk up scaling when he was at OpenAI and the company needed to raise money. But now that he heads the startup Safe Superintelligence, he needs to convince investors that it can compete with OpenAI, Anthropic, Google, and others, despite having access to much less capital. Perhaps that is why he is now talking about running out of data for pre-training, as if it were some epiphany and not an endlessly repeated point.

To reiterate, we don’t know if model scaling has ended or not. But the industry’s sudden about-face has been so brazen that it should leave no doubt that insiders don’t have any kind of crystal ball and are making similar guesses as everyone else, and are further biased by being in a bubble and readily consuming the hype they sell to the world.

In light of this, our suggestion to everyone, but especially journalists, policymakers, and the AI community, is to end the deference to insiders' views when they predict the future of technology, especially its societal impacts. This will take effort, as there is a pervasive unconscious bias in the U.S., in the form of a “distinctly American disease that seems to equate extreme wealth, and the power that comes with it, with virtue and intelligence” (from Bryan Gardiner’s review of Marietje Schaake’s The Tech Coup).

Will progress in capabilities continue through inference scaling?

Of course, model scaling is not the only way to improve AI capabilities. Inference scaling is an area with a lot of recent progress. For example, OpenAI’s o1 and the open-weights competitor DeepSeek R1 are reasoning models: they have been fine tuned to “reason” before providing an answer. Other methods leave the model itself unchanged but employ tricks like generating many solutions and ranking them by quality.
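The last technique mentioned above, generating many candidates and ranking them, can be sketched in a few lines. This is a toy illustration, not any lab's implementation: `best_of_n`, `propose_factor`, and `divides_391` are hypothetical names, the random guesser stands in for sampling from an unchanged model, and the divisibility check stands in for whatever quality score a real system would use.

```python
import random
from typing import Callable, List


def best_of_n(generate: Callable[[random.Random], int],
              score: Callable[[int], float],
              n: int,
              seed: int = 0) -> int:
    """Sample n candidates from an unchanged generator and return the top-scored one."""
    rng = random.Random(seed)
    candidates: List[int] = [generate(rng) for _ in range(n)]
    # Ranking is the only thing added at inference time; the generator is untouched.
    return max(candidates, key=score)


def propose_factor(rng: random.Random) -> int:
    """Hypothetical stand-in for a model: guess a factor of 391 at random."""
    return rng.randint(2, 99)


def divides_391(candidate: int) -> float:
    """Hypothetical quality score: 1.0 if the guess divides 391, else 0.0."""
    return 1.0 if 391 % candidate == 0 else 0.0


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        guess = best_of_n(propose_factor, divides_391, n)
        print(n, guess, divides_391(guess))
```

Drawing more samples costs more compute at inference time while the generator stays the same; how far that trade keeps paying off is the open question raised in the rest of this section.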

There are two main open questions about inference scaling that will determine how significant of a trend it will be.

1. What class of problems does it work well on?

2. For problems where it does work well, how much of an improvement is possible by doing more computation during inference?

The per-token output cost of language models has been rapidly decreasing due to both hardware and algorithmic improvements, so if inference scaling yields improvements over many orders of magnitude (for example, if generating a million tokens on a given task yields significantly better performance than generating a hundred thousand tokens), that would be a big deal.

The straightforward, intuitive answer to the first question is that inference scaling is useful for problems that have clear correct answers, such as coding or mathematical problem solving. In such tasks, at least one of two related things tends to be true. First, symbolic reasoning can improve accuracy. This is something LLMs are bad at due to their statistical nature, but can overcome by using output tokens for reasoning, much like a person using pen and paper to work through a math problem. Second, it is easier to verify correct solutions than to generate them (sometimes aided by external verifiers, such as unit tests for coding or proof checkers for mathematical theorem proving).

In contrast, for tasks such as writing or language translation, it is hard to see how inference scaling can make a big difference, especially if the limitations are due to the training data. For example, if a model works poorly in translating to a low-resource language because it isn’t aware of idiomatic phrases in that language, the model can’t reason its way out of this.

The early evidence we have so far, while spotty, is consistent with this intuition. Focusing on OpenAI o1, it improves compared to state-of-the-art language models such as GPT-4o on coding, math, cybersecurity, planning in toy worlds, and various exams. Improvements in exam performance seem to strongly correlate with the importance of reasoning for answering questions, as opposed to knowledge or creativity: big improvements for math, physics and LSATs, smaller improvements for subjects like biology and econometrics, and negligible improvement for English.

Tasks where o1 doesn’t seem to lead to an improvement include writing, certain cybersecurity tasks (which we explain below), avoiding toxicity, and an interesting set of tasks at which thinking is known to make humans worse.

We have created a webpage compiling the available evidence on how reasoning models compare against language models. We plan to keep it updated for the time being, though we expect that the torrent of findings will soon become difficult to keep up with.

Now let’s consider the second question: how large of an improvement can we get through inference scaling, assuming we had an infinite inference compute budget?

OpenAI’s flagship example to show off o1’s capabilities was AIME, a math benchmark. Their graph leaves this question tantalizingly open. Is the performance about to saturate, or can it be pushed close to 100%? Also note that the graph conveniently leaves out x-axis labels.

Source: OpenAI

An attempt by external researchers to reconstruct this graph shows that (1) the cutoff for the x-axis is probably around 2,000 tokens, and (2) when o1 is asked to think longer than this, it doesn’t do so. So the question remains unanswered, and we need to wait for experiments using open-source models to get more clarity. It is great to see that there are vigorous efforts to publicly reproduce the techniques behind o1.

In a recent paper called Inference Scaling fLaws (the title is a pun on inference scaling laws), we look at a different approach to inference scaling: repeatedly generating solutions until one of them is judged as correct by an external verifier. While this approach has been associated with hopes of usefully scaling inference compute by many orders of magnitude (including by us in our own past work), we find that it is extremely sensitive to the quality of the verifier. If the verifier is slightly imperfect, in many realistic settings of a coding task, performance maxes out and actually starts to decrease after about 10 attempts.
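The sensitivity to verifier quality described above can be illustrated with a small calculation. This is a simplified sketch under assumptions made here for illustration, not the model used in the paper: attempts are independent, the verifier never rejects a correct solution, it wrongly accepts an incorrect one with probability `false_positive_rate`, an accepted correct answer is worth +1, and an accepted wrong answer costs `cost_of_error`. The function names and all parameter values are hypothetical.

```python
from typing import List, Tuple


def expected_value(p_correct: float, false_positive_rate: float,
                   cost_of_error: float, attempts: int) -> float:
    """Expected payoff of resampling until the verifier accepts, up to `attempts` tries."""
    accept_right = p_correct                                 # correct and accepted
    accept_wrong = (1.0 - p_correct) * false_positive_rate   # wrong but accepted
    retry = 1.0 - accept_right - accept_wrong                # rejected, so try again
    if retry >= 1.0:
        return 0.0  # nothing is ever accepted
    # Geometric sum of retry**j for j < attempts: the chance of reaching each try.
    reach = (1.0 - retry ** attempts) / (1.0 - retry)
    return (accept_right - cost_of_error * accept_wrong) * reach


def benchmark_curve(problems: List[Tuple[float, float]], false_positive_rate: float,
                    cost_of_error: float, max_attempts: int) -> List[float]:
    """Weighted expected payoff for budgets 1..max_attempts over (weight, p_correct) pairs."""
    return [
        sum(w * expected_value(p, false_positive_rate, cost_of_error, k)
            for w, p in problems)
        for k in range(1, max_attempts + 1)
    ]


if __name__ == "__main__":
    # Made-up mix: half the problems are easy (p=0.5), half are hard (p=0.01).
    mix = [(0.5, 0.5), (0.5, 0.01)]
    for fpr in (0.0, 0.05):
        curve = benchmark_curve(mix, fpr, cost_of_error=2.0, max_attempts=50)
        best = max(range(len(curve)), key=curve.__getitem__) + 1
        print(f"false_positive_rate={fpr}: best budget={best}, "
              f"value at best={curve[best - 1]:.3f}, value at 50={curve[-1]:.3f}")
```

With these made-up numbers, a perfect verifier makes every extra attempt worthwhile, while a 5% false-positive rate makes the payoff peak after a few attempts and then fall, because further retries on hard problems mostly surface wrong answers that slip past the verifier.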

Generally speaking, the evidence for inference scaling “laws” is not convincing, and it remains to be seen if there are real-world problems where generating (say) millions of tokens at inference time will actually help.

Is inference scaling the next frontier?

There is a lot of low-hanging fruit for inference scaling, and progress in the short term is likely to be rapid. Notably, one current limitation of reasoning models is that they don’t work well in agentic systems. We have observed this in our own benchmark CORE-Bench, which asks agents to reproduce the code provided with research papers: the best performing agent scores 38% with Claude 3.5 Sonnet compared to only 24% with o1-mini. This also explains why reasoning models led to an improvement in one cybersecurity eval but not another: one of them involved agents.

We think there are two reasons why agents don’t seem to benefit from reasoning models. First, such models require different prompting styles than regular models, and current agentic systems are optimized for prompting regular models. Second, as far as we know, reasoning models so far have not been trained using reinforcement learning in a setting where they receive feedback from the environment, be it code execution, shell interaction, or web search. In other words, their tool use ability is no better than the underlying model before learning to reason.

These seem like relatively straightforward problems. Solving them might enable significant new AI agent capabilities — for example, generating complex, fully functional apps from a prompt. (There are already tools that try to do this, but they don’t work well.)

But what about the long run? Will inference scaling lead to the same kind of progress we’ve seen with model scaling over the last 7 years? Model scaling was so exciting because you “merely” needed to make data, model size, and compute bigger; no algorithmic breakthroughs were needed.

That’s not true (so far) with inference scaling — there’s a long list of inference scaling techniques, and what works or doesn’t work is problem-dependent, and even collectively, they only work in a circumscribed set of domains. AI developers are trying to overcome this limitation. For example, OpenAI’s reinforcement finetuning service is thought to be a way for the company to collect customer data from many different domains for fine-tuning a future model.

About a decade ago, reinforcement learning (RL) led to breakthroughs in many games like Atari. There was a lot of hype, and many AI researchers hoped we could RL our way to AGI. In fact, it was the high expectations around RL that led to the birth of explicitly AGI-focused labs, notably OpenAI. But those techniques didn’t generalize beyond narrow domains like games. Now there is similar hype about RL again. It is obviously a very powerful technique, but so far we’re seeing limitations similar to the ones that led to the dissipation of the previous wave of hype.

It is impossible to predict whether progress in AI capabilities will slow down. In fact, forget prediction — reasonable people can have very different opinions on whether AI progress has already slowed down, because they can interpret the evidence very differently. That’s because “capability” is a construct that’s highly sensitive to how you measure it.

What we can say with more confidence is that the nature of progress in capabilities will be different with inference scaling than with model scaling. In the last few years, newer models predictably brought capability improvements each year across a vast swath of domains. There was a feeling of pessimism among many AI researchers outside the big labs that there was little to do except sit around and wait for the next state-of-the-art LLM to be released.

With inference scaling, capability improvements will likely be uneven and less predictable, driven more by algorithmic advances than investment in hardware infrastructure. Many ideas that were discarded during the reign of LLMs, such as those from the old planning literature, are now back in the mix, and the scene seems intellectually more vibrant than in the last few years.

Product development lags capability increase

The furious debate about whether there is a capability slowdown is ironic, because the link between capability increases and the real-world usefulness of AI is extremely weak. The development of AI-based applications lags far behind the increase of AI capabilities, so even existing AI capabilities remain greatly underutilized. One reason is the capability-reliability gap — even when a certain capability exists, it may not work reliably enough that you can take the human out of the loop and actually automate the task (imagine a food delivery app that only works 80% of the time). And the methods for improving reliability are often application-dependent and distinct from methods for improving capability. That said, reasoning models also seem to exhibit reliability improvements, which is exciting.
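To put a number on the 80% example: if each use of a product succeeds independently with probability 0.8, the chance that several uses in a row all work falls off quickly. The independence is a simplifying assumption added here, and `prob_all_succeed` is a hypothetical helper.

```python
def prob_all_succeed(per_use_reliability: float, uses: int) -> float:
    """Probability that `uses` independent uses all succeed."""
    if not 0.0 <= per_use_reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if uses < 0:
        raise ValueError("uses must be non-negative")
    return per_use_reliability ** uses


if __name__ == "__main__":
    for uses in (1, 3, 5, 10):
        print(uses, round(prob_all_succeed(0.8, uses), 3))
```

At 80% per use, ten uses in a row all succeed only about 11% of the time, which is one way to see why a capability that mostly works can still require a human in the loop.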

Here are a couple of analogies that help illustrate why it might take a decade or more to build products that fully take advantage of even current AI capabilities. The technology behind the internet and the web mostly solidified in the mid-90s. But it took 1-2 more decades
