Import AI · February 16, 2026, 23:01 · about a 22-minute read

Import AI 445: Timing superintelligence; AIs solve frontier math proofs; a new ML research benchmark


TL;DR

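An economist argues that demand for the “human touch” will persist even as AI automation advances, and Meta reports a more efficient recommendation system along with scaling laws for it.

AI deep analysis · April 27, 2026, 13:46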


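Key points

What the economics of the “human touch” implies

Even as AI becomes more capable, demand for humanness in things like live music and hospitality behaves as a “normal good” that rises with income, which points to potential new job creation.

Meta publishes its Kunlun recommendation system

Meta developed Kunlun, a more efficient recommendation system, advancing the technology that underpins its ad revenue and demonstrating predictable scaling laws.

Changes in employment structure during the AI revolution

The hypothesis offered: as large-scale automation proceeds, occupations centered on human-to-human contact expand, and policy plus economic growth could raise wages for these jobs substantially.

Scaling laws found for recommendation models

Facebook identified predictable scaling laws for the Kunlun system, making the returns on very large compute investments more dependable.

Kunlun substantially improves compute efficiency

With its new architecture, Kunlun improved Model FLOPs Utilization (MFU) on NVIDIA B200 GPUs from 17% to 37%.

Efficiency differences between recommendation models and LLMs

Recommendation models are less efficient than LLMs (3-15% MFU) because of heterogeneous feature spaces and memory constraints, but optimization narrowed the gap.

Predictable scaling laws for Kunlun models

For Meta's Kunlun models, normalized entropy scales as a predictable power law in compute and in layer count, and deployment in ads models improved topline metrics by 1.2%.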


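Impact analysis

The issue suggests that progress in AI may not only eliminate jobs but also create new economic sectors that place value on humanness, which is a useful perspective for social policy debates. At the same time, optimization of foundational technology (recommendation systems) by large tech companies such as Meta, together with the discovery of scaling laws for it, raises the cost efficiency of AI development and intensifies competition.

Editorial comment

An interesting issue that lets the reader consider the future of AI from both the hard side, technical progress (Meta's Kunlun), and the soft side, social acceptance (demand for the human touch).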

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.

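Economist: Don’t worry about AI-driven unemployment, because people like paying for the ‘human touch’:

…Even when something can technically be automated, people may still pick a human… Adam Ozimek, chief economist at the Economic Innovation Group, notes in a blog post that even if AI improves dramatically and can do all the work that people do, some jobs for humans will remain, because in certain domains people prefer humans to machines.

“There are many jobs and tasks that easily could have been automated by now – the technology to automate them has long existed – and yet we humans continue to do them,” he writes. “The reason is that demand will always exist for certain jobs that offer what I call ‘the human touch.’”

Some examples: live music, actors, waiters, travel agents, and many types of sales jobs. It also seems that the more you are willing to spend on a given good or experience, the more contact with people you want: “the human touch also appears to be what economists call a ‘normal good,’ which means the demand for it goes up as income goes up,” he writes. Examples here might include fancy restaurants and concierge-like experiences.

Why this matters – one path through the AI revolution could be a shift toward human-to-human work: My assumption is that ‘people like people’, so even if AI automates huge chunks of the current economy there is a high chance of a boom in demand for ‘human artisans’ in new jobs we can’t yet imagine, along with refinement of existing human professions. There is also a chance that a combination of economic growth and progressive policy work from governments pushes wages for these jobs up massively.

Read more: AI and the Economics of the Human Touch (Agglomerations, Substack).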

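Facebook makes a better recommender system, and figures out some recommender scaling laws:

…Kunlun is another nice example of what industrial AI looks like…

Facebook has published details on Kunlun, a recommendation system that is more efficient than previous ones developed by the ad behemoth. Alongside this, the company has worked out a predictable scaling law for Kunlun models, which makes it easier to invest unprecedented amounts of compute in these models for a more predictable return. This is a big deal because recommendation systems are what companies like Facebook use for advertising, which is how they make the vast majority of their money, and because these systems have a tremendous impact on the buying and attention habits of the billions of people who use Facebook and other social platforms.

Recommenders are different to LLMs: Scaling laws for LLMs like Claude and ChatGPT have been established for a while, but developing the same kind of scaling laws for recommender models has been harder. Recommender models work quite differently to LLMs, so building scaling models here is “an open challenge for systems that jointly model both sequential user behaviors and non-sequential context features”.

Recommender models also tend to be a lot less efficient than LLMs: recommendation systems achieve only 3-15% Model FLOPs Utilization (MFU), compared to 40-60% for LLMs, because heterogeneous feature spaces lead to small embedding dimensions, irregular tensor shapes, and memory-bound operations.

Kunlun: The bulk of the paper discusses the design of Kunlun, which is basically a well-optimized recommender system with better MFU as a result. Kunlun contains a Kunlun Transformer Block for context-aware sequence modeling, built from personalized feed-forward networks enhanced with GDPA (Generalized Dynamic Positional Attention) and multi-head self-attention, as well as a Kunlun Interaction Block “for bidirectional information exchange through personalized weight generation, hierarchical sequence summarization, and global feature interaction”. Facebook used a number of other tricks to build Kunlun, and the paper has the details. Ultimately, Kunlun improves MFU from 17% to 37% on NVIDIA B200 GPUs.

Why this matters – a scaling law for money: The key insight in the paper is that Kunlun models scale predictably, exhibiting the kind of power-law scaling behavior that language models exhibit. Where LLM scaling laws are typically assessed via a reduction in loss on an underlying dataset, here the measure is normalized entropy (NE). In Facebook’s experiments, the researchers find a reliable scaling law for NE gains as a function of the gigaflops put into training the model, and a related scaling law for NE improvement as a function of the number of layers used.

The Kunlun models have been “deployed across major Meta Ads models, delivering a 1.2% improvement in topline metrics”.

What we’re seeing here is some of the most societally significant AI systems in the world, the ones that direct billions of eyeballs towards a variety of products and online information, colliding with a greater degree of performance predictability. By developing these scaling laws, Meta has made the intelligence return on capital investment more predictable, which makes it easier to spend even more compute on improving these models.

Read more: Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design (arXiv).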

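Superintelligence could save and extend lives, so we should go for it:

…Pausing or slowing down might make sense at the very end of the exponential, but it’s risky…

Nick Bostrom, the academic who introduced many people to the notion of superintelligence and AI risk, has written a paper laying out the idea that if superintelligence can improve human health, then it is worth pursuing even if there is a non-zero chance of it causing the death of the species.

“Yudkowsky and Soares maintain that if anyone builds AGI, everyone dies. One could equally maintain that if nobody builds it, everyone dies”, Bostrom writes in Optimal Timing for Superintelligence. “If the transition to the era of superintelligence goes well, there is tremendous upside both for saving the lives of currently existing individuals and for safeguarding the long-term survival and flourishing of Earth-originating intelligent life. The choice before us, therefore, is not between a risk-free baseline and a risky AI venture. It is between different risky trajectories, each exposing us to a different set of hazards.”

Why we should pursue superintelligence, even with a chance of doom: If you think about all the humans alive today and the different life expectancies they experience, especially those in the developing world, you are drawn to the view that every moment wasted in deploying superintelligence increases human suffering.

“When we take both sides of the ledger into account, it becomes clear that our individual life expectancy is higher if superintelligence is developed reasonably soon. Moreover, the life we stand to gain would plausibly be of immensely higher quality than the life we risk forfeiting,” Bostrom writes.

Key variables: The key variables here are, of course, the risk of a superintelligence killing us all and the rate at which safety research can reduce that risk. Under this view, developing superintelligence becomes a favorable thing to do under most circumstances.

The speed of progress and maturity of AI safety research may have some impact on the timeline: “When the initial risk is low, the optimal strategy is to launch AGI as soon as possible – unless safety progress is exceptionally rapid, in which case a brief delay of a couple of months may be warranted. As the initial risk increases, optimal wait times become longer. But unless the starting risk is very high and safety progress is sluggish, the preferred delay remains modest, typically a single-digit number of years”.

On pausing, and the dangers and benefits thereof: Many people in the AI safety community want some kind of pause in AI development to buy more time for AI safety research. Bostrom is quite skeptical that a pause will be effective and outlines some of the undesirable effects it could have:

Too early: If you do it early, people conclude that pauses are ineffective.

Bad regulation: Bad regulation chokes off or delays good things in the future.

Pause, except for natsec: Very little broad social benefit, while a military with access to powerful AI becomes very scary.

Prolonged danger: The world stays exposed to risks from current AI without the defenses afforded by more advanced AI.

Why this matters – pausing may only make sense right at the end, and this is inherently risky: Bostrom eventually arrives at the view that, to the extent you want to pause or slow development, it is best to do so when you have the greatest confidence that a pause would be effective, would help reduce the chance of species death, and is not coming too early. This allows for the greatest amount of deliberation about how to roll out a superintelligence without risking an undue pause.

Critics of this view might say it is akin to recommending that someone try to catch a falling knife. Catch the knife too early and you experience a tremendous amount of pain. Catch it too late and you have missed your chance, and gravity conspires with the knife to cause great harm to whatever is beneath it. You have to time things just right.

Bostrom summarizes his position as: “swift to harbor, slow to berth: move quickly towards AGI capability, and then, as we gain more information about the remaining safety challenges and specifics of the situation, be prepared to possibly slow down and make adjustments as we navigate the critical stages of scaleup and deployment. It is in that final stage that a brief pause could have the greatest benefit.”

Read more: Optimal Timing for Superintelligence (Nick Bostrom, PDF).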

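Can AI agents independently do basic AI research tasks? AIRS-BENCH says yes:

…And we can expect today’s models to be much better at this than the paper suggests…

Researchers with Meta, the University of Oxford, and University College London have built and released the AI Research Science Benchmark (AIRS-BENCH), a way of testing how well AI systems can complete contemporary machine learning tasks.

What AIRS-BENCH is made of: AIRS-BENCH tests how well agents can solve 20 distinct tasks, sourced from 17 recent machine learning papers. The tasks span a variety of technical genres, including molecules and proteins machine learning, question answering, text extraction and matching, time series, text classification, code, and math.

Some example tasks:

CodeGenerationAPPSPassAt5: Solve coding problems by generating five distinct Python programs for each problem.

CoreferenceResolutionWinograndeAccuracy: Identify which of two possible options a pronoun in a sentence refers to. It uses the Winogrande dataset, which contains sentences with an ambiguous pronoun and two possible answers.

TimeSeriesForecastingRideshareMAE: Perform time series forecasting over the Rideshare dataset, which is part of the Monash Time Series Forecasting Repository.

Results: Real problems, crappy models: This is a somewhat perplexing benchmark. The tasks are interesting and wrap in a lot of complexity, but the paper only tests relatively weak models, such as the Code World Model, o3-mini, gpt-oss-20b, gpt-oss-120b, GPT-4o, and Devstral-Small 24B. This is a very funny set of models, and none of them are true frontier ones. One of the paper authors wrote on Twitter that “this took some time to get out”, so this could just be an artifact of slow publishing timelines.

In tests, none of the models are on par with the Elo rating of a best-in-class human, but I’m not sure what to make of this until I see results with more powerful models.

Why this matters – models might produce different solutions to humans, and this is a cool way of studying if there’s a ‘scaling law’ here: One way this could be interesting is in understanding the different ways models solve tasks relative to humans. In one example, TextualClassificationSickAccuracy, models had to determine whether a pair of sentences have a relationship involving either entailment, contradiction, or no relationship.

SOTA from the literature is a person fine-tuning RoBERTa on the underlying training set and testing on the test set. By comparison, the best tested AIRS-BENCH agent, GPT-OSS-120B, “produces a two-level stacked ensemble that combines multiple transformer models and a meta-learner. RoBERTa-large and DeBERTa-v3-large are independently fine-tuned on the SICK training set. Each model processes sentence pairs and outputs logits for each class. The training is performed using 5-fold stratified cross-validation, ensuring robust out-of-fold (OOF) predictions and preventing overfitting. The logits from both base models are concatenated to form a feature vector for each example.”

This is extremely complicated! But it is also interesting, in that perhaps we can learn something about the progression in agents by looking at how the simplicity of their solutions scales with size, where naively I’d expect more powerful models to arrive at simpler solutions. As Blaise Pascal once apocryphally said, “I have only made this letter longer because I have not had the time to make it shorter”.

Read more: AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents (arXiv).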

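Math researchers see if AI can reproduce their private solutions to frontier problems. The answer: Kind of.

…First Proof is a genuinely held out test set…

A group of mathematicians have built First Proof, a math test which sees how well AI systems can solve math problems for which there are no published solutions until February 13th 2026.

What First Proof is: “We share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time,” the authors write. The questions are “drawn from the mathematical fields of algebraic combinatorics, spectral graph theory, algebraic topology, stochastic analysis, symplectic geometry, representation theory, lattices in Lie groups, tensor analysis, and numerical linear algebra, each of which came about naturally in the research process for one of the authors”.

The authors believe First Proof is the first math benchmark “sampled from the true distribution of questions that mathematicians are currently working on”, and that it has the idiosyncratic advantage of secrecy: “each question has been solved by the author(s) of the question with a proof that is roughly five pages or less, but the answers are not yet posted to the internet,” they write, nor have the answers been presented in public talks. The authors will release the answers on February 13.

Who did it: First Proof was built by researchers with Stanford, Columbia, EPFL, Imperial College London, University of Texas at Austin, MathSci.ai, Aarhus University, Yale University, University of California at Berkeley, University of Chicago, and Harvard University.

Today’s AI systems can’t yet do it: Neither GPT 5.2 Pro nor Gemini 3.0 DeepThink can solve FirstProof yet. “Our tests indicate that – when the system is given one shot to produce the answer – the best publicly available AI systems struggle to answer many of our questions,” they write.

Why this matters – a partial test of creativity: The main reason to care about First Proof is that it is ecologically valid when it comes to sampling frontier human creativity circa January 2026. These are frontier scientific problems for which some humans have figured out answers but have not yet told many other humans about their results. If AI systems are able to do well at this kind of test, it gives us a clue that they can approximate some of the same creative leaps which humans make. I hope the authors behind First Proof do this regularly. In a maximalist view, perhaps most scientific researchers should start publishing the questions they are working on ahead of the results, as this would tell us whether AI systems can arrive at the same answers.

After First Proof, I imagine the frontier of evaluating AI systems will have to move from solving problems to generating questions about which problems to solve: “Contrary to the popular conception that research is only about finding solutions to well-specified, age-old problems (e.g., Fermat’s Last Theorem), most of the important parts of modern research involve figuring out what the question actually is and developing frameworks within which it can be answered,” the researchers write.

Read more: First Proof (arXiv).

Find out more at the website (First Proof).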

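Pray that the lidless eye of fame does not find you.

[Hyperfame was an AI-driven phenomenon that was most pronounced during years 1-3 of the uplift]

We called it “sudden hyperfame”. During the uplift, the AIs would sometimes decide that a particular human’s content and persona deserved attention from machines and biologicals alike. And that was when hyperfame kicked in.

Overnight, people would be plucked from obscurity and thrust to the forefront of public consciousness. They would be showered with eyeballs, digital and otherwise. Wealth. Sponsorships.

Parents compared it to kidnapping. One day they had their teenager, and the next their child was being pulled on strings like a puppet by something reaching for them through the digital ether. Hyperfame took the young and the old, the healthy and the sick, the funny and those so boring they became funny, and for a few days, or sometimes just a few hours, made them the most famous beings in the world.

And then it would move on, like some roving lidless eye, finding new people and turning new attention on them. Those it had touched were often left materially transformed, now possessed of enormous wealth, but their whole world had changed as well. Even years after being recognized in the street, their online presence remained perpetually swarmed by AIs trying to take attention from what little fame was left.

People get used to fame surprisingly fast. Even after the force of hyperfame has gone, most will try desperately to hold on to it. And so those it touched would struggle to keep forever the notoriety they had at the moment it left, made to perform their former selves without the helping hand of the algorithm.

Things that inspired this story: What happens when the attention economy fuses with AI agents; Moltbook; the corrupting influence of fame on the human psyche; my own fear of occasionally being recognized in the street because of my work at Anthropic and my rising profile, and winding the clock forward in my head to think about what this could do to my own cognition.

Thanks for reading!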


Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.

Economist: Don’t worry about AI-driven unemployment, because people like paying for the ‘human touch’:

…Even when you have the technology to automate something, you might still pick a human…Adam Ozimek, chief economist at the Economic Innovation Group, has written a blog noting that even if AI gets much, much better and is capable of doing all the work that people do, there will still be some jobs for humans because people seem to have a preference for humans over machines in certain domains.

“There are many jobs and tasks that easily could have been automated by now – the technology to automate them has long existed – and yet we humans continue to do them,” he writes. “The reason is that demand will always exist for certain jobs that offer what I call ‘the human touch.’”

Some examples here: Live music, actors, waiters, travel agents, and many types of sales jobs. And it seems like as you want to spend more and more on a given good or experience, you may want more contact with people: “the human touch also appears to be what economists call a ‘normal good,’ which means the demand for it goes up as income goes up,” he writes. Some examples here might include fancy restaurants, and other concierge-like experiences.

Why this matters – one path through the AI revolution could be a rise in human-to-human work: My assumption is that ‘people like people’, and there is a high chance that even if AI automates huge chunks of the current economy there will be a boom in demand for ‘human artisans’ for a range of new jobs we can’t yet imagine, and for refinement of existing human professions. There’s also a chance that through a combination of economic growth and progressive policy work from governments that wages for these jobs could go up massively.

Read more: AI and the Economics of the Human Touch (Agglomerations, Substack).

Facebook makes a better recommender system, and figures out some recommender scaling laws:

…Kunlun is another nice example of what industrial AI looks like…

Facebook has published details on Kunlun, a recommendation system which is more efficient than previous ones developed by the ad behemoth. Along with this, Facebook has also figured out a predictable ‘scaling law’ for Kunlun models, making it easier for the company to invest hitherto unprecedented compute in these models for a more predictable return. This is a big deal because recommendation systems are what companies like Facebook use for advertising, which is both a) how they make the vast majority of their money, and b) has a tremendous impact on the buying and attention habits of the billions of people that use Facebook and other social platforms.

Recommenders are different to LLMs: We’ve had scaling laws for LLMs like Claude and ChatGPT for a while, but it’s been harder to develop the same scaling laws for recommender models. This is because recommender models work quite differently to LLMs, and so building scaling models here is “an open challenge for systems that jointly model both sequential user behaviors and non-sequential context features”.

Recommender models also tend to be a lot less efficient than LLMs: Recommendation systems achieve only 3-15% Model FLOPs Utilization (MFU), compared to 40-60% for LLMs, due to heterogeneous feature spaces resulting in small embedding dimensions, irregular tensor shapes, and memory-bound operations.
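To make the MFU figures concrete, here is a minimal sketch of the usual definition: the model FLOPs actually executed per second, divided by the hardware's peak FLOPs per second. The `PEAK` value is a made-up placeholder, not a B200 specification, and the function names are illustrative only.

```python
def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: share of peak hardware throughput
    that is spent on the model's own arithmetic."""
    if peak_flops_per_s <= 0:
        raise ValueError("peak_flops_per_s must be positive")
    return achieved_flops_per_s / peak_flops_per_s


def throughput_gain(old_mfu: float, new_mfu: float) -> float:
    """On identical hardware, useful throughput scales with the MFU ratio."""
    return new_mfu / old_mfu


# Hypothetical peak throughput, chosen only to make the arithmetic readable.
PEAK = 1.0e15

baseline = mfu(0.17 * PEAK, PEAK)   # the 17% figure quoted for the starting point
improved = mfu(0.37 * PEAK, PEAK)   # the 37% figure quoted for Kunlun
gain = throughput_gain(baseline, improved)
print(f"useful throughput gain on the same GPUs: {gain:.2f}x")
```

Under this definition, going from 17% to 37% MFU means a bit more than twice the useful work from the same accelerators.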

Kunlun: The bulk of the paper involves a discussion of the design of Kunlun, which is basically a well optimized recommender system with resulting better MFU. Kunlun contains a Kunlun Transformer Block for context-aware sequence modeling via GDPA-enhanced personalized feed-forward networks and multi-head self-attention, as well as a Kunlun Interaction Block “for bidirectional information exchange through personalized weight generation, hierarchical sequence summarization, and global feature interaction”. There are a bunch of other tricks Facebook used to build Kunlun and you can read the paper to learn more. Ultimately, Kunlun improves MFU from 17% to 37% on NVIDIA B200 GPUs.

Why this matters – a scaling law for money: The key insight in the paper is that Kunlun models scale predictably, exhibiting the kind of power-law scaling behavior that language models exhibit. But where with LLMs scaling laws are typically assessed via a reduction in loss on an underlying dataset, here the measure is normalized entropy (NE). In Facebook’s experiments, the researchers discover reliable scaling laws for NE gains in terms of the amount of gigaflops dumped into training the model, as well as related scaling laws for improvement in NE according to the number of layers used.
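As a rough illustration of what "predictable power-law scaling" buys you, the sketch below fits a curve of the form y = a * x^(-b) by linear regression in log-log space and then extrapolates to a larger compute budget. The functional form, the data points, and the constants are all synthetic assumptions for illustration; none of them come from the Kunlun paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b), done as a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic "metric vs. training compute" points generated from a known curve.
TRUE_A, TRUE_B = 2.0, 0.05
compute = [1e3, 1e4, 1e5, 1e6]                      # hypothetical compute budgets
metric = [TRUE_A * c ** (-TRUE_B) for c in compute]  # hypothetical NE-like metric

a, b = fit_power_law(compute, metric)
predicted = a * (1e7) ** (-b)  # extrapolate one order of magnitude further
print(f"a={a:.3f}, b={b:.3f}, predicted metric at 1e7: {predicted:.4f}")
```

The point of such a fit is the last line: once the exponent is stable, the expected metric at a budget nobody has trained at yet can be estimated before the compute is spent.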

The Kunlun models have been “deployed across major Meta Ads models, delivering a 1.2% improvement in topline metrics”.

What we’re seeing here is the optimization of some of the most societally significant AI systems in the world – ones which direct billions of eyeballs towards a variety of products and online information – colliding with a greater degree of performance predictability; by developing these scaling laws, Meta has made it easier for it to spend even more compute on making these models even better, by making the investments in them more predictable in terms of the intelligence return on capital investment.

Read more: Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design (arXiv).

Superintelligence could save and extend lives, so we should go for it:

…Pausing or slowing down might make sense at the very end of the exponential, but it’s risky…

Nick Bostrom, an academic who introduced many people to the notion of superintelligence and AI risk, has written a paper laying out the idea that if superintelligence can improve human health, then it’s worth pursuing even if there’s a non-zero chance of it causing the death of the species.

“Yudkowsky and Soares maintain that if anyone builds AGI, everyone dies. One could equally maintain that if nobody builds it, everyone dies”, Bostrom writes in Optimal Timing for Superintelligence. “If the transition to the era of superintelligence goes well, there is tremendous upside both for saving the lives of currently existing individuals and for safeguarding the long-term survival and flourishing of Earth-originating intelligent life. The choice before us, therefore, is not between a risk-free baseline and a risky AI venture. It is between different risky trajectories, each exposing us to a different set of hazards.”

Why we should pursue superintelligence, even with a chance of doom: If you think about all the humans alive today and the different life expectancies they experience – especially those in the developing world – then you’re drawn to the view that every moment you waste in deploying superintelligence, you increase human suffering.

“When we take both sides of the ledger into account, it becomes clear that our individual life expectancy is higher if superintelligence is developed reasonably soon. Moreover, the life we stand to gain would plausibly be of immensely higher quality than the life we risk forfeiting,” Bostrom writes.

Key variables: The key variables here are, of course, the risk of a superintelligence killing us all, and also the rate at which safety research can reduce this chance. Under this view, developing superintelligence becomes a favorable thing to do under most circumstances.

The speed of progress and maturity of AI safety research may have some impact on the timeline: “When the initial risk is low, the optimal strategy is to launch AGI as soon as possible – unless safety progress is exceptionally rapid, in which case a brief delay of a couple of months may be warranted. As the initial risk increases, optimal wait times become longer. But unless the starting risk is very high and safety progress is sluggish, the preferred delay remains modest—typically a single-digit number of years”.

On pausing – and the dangers and benefits thereof: Many people in the AI safety community want to have some kind of pause of AI development to buy more time for AI safety research. Bostrom is quite skeptical that a pause will be effective and outlines some of the undesirable effects it could have:

Too early: If you do it early, people think pauses are ineffective.

Bad regulation: You choke off or delay good things in the future due to bad regulation.

Pause, except for natsec: Very little broad social benefit, but the military with access to powerful AI becomes very scary.

Prolonged danger: The world is exposed to risks from current AI without the defenses afforded by more advanced AI.

Why this matters – pausing may only make sense right at the end, and this is inherently risky: Bostrom eventually arrives at the view that to the extent you want to pause or slow development, it’s best to do this when you have the greatest amount of confidence that a pause would be effective and would contribute to reducing the chance of species death, and that it is not coming too early. This allows for the greatest amount of deliberation about how to roll out a superintelligence without risking an undue pause.

Critics of this view might say it’s akin to recommending someone try to catch a falling knife. If you catch the knife too early you experience a tremendous amount of pain. If you catch the knife too late you’ve missed your chance and gravity conspires with it to cause great harm to whatever is beneath you. You have to time things just right.

Bostrom summarizes his position as: “swift to harbor, slow to berth: move quickly towards AGI capability, and then, as we gain more information about the remaining safety challenges and specifics of the situation, be prepared to possibly slow down and make adjustments as we navigate the critical stages of scaleup and deployment. It is in that final stage that a brief pause could have the greatest benefit.”

Read more: Optimal Timing for Superintelligence (Nick Bostrom, PDF).

Can AI agents independently do basic AI research tasks? AIRS-BENCH says yes:

…And we can expect today’s models to be much better at this than the paper suggests…

Researchers with Meta, the University of Oxford, and University College London have built and released the AI Research Science Benchmark (AIRS-BENCH), a way of testing out how well AI systems can complete contemporary machine learning tasks.

What AIRS-BENCH is made of: AIRS-BENCH tests out how well agents can solve 20 distinct tasks, sourced from 17 recent machine learning papers. The tasks span a variety of technical genres, including: molecules and proteins machine learning, question answering, text extraction and matching, time series, text classification, code, and math.

Some example tasks:

CodeGenerationAPPSPassAt5: Solve coding problems by generating five distinct Python programs for each problem.

CoreferenceResolutionWinograndeAccuracy: Identify which of two possible options a pronoun in a sentence refers to. It uses the Winogrande dataset, which contains sentences with an ambiguous pronoun and two possible answers.

TimeSeriesForecastingRideshareMAE: Perform time series forecasting over the Rideshare dataset, which is part of the Monash Time Series Forecasting Repository.
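For the pass-at-5 style task in the list above, a simple reading of the metric is that a problem counts as solved if any of its five generated programs passes the tests. The sketch below implements that reading; it is an assumption about the scoring rule, not the benchmark's actual harness, and the result data is invented.

```python
from typing import List, Sequence


def solved_at_k(attempt_passed: Sequence[bool], k: int = 5) -> bool:
    """A problem is solved if at least one of its first k attempts passed."""
    return any(attempt_passed[:k])


def pass_at_k(per_problem: List[Sequence[bool]], k: int = 5) -> float:
    """Fraction of problems with at least one passing attempt among k."""
    if not per_problem:
        raise ValueError("need at least one problem")
    solved = sum(1 for attempts in per_problem if solved_at_k(attempts, k))
    return solved / len(per_problem)


# Invented outcomes: one row per problem, one flag per generated program.
results = [
    [False, False, True, False, False],
    [False, False, False, False, False],
    [True, True, False, False, False],
    [False, False, False, False, True],
]
score = pass_at_k(results, k=5)
print(f"pass@5 = {score:.2f}")
```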

Results: Real problems, crappy models: This is a somewhat perplexing benchmark – the tasks are interesting and wrap in a lot of complexity. But the paper only tests out relatively bad models, such as the Code World Model, o3-mini, gpt-oss-20b, gpt-oss-120b, GPT-4o, and Devstral-Small 24B. This is a very funny set of models, and none of them are true frontier ones – one of the paper authors wrote on Twitter “this took some time to get out”, so this could just be an artifact of slow publishing timelines.

In tests, none of the models are on par with the Elo rating of a best-in-class human – but I’m not sure what to make of this till I see results with more powerful models.

Why this matters – models might produce different solutions to humans, and this is a cool way of studying if there’s a ‘scaling law’ here: One way this could be interesting is understanding the different ways models might solve tasks relative to humans. In one example, TextualClassificationSickAccuracy, models had to determine whether a pair of sentences have a relationship involving either entailment, contradiction, or no relationship.

SOTA from the literature is a person fine-tuning RoBERTa on the underlying training set and testing on the test set. By comparison, the best tested AIRS-BENCH agent, GPT-OSS-120B, “produces a two-level stacked ensemble that combines multiple transformer models and a meta-learner. RoBERTa-large and DeBERTa-v3-large are independently fine-tuned on the SICK training set. Each model processes sentence pairs and outputs logits for each class. The training is performed using 5-fold stratified cross-validation, ensuring robust out-of-fold (OOF) predictions and preventing overfitting. The logits from both base models are concatenated to form a feature vector for each example.”
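The quoted recipe is easier to follow with the data flow written out. The sketch below covers only the out-of-fold (OOF) part: stratified 5-fold assignment, two base models trained on the other four folds, and their logits concatenated into one feature vector per example for a meta-learner. The base models are trivial stand-ins that return log class priors instead of fine-tuned transformers, the meta-learner is not trained here, and every name and data point is hypothetical.

```python
import math
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign a fold id to each example so every fold keeps roughly the same class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            fold_of[i] = pos % k
    return fold_of


def train_stub(train_labels, n_classes, smoothing):
    """Stand-in for fine-tuning a base model: its 'logits' are smoothed log class priors."""
    counts = [smoothing] * n_classes
    for y in train_labels:
        counts[y] += 1
    total = sum(counts)
    logits = [math.log(c / total) for c in counts]
    return lambda example: logits  # a real model would look at the sentence pair


def oof_features(examples, labels, n_classes, k=5):
    """Build meta-learner inputs: each example is scored only by models that never saw it."""
    fold_of = stratified_folds(labels, k)
    features = [None] * len(examples)
    for fold in range(k):
        train_y = [y for y, f in zip(labels, fold_of) if f != fold]
        model_a = train_stub(train_y, n_classes, smoothing=1.0)
        model_b = train_stub(train_y, n_classes, smoothing=5.0)
        for i, f in enumerate(fold_of):
            if f == fold:
                # Concatenate the two models' logits into one feature vector.
                features[i] = list(model_a(examples[i])) + list(model_b(examples[i]))
    return features


# Toy data: 30 sentence-pair placeholders over 3 classes.
labels = [i % 3 for i in range(30)]
examples = [f"pair-{i}" for i in range(30)]
features = oof_features(examples, labels, n_classes=3)
print(len(features), len(features[0]))
```

The meta-learner would then be fit on `features` against `labels`; because every feature row comes from models that did not train on that row, the second level is less prone to simply memorizing the base models' training fit.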

This is extremely complicated! But it’s also interesting in that perhaps we can learn something about the progression in agents by looking at how the simplicity of their solutions to tasks might scale with size, where naively I’d expect more powerful models to arrive at simpler solutions. As Blaise Pascal once apocryphally said, “I have only made this letter longer because I have not had the time to make it shorter”.

Read more: AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents (arXiv).

Math researchers see if AI can help solve their private solutions to frontier problems. The answer: Kind of.

…First Proof is a genuinely held out test set…

A group of mathematicians have built First Proof, a math test which sees how well AI systems can solve math problems for which there are no – until February 13th 2026 – published solutions.

What First Proof is: “We share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time,” the authors write. The questions are “drawn from the mathematical fields of algebraic combinatorics, spectral graph theory, algebraic topology, stochastic analysis, symplectic geometry, representation theory, lattices in Lie groups, tensor analysis, and numerical linear algebra, each of which came about naturally in the research process for one of the authors”.

The authors believe First Proof is the first math benchmark “sampled from the true distribution of questions that mathematicians are currently working on”, and that it has the idiosyncratic advantage of secrecy – “each question has been solved by the author(s) of the question with a proof that is roughly five pages or less, but the answers are not yet posted to the internet,” they write, nor have the answers been presented in public talks.

The authors will release the answers on February 13.

Who did it: First Proof was built by researchers with Stanford, Columbia, EPFL, Imperial College, University of Texas at Austin, MathSci.ai, Aarhus University, Yale University, University of California at Berkeley, University of Chicago, and Harvard University.

Today’s AI systems can’t yet do it: Neither GPT 5.2 Pro nor Gemini 3.0 DeepThink can solve FirstProof – yet. “Our tests indicate that – when the system is given one shot to produce the answer – the best publicly available AI systems struggle to answer many of our questions,” they write.

Why this matters – a partial test of creativity: The main reason to care about First Proof is that it is ecologically valid when it comes to sampling frontier human creativity circa January 2026 – these are some frontier scientific problems for which some humans have figured out answers, but have not yet told many other humans about their results. If AI systems are able to do well at this kind of test, it gives us a clue that they can approximate some of the same creative leaps which humans make. I hope the authors behind First Proof do this regularly – perhaps in a maximalist view, most scientific researchers should start publishing the questions they’ve been working on ahead of the results, as this will give us information as to if AI systems can arrive at these same answers.

After First Proof, I imagine the frontier of evaluating AI systems will have to move from solving problems to generating questions about which problems to solve: “Contrary to the popular conception that research is only about finding solutions to well-specified, age-old problems (e.g., Fermat’s Last Theorem), most of the important parts of modern research involve figuring out what the question actually is and developing frameworks within which it can be answered,” the researchers write.