読み込み中…

Latent Space·2026年6月4日 04:27·約11分

非公式 AI の限界を超える - Carina Hong, Axiom Math

#Reasoning #Formal Verification #AGI #Mathematics #Lean

TL;DR

Axiom の Carina Hong は、数学的証明の形式検証が AI の知能を「増幅・蓄積」させる鍵であると主張し、コード生成能力だけでは AGI に至らないボトルネックとして指摘している。

AI深層分析2026年6月4日 09:20

重要/ 5段階

深度40%

キーポイント

Axiom の Putnam 試験での驚異的成績

設立わずか 7 ヶ月のスタートアップ Axiom が、難問で知られる数学試験 Putnam で満点（12/12）を達成し、トップ学生や既存の AI システムを上回る結果を出した。

形式検証による知能の「増幅」と「蓄積」

Carina Hong は、Ramanujan の例を引き合いに出し、直観を形式証明（Lean 等）で裏付けることが思考の詳細化を促し、他者がその成果を積み重ねることで AI の知能が爆発的に向上すると説く。

コード生成能力の限界と AGI のボトルネック

Anthropic が推進するコード生成や加速への投資は重要だが、Carina はこれを AGI への必要条件ではあるが十分条件ではないとし、形式検証がない状態での進歩には根本的な限界があると指摘している。

Verified AI の二つの実装アプローチ

Axiom の「Verified AI」は、学習（トレーニング）時と推論（インフェレンス）時の両方で形式検証を適用するアプローチであり、堅牢な基盤の上に知能を構築することを目的としている。

形式検証による強化学習の強化

統計的な推測に依存する従来の手法（GRPO, RLHF）に対し、Lean 検証器を用いて証明が正しいかを直接確認することで、より強力な報酬信号と高いサンプル効率を実現できる。

Axiom の圧倒的なベンチマーク性能

Verina コード生成ベンチにおいて Axiom は 99%（187/189）の成功率を達成し、OpenAI o3 の 4.9% を大きく上回っており、Lean 証明生成における先行性を示している。

検証に基づく学習の増幅効果

形式化された証明は高品質なトレーニングセットとして蓄積され、将来の推論や訓練がその上に積み重ねられるため、統計的手法では得られない「複利効果」を生み出す。

重要な引用

"Verification to me is about scaling brilliance, compounding brilliance."

"Formally proving things forced Ramanujan to articulate the details in a way that open up new lines of thinking... This is how you 'compound' in math."

"Code arguably pushes the jagged frontier to the point of super intelligence in some domains outside of coding, but there are surprising gaps that Carina believes will bottleneck AI progress."

Instead of relying on best guesses based on statistics (GRPO, RLHF, etc.), you can just verify the proof is correct using a Lean verifier.

Better proofs → better Lean generation → better RL. A stronger signal means higher sample efficiency and higher maximum performance.

Now actually the onus is back on the humans to verify all the outputs. And so, yeah, as that becomes a bottleneck, I think formalizing math and automating verification will become more valuable.

影響分析・編集コメントを表示

影響分析

この記事は、単なるベンチマークスコアの向上を超え、AGI 開発における「形式検証（Formal Verification）」の重要性を再定義する重要な示唆を含んでいます。コード生成能力への過度な依存から脱却し、数学的厳密性を AI の基盤に組み込むことで、より信頼性が高く、自己増幅的な知能を実現しようとする Axiom のアプローチは、業界全体の開発パラダイムシフトを促す可能性があります。

編集コメント

「コード生成が AGI の全てではない」という逆説的な視点と、数学的証明による AI 能力の増幅理論は、現在の主流開発路線に対する重要な警鐘となっています。

2025 年、設立から 7 ヶ月しか経っていないスタートアップである Axiom は、権威ある学部生向け数学試験であるプットナム試験（Putnam exam）の全 12 問を解決しました。制限時間内に 8/12 のスコアを獲得したのです。この 12/12 という得点は、上位の学部生たち（110/120）や結果を発表した最も近い AI システム（DeepSeek 103/120）よりも優れていますが、もし時間制限がなければ人間や他のシステムがどの程度のスコアを出せたかは不明です。しかしながら、プットナム試験はその難易度で伝説的な地位を築いており、通常は中央値の得点が 0 または 1 点に過ぎません。これ単独で見れば、これは AI の功績の一つとしてさほど大したものではありません。カスパロフを破ったディープブルー（Deep Blue）から始まる、人間とのエリート競技における AI システムの一連の成果のうちのひとつに過ぎないからです。

時を 2026 年半ばまで進めると、Claude Code や Codex が世界を席巻しています。2024 年当時、コードと企業向け市場への Anthropic の賭けは、より優れたモデルと巨大な消費規模を持つ OpenAI と比較すると、より現実的なニッチ戦略のように見えました。今日では、アモデイ（Amodei）が画像や動画などをお構いなしにコードを通じた加速に全賭けしている姿が、先見の明があったように思えます。

しかし Anthropic の勢いが拡大する中でも、Axiom の CEO である Carina Hong は、コーディング能力は AGI（人工一般知能）への道における必要不可欠ではあるが不十分なマイルストーンであると捉えています。コードは、コーディング以外のいくつかの領域において、その荒々しいフロンティアをスーパーインテリジェンス（超知能）のレベルまで押し上げる可能性がありますが、Carina はそこに AI の進展をボトルネックにする驚くべきギャップが存在すると信じています。（数学ベンチマークに関する統計データあり）

非公式なボトルネック

「検証済み AI」という言葉は、ブロッコリーを食べたり税金を納めたりするようなもののように聞こえるかもしれませんが、Axiom にとっては全く異なる意味を持ちます。「私にとっての検証とは、卓越性の拡大と、その卓越性を複利化することです」と Carina は語りました。

実際には、彼女が何を言っているのかを理解するのに少し時間がかかりました（最初はマーケティング用語にしか聞こえなかったのですが、ふと納得した瞬間がありました）。Carina はこの点を説明するために、伝説の数学者スリニヴァーサ・ラマヌジャン（「無限を知る男」）を取り上げます。G.H. ハーディがついにラマヌジャンを説得して、頼りきりの（しかし恐るべき）直感に依存するのではなく、定理の形式的証明を行うようにさせたとき、それは彼の自身の能力を向上させたと伝えられています。おそらく、何かを形式的に証明することが、ラマヌジャンに新しい思考の道を開くような形で詳細を言語化することを強いたからでしょう。これが数学における「複利化」の方法です——不安定な基盤ではなく、堅固な基盤の上に築き上げるのです……つまり、公理（Axioms）と呼ばれるものです。

しかし、何かを形式的に証明することは、他の人々が彼の直感から恩恵を受けることも可能にしました：証明とは、直感を伝達し、その直感が正しいと他者を説得するための手段です。これが拡大（より多くの人々が結果を利用する）であり、複利化（人々が彼の業績から学び、それを基盤として構築できる）です。

これが、Axiom が取り組んでいるアプローチを理解させる核心的な洞察です。

検証済み生成

検証済み AI は、トレーニングと推論の 2 つの形で現れます。

ただし、少し立ち止まって考えてみましょう：第一近似として、「形式検証」とは、Lean2 などの言語で綿密に指定された数学的証明を、型チェッカー（TypeScript、C++、または Rust のようなものですが、より高度な機能を持つ）を用いて検証することを意味します。"非公式"の証明（ただし、ほとんどの人がそれを「非公式」とは呼ばないでしょう）を Lean による証明に変換するには、多大な労力が必要です。Axiom 自体も、数学的証明を検証・検証・操作するための対話型 Lean アプリケーションであるツールキット AXLE をオープンソース化し、画期的な成果を発表しています。

これは強化学習（Reinforcement Learning）において非常に（そして極めて）有用であることが想像できます：統計に基づく最善の推測（GRPO、RLHF など）に頼るのではなく、Lean 検証器を用いて証明が正しいことを直接確認できるからです。これは明らかに、はるかに強力な報酬信号であり、コードをコンパイルしてテストする行為（コーディングにおける強化学習で通常行われること）に似ています。

しかし、落とし穴があります：LLM は現在、Lean を用いた証明にはあまり得意ではありません。

ここに Axiom が登場します：彼らは 12/12 の Putnam 結果以外について公式にベンチマーク数値を発表していませんが、Carina によると、Verina コード生成ベンチマークにおいて ProofGen で非常に印象的な 99%（187/189）を達成しています。このベンチマークは、一連の問題に対してコードとその正しさの証明を生成するものです。参考までに、OpenAI o3（現時点で知られている最新の OpenAI の実行結果）はこのベンチマークで 4.9% を記録しました。

限定的なベンチマークに基づくと、年間の IMO（国際数学オリンピック）の目標以外で最前線のラボが現在どのように取り組んでいるかを言うのは難しいですが、Carina によれば、彼らはまだ Lean 証明を直接生成するために訓練しているわけではなく、非形式的な証明に依存しているとのことです。

このギャップを埋めるために最前線のラボの現在のアプローチが有効かどうかは、時間の経過によって明らかになるでしょう。

スケーリングと複利効果

Carina のラマヌジャンのアナロジーは非常に直接的です。より良い証明 → より良い Lean 生成 → より良い強化学習（RL）。より強力なシグナルとは、サンプル効率の向上と最大性能の高まりを意味します。素晴らしいことです。

スケーリングについても明確です。一度 Lean で何かを証明すれば、出力の品質は人間から得られた場合とほぼ同等に高くなるため、非形式的なロールアウト・コーパスでは不可能な形で、高品質なトレーニングセットが成長したことになります。私は Lean 証明を信頼できます。

複利効果についても同様です。これからは、将来の推論も訓練も、これらの証明の上に構築されていくことになります。

一方で、強化学習中に GRPO のような統計的シグナルのみを使用して訓練されたモデルは、形式検証を利用するシステムが享受できるサンプル効率、最大性能、そして複利効果をもたらすコーパスを欠いています。

すべての道は検証へと至る

ブロッコリーや税金の問題はさておき、検証というテーマは私たちの会話の多くで登場しています。物理システムの分野では、Applied Intuition の事例を思い出してください。

「検証可能性は、おそらく現在最も難しい問題だと思います。なぜなら、モデルが向上するにつれて、システム内の欠陥を見つけることがますます困難になるからです。したがって、これらの欠陥を見つけるための適切な評価を行うという問題も、モデルが向上するにつれてさらに難しくなり続けています。」

理論物理学の分野では、アレックス・ルプサスカ（Alex Lupsasca）の言葉を思い出します。

「…今や、ChatGPT に同時に数千問の質問を処理させることができるこの段階に至りました。すると、それらのかなりの割合に対して証明が返ってきます。実際、すべての出力を検証する責任は再び人間に戻ってきました。そして、それがボトルネックとなるにつれて、数学の形式化と検証の自動化がより価値を持つようになると思います。」

実際、科学のための AI と計算のための AI の決定的な違いは検証にあります：科学では、物理実験を実行して仮説を実際にテスト（検証）する必要があります。Radical AI や Lila などの「Lab in the loop」システムは、まさにこの前提に基づいて構築されています（これらのチームとのエピソードを録画しており、近日公開予定です！）。

また、飛行制御、原子力発電所、ペースメーカーといった重要なシステムの形式検証も、それらを動かすソフトウェアとハードウェアが複雑化するにつれて、注目を集める分野となっています。

Carina は、AGI には検証された生成が必要だと強く信じており、「他の可能な未来は存在しない」と断言しています。

生成には高コストがかかるが、検証は安価である

Lean による証明の生成は困難ですが、その正誤を簡単に検証することは可能です。しかし、あなたが作成した証明が、あなたが関心を持つ問題に正確に対応していることをどうやって知るのでしょうか？Carina の言葉で言えば：「指定可能なものはすべて証明可能である。人間は自分が望むものをすべて指定するのが苦手だ。」

私たちは今や仕様のビジネスにいるのでしょうか？Carina の見解を聞くにはエピソードをチェックしてください。また、以下の点についても触れられています：

なぜハードウェア検証がキラーアプリとなるのか

AXLE オープン API と最近リリースされた Discovery ツールキットの詳細

Erdos 事件の顛末

OpenAI GPT-f のディアスポラ（離散）

完全なビデオポッドキャスト

タイムスタンプ：

0:00 イントロダクション：2 億ドルシリーズ A と数学スタートアップの仮説

4:52 検証済み AI：輝きのスケーリング、不備の修正ではない

13:42 Axiom のシステム：Lean データ、強化学習（RL）、そしてプットナム満点

22:12 数学的発見 — 予想以前に

25:12 ライスの定理、不完全性、および実用的な限界

30:42 証明付きコード — Verina ベンチマーク

37:57 証明木、コンテキストウィンドウ、そしてスケーリングの限界

43:57 市場、参入障壁（モート）、そしてビジネスケース（16 億ドルの評価額）

55:27 個人的なルーツ：オックスフォード大学、UCL ガツビー研究所、スタンフォード法科大学院

1:00:57 Erdos 論争と検索の難しさ

1:06:02 数学における AlphaZero、自己改善

1:08:47 スタートアップの優位性と OpenAI GPTF のスレッド

1:13:17 Axle API — スケーラブルな Lean 向けのオープンインフラストラクチャ

1:20:47 コラボレーション、ポリマット（多面手）、そしてボトルネックとしての人間の注意

1:22:21 創業ストーリー：執念、法科大学院、そして Julie Zhuo

1:26:17 より大きなビジョン — AGI、科学、そして転移学習

1:35:02 ボトルネック、分断、そして分野の未来

1 私は実はブロッコリーが大好きですが、同時にテスト駆動開発（Test Driven Development）を強く信じているので ¯\(ツ)/¯

2 形式検証には、モデル検査（TLA+、SPIN）、SMT ベースのツール（Dafny、F*、Why3）、およびリファインメント型システム（Liquid Haskell）も含まれます。これらは多くの場合、ユーザーの視点からは「証明のタイプチェック」のように見えませんが、その背後には類似した論理的な核が存在します。また、純粋な数学だけでなく、ソフトウェアとハードウェアの正しさにも適用されます。

3 これは控えめな表現です。形式化が非常に困難であるため、ほとんどの定理は非公式のまま残されています。最も重要な証明を形式化しようとする多大な努力が払われてきましたが、その結果は様々でした。

4 証明が LLM の分布内にあるという点から、少し低くなる可能性もあると主張する人もいるかもしれません。

原文を表示

In 2025, seven-month-old startup Axiom solved all 12 of the problems Putnam exam (scoring 8/12 in the time limit) a prestigious undergraduate math exam. The 12/12 score is better than the top undergraduates (110/120) and the closest AI system that reported a result (DeepSeek 103/120), although it is unclear what the people and other systems would have scored with more time. Nonetheless, the Putnam exam is legendary for its difficulty, with the median score typically being 0 or 1 points. Taken by itself, this seems like a minor feather in the cap of AI; one of a long series of accomplishments by AI systems in elite competitions with humans, starting with Deep Blue beating Kasparov.

Fast forward to mid-2026, and Claude Code and Codex are setting the world on fire. In 2024 Anthropic’s bet on code and enterprise looked like a more pragmatic niche play vs. OpenAI’s better models and massive consume scale. Today, Amodei’s all in bet on acceleration via code (images and video be damned) seems prescient.

Despite Anthropic’s growing momentum, however, Axiom CEO Carina Hong sees coding ability as a necessary but not sufficient milestone on the path to AGI. Code arguably pushes the jagged frontier to the point of super intelligence in some domains outside of coding, but there are surprising gaps (link) that Carina believes will bottleneck AI progress. (Stats on math benchmarks).

The informal bottleneck

“Verified AI” sounds like eating broccoli1 and paying taxes, but to Axiom it means something very different. “Verification to me is about scaling brilliance, compounding brilliance,” Carina told us.

It actually took a while for me to understand what she means by this (sounded like marketing-speak until it clicked). Carina brings up the legendary mathematician Srinivasa Ramanujan (“The Man who knew Infinity”) to illustrate this point. When G.H. Hardy finally persuaded Ramanujan to formally prove theorems instead of relying on his (formidable) intuition, it reportedly improved his own capabilities. This is presumably because formally proving things forced Ramanujan to articulate the details in a way that open up new lines of thinking, etc. This is how you “compound” in math — building on solid rather than shaky foundations… also known as Axioms.

But formally proving things also allowed others to benefit from his intuition: the proofs are way of communicating an intuition and persuading others that the intuition is correct. This is scaling (more people use the result) and compounding (people can learn from and build on his work).

This is the core insight that lets us understand the approach Axiom is taking.

Verified Generation

There are two ways that Verified AI shows up: in training and in inference.

But a quick detour: to a first approximation, “Formal Verification” means using type checkers (like for TypeScript, C++ or Rust, but more capable) to verify mathematical proofs that are meticulously specified using a language like Lean2. It takes a lot of work to translate an “informal” proof (albeit one that most people would not remotely call “informal”) in to a Lean proof3. Axiom themselves have open sourced groundbreaking work with AXLE - their toolkit of interactive Lean applications for exploring, validating, and manipulating mathematical proofs.

You can imagine how this would be (very) useful during Reinforcement Learning: instead of relying on best guesses based on statistics (GRPO, RLHF, etc.), you can just verify the proof is correct using a Lean verifier. This is obviously a much stronger reward signal, akin to compiling code and testing it (which is what is typically done with RL on coding).

The catch: LLM are not (currently) very good at proving things with Lean.

Enter Axiom: While they have not officially reported benchmark numbers besides the 12/12 Putnam result, Carina reports that they have achieved a very impressive 99% (187/189) ProofGen on the Verina codegen benchmark. This benchmark is to generate code and proof of correctness for a series of problems. For context, OpenAI o3 (the last known OpenAI run) achieved 4.9% on this benchmark.

Based on the sparse benchmarking, it’s hard to say how the frontier labs are currently doing outside the annual IMO milestones, but Carina suggests that they still are not training to generate Lean proofs directly, rather relying on informal proofs.

Time will tell if the frontier labs’ current approaches will close this gap.

Scaling and compounding

Carina’s Ramanujan analogy is pretty direct. Better proofs → better Lean generation → better RL. A stronger signal means higher sample efficiency and higher maximum performance. Great!

Scaling is pretty clear too: once I have proved something in Lean, the quality of the output is basically4 as high as if it came from a human, so my high quality training set has grown in a way that an informal rollout corpus cannot. I can trust my Lean proofs.

Compounding is also clear: now all of future inference and training can build upon those proofs.

On the other hand, a model trained only using statistical signals like GRPO during RL lacks the sample efficiency, maximum performance and compounding corpus that a system that uses formal verification benefits from.

All roads lead to verification

Broccoli and taxes notwithstanding, verification has shown up in a lot of our conversations. In the domain of physical systems, recall Applied Intuition:

“I think [verifiability] is probably the hardest problem right now, because the as the models get better, it can be harder and harder to find the faults on the system. And so the problem of doing proper eval to find those faults, that problem also keeps getting harder as the models get better.”

In theoretical physics, we recall Alex Lupsasca:

“…now that we’re in this regime where you can just get ChatGPT to tackle thousands of questions at the same time, it will return proofs for a significant fraction of them. Now actually the onus is back on the humans to verify all the outputs. And so, yeah, as that becomes a bottleneck, I think formalizing math and automating verification will become more valuable.”

Verification is, in fact, the key differences between AI for science and AI for computation: in science you to have to actually test (verify) your hypothesis by performing physical experiments. Lab in the loop systems like Radical AI and Lila build around exactly this premise (we have recorded episodes with both of these teams and will release them soon!)

And yes, formally verifying critical systems such as flight control, nuclear power plants and pacemakers is a growing focus as the software and hardware that run them becomes more complex.

Carina believes so strongly that AGI requires verified generation that she makes the unqualified claim that “We do not believe there is any other possible future.”

Expensive to produce, cheap to verify

Lean proofs are hard generate, but they can be easily shown to be correct or incorrect. But how do you know that the proof you created maps correctly to the problem you care about? As Carina puts it: “Anything that can be specified can be proven. Humans are bad at specifying everything we want.”

Are we now in the specification business? Check out the episode to hear Carina’s take, as well as:

Why hardware verification is a killer app

Details on the AXLE open API and recently released Discovery toolkit

The Erdos debacle

The OpenAI GPT-f diaspora

Full Video Podcast

Timestamps:

0:00 Intro: The $200M Series A and the Math Startup Thesis

4:52 Verified AI: Scaling Brilliance, Not Fixing Lousiness

13:42 Axiom’s System: Lean Data, RL, and the Putnam Perfect Score

22:12 Mathematical Discovery — Before the Conjecture

25:12 Rice’s Theorem, Incompleteness, and Practical Limits

30:42 Code With Proof — The Verina Benchmark

37:57 Proof Trees, Context Windows, and Scaling Limits

43:57 Markets, Moat, and the Business Case ($1.6B valuation)

55:27 Personal Origin Story: Oxford, UCL Gatsby, Stanford Law

1:00:57 The Erdos Controversy and the Difficulty of Search

1:06:02 AlphaZero for Math, Self-Improvement

1:08:47 Startup Advantage and the OpenAI GPTF Thread

1:13:17 Axle API — Open Infrastructure for Lean at Scale

1:20:47 Collaboration, Polymath, and Human Attention as the Bottleneck

1:22:21 Founding Story — Obsession, Law School, and Julie Zhuo

1:26:17 The Bigger Vision — AGI, Science, and Transfer Learning

1:35:02 Bottlenecks, Fragmentation, and the Field’s Future

1I actually love broccoli, but then again, I also believe strongly in Test Driven Development, so ¯\(ツ)/¯

2Formal verification also includes model checking (TLA+, SPIN), SMT-based tools (Dafny, F*, Why3), and refinement-type systems (Liquid Haskell) — many of which don’t look much like “type checking a proof” from the user’s perspective even when there’s a similar logical core underneath. It also gets applied to software and hardware correctness, not only pure mathematics.

3This is an understatement. Most theorems remain informal because formalization is so hard to do. There has been a great deal of effort to formalize the most important proofs, with mixed results.

4One might argue that its a bit lower because the proof is in distribution for the LLM.

この記事をシェア

TLDR AI重要度42026年7月17日 09:00

Schema、物理AI向けモデルハネス発表

MarkTechPost2026年7月20日 10:56

コミュニティが MiniCPM5-1B を微調整し、657MB の思考モデルを公開

MarkTechPost重要度42026年7月20日 07:20

Feyn AI が DB 事前検査型 Text-to-SQL モデル「SQRL」発表

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む