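Our First Proof submissions
OpenAI has published the proof attempts its AI model produced for the math challenge First Proof, which tests research-grade reasoning on expert-level problems.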
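Key points
OpenAI ran its own model on First Proof, a challenge of research-level math problems, and submitted proof attempts that are likely correct for at least 5 of the 10 problems.
The post stresses the value of testing what conventional benchmarks can miss in evaluating AI: sustaining long chains of reasoning, choosing the right abstractions, and building arguments that survive expert scrutiny.
A new model now in training aims for more rigorous thinking, continuous reasoning over many hours, and high confidence in its conclusions; First Proof served as an ideal testbed for it.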
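Impact analysis
This result is an important milestone showing that AI's ability to build complex arguments in highly specialized domains is advancing rapidly. It may change how capabilities are evaluated at the research frontier, and it raises expectations for AI's scientific reasoning.
Editorial comment
The idea of testing "the essential difficulty of research that benchmarks cannot measure" is the standout point. The announcement suggests a paradigm shift in how AI capabilities are evaluated and looks likely to shape the direction of future research and development.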
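OpenAI took on First Proof, a math challenge that tests whether AI can produce checkable proofs on specialized mathematical problems, and published its results. First Proof consists of 10 research-level problems; unlike short-answer or competition-style math, they require building end-to-end arguments in specialized domains. The problem authors are leading experts in their fields, and at least a couple of the problems were open for years before the authors found solutions. As a gauge of difficulty, an academic department with substantial overlap with the subject areas could conceivably solve many of the problems in one week.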
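OpenAI ran an internal model on all 10 problems and published its proof attempts on February 14, 2026. Based on expert feedback, it believes at least five attempts (problems 4, 5, 6, 9, and 10) have a high chance of being correct, and several others remain under review. It initially believed its attempt for problem 2 was also correct, but after the official First Proof commentary and further community analysis it concluded that the attempt is incorrect. All ten proof attempts are available in a preprint, which now includes an appendix with prompt patterns and examples meant to simulate the team's manual interactions with the models.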
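The company says that novel frontier research is perhaps the most important way to evaluate the capabilities of next-generation AI models. Existing benchmarks are useful, but they can miss some of the hardest parts of research: sustaining long chains of reasoning, choosing the right abstractions, handling ambiguity in problem statements, and producing arguments that survive expert scrutiny. Frontier challenges like First Proof help stress-test those capabilities in settings where correctness is nontrivial to verify and the failure modes are informative.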
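The new model now in training is focused on more rigorous thinking, with the goal that it can think continuously for many hours and remain highly confident in its conclusions. First Proof proved a fitting testbed: early in training the model had already solved two problems (9 and 10), and as training progressed it went on to solve, in OpenAI's estimation, at least three more. The solutions to problems 6 and 4 were especially welcome because those problems come from fields familiar to many on the team. OpenAI presents the effort as evidence of progress in AI reasoning.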
Original text
Our First Proof submissions | OpenAI
We’re sharing our proof attempts for First Proof, a math challenge testing if AI can produce checkable proofs on domain-specific problems.
We ran an internal model on all 10 First Proof problems, a research-level math challenge designed to test whether AI systems can produce correct, checkable proof attempts. Unlike short-answer or competition-style math, these problems require building end-to-end arguments in specialized domains, and correctness is hard to establish without expert review. The authors of the First Proof problems are leading experts in their respective fields, and at least a couple of the problems were open for years before the authors found solutions. An academic department that has substantial overlap with the subject areas could conceivably solve many of the problems in one week.
We shared our proof attempts on Saturday, February 14, 2026 at 12:00 AM PT. Based on feedback from experts, we believe at least five of the model’s proof attempts (problems 4, 5, 6, 9, and 10) have a high chance of being correct, and several others remain under review. We initially believed our attempt for problem 2 was likely correct. Based on the official First Proof commentary and further community analysis, we now believe it is incorrect. We’re grateful for the engagement and look forward to continued review. Our full set of proof attempts can be found here. The preprint includes all ten proof attempts, plus a newly added appendix with prompt patterns and examples that aim to simulate our manual interactions with the models during the process.
We believe novel frontier research is perhaps the most important way to evaluate capabilities of next generation AI models. Benchmarks are useful, but they can miss some of the hardest parts of research: sustaining long chains of reasoning, choosing the right abstractions, handling ambiguity in problem statements, and producing arguments that survive expert scrutiny. Frontier challenges like First Proof help us stress-test those capabilities in settings where correctness is nontrivial to verify and the failure modes are informative.
“We’re currently training a new model for which a primary focus is increasing the level of rigor in its thinking, with the goal that the model can think continuously for many hours and remain highly confident in its conclusions. When the First Proof problems were announced, it seemed like the perfect testbed, so over the weekend I tried it out. Already it was able to solve two of the problems (#9 and #10). As it trained, it became increasingly capable, eventually solving, in our estimation, at least three more. We were particularly pleased when it solved #6 and then, two days later, #4, as those problems were from fields familiar to many of us. It’s pretty incredible to watch a model get tangibly smarter day by day.”
– James R. Lee (OpenAI Researcher, Reasoning)
We ran the model with limited human supervision. When prompting versions of the model along training, we sometimes suggested retrying strategies that appeared fruitful in earlier attempts. For some attempts, we asked the model to expand or clarify parts of a proof after receiving expert feedback, to make the reasoning easier to verify. We also facilitated a back-and-forth between this model and ChatGPT for verification, formatting, and style. For some problems, we present the best of a few attempts, selected by human judgment. This was a fast sprint, and our process was not as clean as we would like in a properly controlled evaluation. We look forward to discussions with the First Proof organizers about a more rigorous experiment and evaluation framework for future iterations.
This work builds on earlier results from frontier reasoning models in math and science. In July 2025, we reached gold medal-level performance on the International Mathematical Olympiad with a general-purpose reasoning model (35/42 points). In November 2025, we shared “Early experiments in accelerating science with GPT‑5”, a set of case studies where GPT‑5 helped researchers make concrete progress across math, physics, biology, and other fields, along with the limitations we observed. And most recently, we reported a physics collaboration where GPT‑5.2 proposed a candidate expression for a gluon-amplitude formula that was then formally proved by an internal model and verified by the authors.
We look forward to deeper engagement with the community on how to evaluate research-grade reasoning, including expert feedback on these attempts, and we’re excited to make these new capabilities available in future public models.