Synced Review·2025年5月1日 00:46·約5分で読める

DeepSeek、再帰的証明探索と新ベンチマークでニューラル定理証明を進展させる「DeepSeek-Prover-V2」を発表

#DeepSeek-Prover-V2 #Neural Theorem Proving #Lean 4 #Reinforcement Learning #Mathematical Reasoning

TL;DR

DeepSeek は Lean 4 環境向けに再帰的証明探索と新しいベンチマークを特徴とする大規模モデル DeepSeek-Prover-V2 を公開し、数学的推論能力において新記録を達成した。

AI深層分析2026年5月3日 02:11

最重要/ 5段階

深度40%

キーポイント

独自のコールドスタート生成手法

強力な V3 モデルを用いて複雑な定理をサブゴールに分解・形式化し、7B モデルで証明を検証することで、非公式推論と厳密な形式証明を組み合わせた高品質な学習データを合成している。

強化学習による推論能力の向上

合成データで微調整した後に、正誤フィードバックを用いた強化学習を実施し、直感的な数学的推論と厳密な形式証明構築のギャップを埋めることに成功している。

671B パラメータモデルによる SOTA 達成

DeepSeek-Prover-V2-671B は MiniF2F-test で 88.9%、PutnamBench で 49/658 の問題解決率を記録し、ニューラル定理証明における最高性能を確立した。

新ベンチマーク ProverBench の導入

数学的推論能力を評価するための新しいベンチマーク「ProverBench」が同時に発表され、今後の研究開発の基準となる。

DeepSeek-Prover-V2 の性能とモデル構成

671B パラメータのモデルは MiniF2F-test で 88.9%、PutnamBench で 49/658 問題を解決する SOTA を達成し、7B モデルも 32K トークンのコンテキスト長をサポートしています。

新しい評価ベンチマーク ProverBench の導入

AIME 問題や教科書例題を含む 325 問からなる新ベンチが公開され、高校レベルの競技数学から学部レベルの基礎数学までを網羅的に評価可能になりました。

再帰的証明探索パイプラインによる技術的進展

再帰的証明検索手法と新しいベンチマークの組み合わせにより、形式数学分野における AI システムの開発と評価がさらに高度化しています。

影響分析・編集コメントを表示

影響分析

この発表は、AI が単なる知識の再生成ではなく、厳密な論理構造を持つ数学的証明を生成・検証できる能力において飛躍的な進歩を遂げたことを示しています。特に、大規模モデルと小規模モデルを組み合わせたデータ合成手法は、高品質な専門領域データの不足という課題に対する実用的な解決策として、今後の AI 研究開発の重要な指針となるでしょう。

編集コメント

数学的推論能力における SOTA 達成は、AI が学術研究や形式検証の現場で実用化されるための重要なマイルストーンです。特にデータ生成プロセスの自動化は、専門分野における AI 応用のスケーラビリティを高める画期的なアプローチと言えます。

DeepSeek AI は、Lean 4 環境における形式定理証明のために特別に設計された画期的なオープンソース大規模言語モデル「DeepSeek-Prover-V2」のリリースを発表しました。この最新バージョンは、DeepSeek-V3 の力を活用して独自の高品質な初期化データを生成する革新的な再帰的定理証明パイプラインを導入することで、先行研究をさらに発展させたものです。その結果、ニューラル定理証明において最先端のパフォーマンスを達成し、数学的推論能力を評価するための新たなベンチマーク「ProverBench」も同時に導入されました。

先駆的なコールドスタート推論データ生成

DeepSeek-Prover-V2 の重要な革新点は、独自の冷間開始（コールドスタート）トレーニング手順にあります。このプロセスは、強力な DeepSeek-V3 モデルに対して複雑な数学的定理をより管理可能な一連のサブゴールに分解するようプロンプトを与えることから始まります。同時に、DeepSeek-V3 はこれらの高レベルな証明ステップを Lean 4 で形式化し、構造化された一連のサブ問題を実際に作成します。

各サブゴールに対する計算集約型の証明探索を処理するため、研究者たちはパラメータ数が 70 億の smaller モデルを採用しました。困難な問題の分解されたすべてのステップが成功裏に証明されると、完全な段階ごとの形式証明は、DeepSeek-V3 の対応する思考連鎖（chain-of-thought）推論とペアになります。この巧妙なアプローチにより、モデルは非公式で高レベルな数学的推論と厳密な形式証明の両方を統合した合成データセットから学習でき、その後の強化学習に対する強力なコールドスタートを提供します。

推論強化のための強化学習

合成されたコールドスタートデータを基盤として、DeepSeek チームは 70 億パラメータの証明モデルがエンドツーエンドで解決できなかったが、すべてのサブゴールが成功裏に処理された困難な問題の選定を行いました。これらのサブゴールの形式証明を組み合わせることで、元の問題に対する完全な証明が構築されます。この形式証明は、リーマ分解（lemma decomposition）を示す DeepSeek-V3 の思考連鎖と結びつけられ、非公式推論に続く形式化という統一されたトレーニング例となります。

その後、証明モデルはこの合成データでファインチューニングされ、強化学習段階へと進みます。この段階では、正解か不正解かの二値フィードバックを報酬信号として利用し、非公式な数学的直感と形式証明の精密な構築との間のギャップを埋めるモデルの能力をさらに洗練させます。

最先端のパフォーマンス

この革新的なトレーニングプロセスの集大成が、6710億パラメータを備えた DeepSeek-Prover-V2–671B です。このモデルは顕著な成果を達成し、ニューラル定理証明において最先端の性能を示しました。MiniF2F-test においては驚異的な 88.9% のパス率を記録し、PutnamBench から抽出された 658 問のうち 49 問を成功裡に解決しています。DeepSeek-Prover-V2 が miniF2F データセットに対して生成した証明は公開されており、ダウンロードしてさらなる検証や分析を行うことが可能です。

ProverBench の導入：評価の新たな基準

モデルの公開に加え、DeepSeek AI は ProverBench と呼ばれる新しいベンチマークデータセットを導入しました。これは 325 問の問題から構成されており、異なる難易度レベルにおける数学的推論能力をより包括的に評価するために設計されています。

ProverBench には、最近の AIME（American Invitational Mathematics Examination：米国招待数学試験）競技会（AIME 24 および 25）から形式化された 15 問が含まれており、高校レベルの競技会に匹敵する本格的な課題を提供します。残りの 310 問は、精選された教科書の例題や教育用チュートリアルから抽出されており、多様な分野にわたる形式化された数学問題の教育的にも妥当で多様性に富んだコレクションとなっています。

ProverBench は、困難な競技問題と基礎的な学部レベルの数学の両方において、ニューラル定理証明器をより包括的に評価することを目的としています。

利用可能性

DeepSeek AI は、異なる計算リソースに対応するため、DeepSeek-Prover-V2 を 7B パラメータモデルと、より大規模な 671B パラメータモデルの 2 つのサイズでリリースします。DeepSeek-Prover-V2–671B は、堅牢な基盤である DeepSeek-V3-Base に基づいて構築されています。一方、小規模な DeepSeek-Prover-V2–7B は DeepSeek-Prover-V1.5-Base を基盤とし、最大 32K トークンの拡張されたコンテキスト長を特徴としており、より長く複雑な推論シーケンスの処理を可能にします。

DeepSeek-Prover-V2 のリリースと ProverBench の導入は、ニューラル定理証明の分野における重要な前進を示しています。再帰的証明探索パイプラインを活用し、困難な新ベンチマークを導入することで、DeepSeek AI はコミュニティに対し、形式数学のためのより洗練され能力の高い AI システムの開発と評価を可能にしています。

リンク：https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B

本記事「DeepSeek Unveils DeepSeek-Prover-V2: Advancing Neural Theorem Proving with Recursive Proof Search and a New Benchmark」は、Synced に最初に掲載されました。

原文を表示

DeepSeek AI has announced the release of DeepSeek-Prover-V2, a groundbreaking open-source large language model specifically designed for formal theorem proving within the Lean 4 environment. This latest iteration builds upon previous work by introducing an innovative recursive theorem-proving pipeline, leveraging the power of DeepSeek-V3 to generate its own high-quality initialization data. The resulting model achieves state-of-the-art performance in neural theorem proving and is accompanied by the introduction of ProverBench, a new benchmark for evaluating mathematical reasoning capabilities.

Pioneering Cold-Start Reasoning Data Generation

A key innovation of DeepSeek-Prover-V2 lies in its unique cold-start training procedure. This process begins by prompting the powerful DeepSeek-V3 model to decompose complex mathematical theorems into a series of more manageable subgoals. Simultaneously, DeepSeek-V3 formalizes these high-level proof steps in Lean 4, effectively creating a structured sequence of sub-problems.

To handle the computationally intensive proof search for each subgoal, the researchers employed a smaller 7B parameter model. Once all the decomposed steps of a challenging problem are successfully proven, the complete step-by-step formal proof is paired with DeepSeek-V3’s corresponding chain-of-thought reasoning. This ingenious approach allows the model to learn from a synthesized dataset that integrates both informal, high-level mathematical reasoning and rigorous formal proofs, providing a strong cold start for subsequent reinforcement learning.

Reinforcement Learning for Enhanced Reasoning

Building upon the synthetic cold-start data, the DeepSeek team curated a selection of challenging problems that the 7B prover model couldn’t solve end-to-end, but for which all subgoals had been successfully addressed. By combining the formal proofs of these subgoals, a complete proof for the original problem is constructed. This formal proof is then linked with DeepSeek-V3’s chain-of-thought outlining the lemma decomposition, creating a unified training example of informal reasoning followed by formalization.

The prover model is then fine-tuned on this synthetic data, followed by a reinforcement learning stage. This stage utilizes binary correct-or-incorrect feedback as the reward signal, further refining the model’s ability to bridge the gap between informal mathematical intuition and the precise construction of formal proofs.

State-of-the-Art Performance

The culmination of this innovative training process is DeepSeek-Prover-V2–671B, a model boasting 671 billion parameters. This model has achieved remarkable results, demonstrating state-of-the-art performance in neural theorem proving. It reached an impressive 88.9% pass ratio on the MiniF2F-test and successfully solved 49 out of 658 problems from PutnamBench. The proofs generated by DeepSeek-Prover-V2 for the miniF2F dataset are publicly available for download, allowing for further scrutiny and analysis.

Introducing ProverBench: A New Standard for Evaluation

In addition to the model release, DeepSeek AI has introduced ProverBench, a new benchmark dataset comprising 325 problems. This benchmark is designed to offer a more comprehensive evaluation of mathematical reasoning capabilities across different levels of difficulty.

ProverBench includes 15 problems formalized from recent AIME (American Invitational Mathematics Examination) competitions (AIME 24 and 25), providing authentic challenges at the high-school competition level. The remaining 310 problems are drawn from curated textbook examples and educational tutorials, offering a diverse and pedagogically sound collection of formalized mathematical problems spanning various areas:

ProverBench aims to facilitate a more thorough assessment of neural theorem provers across both challenging competition problems and fundamental undergraduate-level mathematics.

Availability

DeepSeek AI is releasing DeepSeek-Prover-V2 in two model sizes to cater to different computational resources: a 7B parameter model and the larger 671B parameter model. DeepSeek-Prover-V2–671B is built upon the robust foundation of DeepSeek-V3-Base. The smaller DeepSeek-Prover-V2–7B is built upon DeepSeek-Prover-V1.5-Base and features an extended context length of up to 32K tokens, allowing it to process longer and more complex reasoning sequences.

The release of DeepSeek-Prover-V2 and the introduction of ProverBench mark a significant step forward in the field of neural theorem proving. By leveraging a recursive proof search pipeline and introducing a challenging new benchmark, DeepSeek AI is empowering the community to develop and evaluate more sophisticated and capable AI systems for formal mathematics.

Link：https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B

The post DeepSeek Unveils DeepSeek-Prover-V2: Advancing Neural Theorem Proving with Recursive Proof Search and a New Benchmark first appeared on Synced.

この記事をシェア

Synced Review★42025年8月14日 15:31

どのエージェントがタスク失敗を引き起こし、いつか？PSUとDukeの研究者はLLMマルチエージェントシステムの自動失敗帰属を探究

ペン州立大学とDuke大学の研究者らは、大規模言語モデル（LLM）を用いたマルチエージェントシステムにおいて、タスク失敗の原因となる特定のエージェントと発生時期を自動で特定する手法を開発した。

Synced Review★42025年6月24日 18:17

ByteDance、自律ロボットナビゲーション向け二重モデル「Astra」を発表

ByteDanceは、複雑な屋内環境での自律ロボットナビゲーション課題を解決するため、「Astra」を開発した。これは「現在地」「目的地」「経路」の3つの基本質問に答える二重モデルアーキテクチャであり、従来の手法の限界を克服する革新的な技術である。

Synced Review★42025年6月16日 21:58

MIT研究者が「SEAL」を発表：自己改善型AIへの新たな一歩

MITは「SEAL（Self-Adapting LLMs）」という新フレームワークを発表した。これは大規模言語モデル（LLM）が自身の重みを更新できる仕組みであり、自己進化型AIの実現に向けた重要な進展と見なされている。

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む