MIT ML News · April 25, 2026, 02:00

MIT researchers build the world's largest collection of Olympiad-level math problems and release it publicly

#MathematicalReasoning #LargeScaleDatasets #LLMEvaluation #OpenSource #MIT CSAIL

TL;DR

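MathNet, a collection of math Olympiad problems developed jointly by MIT, KAUST, and HUMAIN, releases more than 30,000 high-quality proof-based problems and is positioned as a new reference point for evaluating and training AI mathematical reasoning.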

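Key points

The world's largest math dataset of its kind

MIT, KAUST, and HUMAIN jointly collected more than 30,000 proof-based math problems and solutions, five times the size of the next-largest comparable dataset.

Global diversity and quality

Built from official problem booklets spanning 47 countries, 17 languages, and 143 competitions, it addresses the bias of earlier datasets toward the United States and China.

Expert solutions with multiple approaches

Solutions come from official competitions rather than community forums, and their detailed write-ups, often covering several approaches, support training AI reasoning.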

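Impact analysis

As improving mathematical reasoning becomes the next focus of LLM competition, MathNet stands to have a significant influence on the field as both an evaluation benchmark and a source of training material. By publishing the varied solution processes from official competitions, it lays groundwork for AI to learn why a given solution path was chosen, and it could accelerate open innovation in the research community.

Editorial comment

Addressing the bottleneck in mathematical reasoning requires the solution process itself, and releasing the diverse solution methods of official competitions makes this effort an important milestone for LLM research. Benchmark standardization and spillover into education can be expected to follow.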

Every year, the countries competing in the International Mathematical Olympiad (IMO) arrive with a booklet of their best, most original problems. Those booklets get shared among delegations, then quietly disappear. No one had ever collected them systematically, cleaned them, and made them available, not for AI researchers testing the limits of mathematical reasoning, and not for the students around the world training for these competitions largely on their own.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), King Abdullah University of Science and Technology (KAUST), and the company HUMAIN have now done exactly that.

MathNet is the largest high-quality dataset of proof-based math problems ever created. Comprising more than 30,000 expert-authored problems and solutions spanning 47 countries, 17 languages, and 143 competitions, it is five times larger than the next-biggest dataset of its kind. The work will be presented at the International Conference on Learning Representations (ICLR) in Brazil later this month.

What makes MathNet different is not only its size, but its breadth. Previous Olympiad-level datasets draw almost exclusively from competitions in the United States and China. MathNet spans dozens of countries across six continents, covers 17 languages, includes both text- and image-based problems and solutions, and spans four decades of competition mathematics. The goal is to capture the full range of mathematical perspectives and problem-solving traditions that exist across the global math community, not just the most visible ones.

"Every country brings a booklet of its most novel and most creative problems," says Shaden Alshammari, an MIT PhD student and lead author on the paper. "They share the booklets with each other, but no one had made the effort to collect them, clean them, and upload them online."

Building MathNet required tracking down 1,595 PDF volumes totaling more than 25,000 pages, spanning digital documents and decades-old scans in more than a dozen languages. A significant portion of that archive came from an unlikely source: Navid Safaei, a longtime IMO community figure and co-author who had been collecting and scanning those booklets by hand since 2006. His personal archive formed much of the backbone of the dataset.

The sourcing matters as much as the scale. Where most existing math datasets pull problems from community forums like Art of Problem Solving (AoPS), MathNet draws exclusively from official national competition booklets. The solutions in those booklets are expert-written and peer-reviewed, and they often run to multiple pages, with authors walking through several approaches to the same problem. That depth gives AI models a far richer signal for learning mathematical reasoning than the shorter, informal solutions typical of community-sourced datasets. It also means the dataset is genuinely useful for students: Anyone preparing for the IMO or a national competition now has access to a centralized, searchable collection of high-quality problems and worked solutions from traditions around the world.

"I remember so many students for whom it was an individual effort. No one in their country was training them for this kind of competition," says Alshammari, who competed in the IMO as a student herself. "We hope this gives them a centralized place with high-quality problems and solutions to learn from."

The team has deep roots in the IMO community. Sultan Albarakati, a co-author, currently serves on the IMO board, and the researchers are working to share the dataset with the IMO foundation directly. To validate the dataset, they assembled a grading group of more than 30 human evaluators from countries including Armenia, Russia, Ukraine, Vietnam, and Poland, who coordinated to verify thousands of solutions.

"The MathNet database has the potential to be an excellent resource for both students and leaders seeking new problems to work on or looking for the solution to a difficult question," says Tanish Patil, deputy leader of Switzerland's IMO. "Whilst other archives of Olympiad problems do exist (notably, the Contest Collections forums on AoPS), these resources lack standardized formatting system, verified solutions, and important problem metadata that topics and theory require. It will also be interesting to see how this dataset is used to improve the performance of reasoning models, and if we will soon be able to reliably answer an important issue when creating novel Olympiad questions: determining if a problem is truly original."

MathNet also functions as a rigorous benchmark for AI performance, and the results reveal a more complicated picture than recent headlines about AI math prowess might suggest. Frontier models have made extraordinary progress: Some have reportedly achieved gold-medal performance at the IMO, and on standard benchmarks they now solve problems that would stump most humans. But MathNet shows that progress is uneven. Even GPT-5, the top-performing model tested, averaged around 69.3 percent on MathNet's main benchmark of 6,400 problems, failing nearly one in three Olympiad-level problems. And when problems include figures, performance drops significantly across the board, exposing visual reasoning as a consistent weak point for even the most capable models.

Several open-source models scored 0 percent on Mongolian-language problems, highlighting another dimension where current AI systems fall short despite their overall strength.

"GPT models are equally good in English and other languages," Alshammari says. "But many of the open-source models fail completely at less-common languages, such as Mongolian."

The diversity of MathNet is also designed to address a deeper limitation in how AI models learn mathematics. When training data skews toward English and Chinese problems, models absorb a narrow slice of mathematical culture. A Romanian combinatorics problem or a Brazilian number theory problem may approach the same underlying concept from a completely different angle. Exposure to that range, the researchers argue, makes both humans and AI systems better mathematical thinkers.

Beyond problem-solving, MathNet introduces a retrieval benchmark that asks whether models can recognize when two problems share the same underlying mathematical structure, a capability that matters both for AI development and for the math community itself. Near-duplicate problems have appeared in real IMO exams over the years because finding mathematical equivalences across different notations, languages, and formats is genuinely hard, even for expert human committees. Testing eight state-of-the-art embedding models, the researchers found that even the strongest identified the correct match only about 5 percent of the time on the first try, with models frequently ranking structurally unrelated problems as more similar than equivalent ones.
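To make the "first try" figure concrete, here is a minimal sketch of how a top-1 retrieval score could be computed from embeddings. It is an illustration only, not the paper's evaluation code: the function names, the use of cosine similarity, and the toy two-dimensional vectors are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_1(queries, corpus, gold):
    """Fraction of queries whose single nearest corpus item is the gold match.

    queries, corpus: dicts mapping an id to an embedding vector (hypothetical).
    gold: dict mapping each query id to the id of its equivalent problem.
    """
    hits = 0
    for qid, qvec in queries.items():
        best = max(corpus, key=lambda cid: cosine(qvec, corpus[cid]))
        hits += best == gold[qid]
    return hits / len(queries)

# Toy embeddings, invented for illustration.
corpus = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]}
queries = {"q1": [0.9, 0.1], "q2": [0.2, 0.8]}
gold = {"q1": "c1", "q2": "c3"}

score = recall_at_1(queries, corpus, gold)
print(score)  # 0.5
```

In the toy data, query q2 lands closer to c2 than to its true equivalent c3, which mirrors the failure the researchers describe: an unrelated problem ranked as more similar than the equivalent one.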

The dataset also includes a retrieval-augmented generation benchmark, testing whether giving a model a structurally related problem before asking it to solve a new one improves performance. It does, but only when the retrieved problem is genuinely relevant. DeepSeek-V3.2-Speciale gained up to 12 percentage points with well-matched retrieval, while irrelevant retrieval degraded performance in roughly 22 percent of cases.
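The setup can be sketched roughly as follows, assuming a related problem and its solution are simply placed ahead of the new problem in the prompt. The prompt layout, the helper names, and the per-problem definition of "degraded" (solved without retrieval, failed with it) are assumptions made for illustration, not details confirmed by the article, and the outcome lists are invented.

```python
def build_prompt(problem, retrieved=None):
    """Assemble a prompt, optionally prefixed with a retrieved problem and solution."""
    parts = []
    if retrieved is not None:
        parts.append("Related problem:\n" + retrieved["problem"])
        parts.append("Its solution:\n" + retrieved["solution"])
    parts.append("Now solve:\n" + problem)
    return "\n\n".join(parts)

def point_gain(baseline_correct, rag_correct):
    """Accuracy change in percentage points when retrieval is added."""
    n = len(baseline_correct)
    base = 100 * sum(baseline_correct) / n
    rag = 100 * sum(rag_correct) / n
    return rag - base

def degraded_share(baseline_correct, rag_correct):
    """Share of problems solved without retrieval but failed with it."""
    pairs = list(zip(baseline_correct, rag_correct))
    worse = sum(1 for b, r in pairs if b and not r)
    return worse / len(pairs)

# Invented per-problem outcomes (True means solved).
baseline = [True, False, False, True, False]
with_rag = [True, True, False, False, True]

gain = point_gain(baseline, with_rag)
share = degraded_share(baseline, with_rag)
prompt = build_prompt(
    "Prove that n^2 + n is even for every integer n.",
    {"problem": "Show that n(n + 1) is divisible by 2.",
     "solution": "One of two consecutive integers is even."},
)
print(gain, share)  # 20.0 0.2
```

Tracking the gain and the degraded share separately matters because a positive average gain can hide individual problems where an irrelevant retrieved example led the model astray.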

Alshammari wrote the paper with Safaei, HUMAIN AI engineer Abrar Zainal, KAUST Academy Director Sultan Albarakati, and MIT CSAIL colleagues: master's student Kevin Wen SB ’25; Microsoft Principal Engineering Manager Mark Hamilton SM ’22, PhD ’25; and professors William Freeman and Antonio Torralba. Their work was funded, in part, by the Schwarzman College of Computing Fellowship and the National Science Foundation.

MathNet is publicly available at mathnet.csail.mit.edu.
