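GitHub Blog · June 16, 2026, 04:17 · About 8 min read

Accelerating researchers and developers building multilingual AI with a new open dataset

#Multilingual AI #Open datasets #Natural language processing (NLP) #GitHub #LLM evaluation

TL;DR

GitHub has released the GitHub Multilingual Repositories Dataset, a new open, metadata-based dataset intended to support multilingual collaboration in non-English developer communities.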

AI in-depth analysis · June 16, 2026, 05:02

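Key points

Making non-English developer content visible

Language distribution differs across READMEs, issues, and pull requests (for example, Korean is the most common non-English language in issues but only fifth in READMEs). The dataset exposes these differences and supports the discovery of repositories with non-English content.

High-quality metadata structure

With more than 80 million classification rows, the dataset provides metadata that serves as evidence of multilingual collaboration, not the repository content itself, under the CC0-1.0 license.

Multiple classifiers used in parallel

Results from three different language classifiers (fastText, gcld3, and lingua-py) are published separately rather than merged, so researchers can choose their own balance between precision and recall.

Putting the European Digital Commitments into practice

The dataset was released as part of a commitment made in 2025, under Microsoft's European Digital Commitments, to make multilingual data more accessible, including to open source AI developers.

Appropriate uses and limitations of the dataset

The dataset is not a ground-truth benchmark for language identification. It is designed as a transparent discovery tool that lets users choose their own precision and recall tradeoff. It is repository-level metadata and should not be used to infer sensitive attributes of developers or communities.

Closing the multilingual gap with open data

Many European languages are underrepresented in online text. Publishing open data that captures the context specific to developer collaboration (READMEs, issues, and so on) helps make AI tools fairer and keeps communities from being left behind.

Next steps and an international conference

The significance of the dataset and the importance of open data for multilingual AI will be discussed at the Open Innovation Dialogue Hub in Strasbourg on June 16.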

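Impact analysis

The dataset is an important resource for evaluating and improving today's AI coding tools and documentation generators, which carry an English-centric bias. Because it makes it possible to quantify developer activity in lower-resource languages, it can help accelerate the development of fairer and more inclusive AI tools.

Editorial comment

This is a practical and strategic step toward correcting language disparities in an English-centric AI development environment. For researchers working to improve support for lower-resource languages, it is valuable infrastructure that can be put to use right away.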

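Software may be written in programming languages, but human language is at the heart of collaboration between developers. Developers explain how a project works in READMEs, ask for help in issues, and review, debate, and improve code in pull requests. That collaboration often happens in English, but not always. As AI takes on a larger role in how software is built, multilingual developer content matters more than ever.

Today, GitHub is publishing the GitHub Multilingual Repositories Dataset, a repository-level metadata dataset designed to help researchers and developers discover public GitHub repositories that show evidence of non-English natural-language content. While building it, we found that language distribution differs across READMEs, issues, and pull requests. For example, Korean is the most common non-English language in issue text but only the fifth most common in READMEs. Portuguese tops the non-English README list, with more than 3 million repositories.

The dataset is now available on GitHub under the CC0-1.0 license. It delivers on a commitment we made in 2025, as part of Microsoft's European Digital Commitments, to make multilingual data more accessible, including to open source AI developers.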

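What's in the dataset

The GitHub Multilingual Repositories Dataset is not a dump of repository content. It is a metadata dataset that helps developers and researchers find repositories where multilingual collaboration may be happening. It covers more than 80 million classification rows across more than 40 million repositories. For each public repository, we provide:

Language classifications of the README, the most-commented issue, and the most-commented pull request, using the first 150 characters of each as the input sample. Texts under 20 characters are excluded.

Classifications of each text source from fastText, gcld3, and lingua-py, each with a confidence score. The dataset includes only classifications with a confidence above 0.5.

Repository metadata: creation timestamp, disk usage, stars, forks, primary programming language, SPDX license, issue and pull request counts, and the snapshot date.

We deliberately did not collapse the three classifiers into a single label. Different classifiers have different coverage and confidence calibration, especially for lower-resource languages. By exposing all three, we let users decide how strict they want to be. Need a high-precision Greek subset? Require all three classifiers to agree above some confidence threshold. Need broad recall for an exploratory study of Romance languages? One classifier may be enough.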

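What you can build with it

The dataset is designed for the kind of work that is hard to do with general web text:

Discover repositories likely to contain developer documentation or collaboration in specific languages.

Study how non-English developer communities use issues, pull requests, and READMEs.

Build evaluation sets for AI coding tools, documentation generators, or review assistants that need to behave well across languages.

Encourage decision-makers to expand language coverage for new developer tools and AI features, using data-backed arguments about the rich multilingual diversity of developers.

Measure the representation of European and other underrepresented languages in open source.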

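Some caveats

Language identification is hard, especially in software repositories. Repository text is often short and may include badges, templates, installation commands, code snippets, usernames, or mixed-language content. A 150-character sample may not represent the whole repository. Classifier coverage and calibration also vary by language, especially for lower-resource languages.

For that reason, the dataset should not be treated as a ground-truth benchmark for language identification. It is designed as a transparent discovery tool. Users can inspect classifications, confidence scores, and sources, then choose the precision and recall tradeoff that fits their own research or development workflow.

The dataset also must not be used to infer sensitive attributes of repository owners, contributors, or communities. The signals are repository-level metadata, not person-level attributes.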

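Why open multilingual data matters

Today, many European languages remain underrepresented in the online text used to build and evaluate AI systems. That creates a risk that AI tools work well for some developers, languages, and communities while leaving others behind. Open data can help close that gap. We built this dataset because developer content is different from general web text. READMEs, issues, and pull requests contain the language of software collaboration: installation instructions, bug reports, feature requests, review comments, and community norms. That context can help build AI systems that better understand how developers actually work.

By making multilingual developer-content signals easier to find and analyze, this dataset gives researchers, open source developers, and model builders another tool for studying language representation in software development. It can help identify gaps, support better evaluation, and inform more inclusive AI tools for developers across Europe and beyond. It also reflects a broader principle: building AI for developers should include the communities, languages, and workflows developers actually use.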

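What's next

We will discuss the dataset, and the broader importance of open data for multilingual AI, at the Open Innovation Dialogue Hub in Strasbourg on June 16. The event is co-organized by the Microsoft Open Innovation Center, the Council of Europe, and GitHub, and brings together policymakers, researchers, cultural institutions, and open innovation leaders to discuss AI, linguistic diversity, cultural heritage, and open data.

Multilingual AI needs multilingual developer communities. We hope this dataset helps more people study, support, and build for them. By releasing it under CC0-1.0 on GitHub, we are inviting researchers, open source maintainers, and model builders to use it, critique it, extend it, and build evaluation sets and tools on top of it.

If you do something interesting with it, please let us know.

The post Accelerating researchers and developers building multilingual AI with a new open dataset appeared first on The GitHub Blog.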

Original text

Software may be written in programming languages, but human language is at the heart of developer collaboration. Developers explain how projects work in READMEs. They ask for help in issues. They review, debate, and improve code in pull requests. That collaboration often happens in English—but not always. As AI becomes a bigger part of how developers build software, multilingual developer content matters more than ever.

Today, GitHub is publishing the GitHub Multilingual Repositories Dataset, a repository-level metadata dataset designed to help researchers and developers discover public GitHub repositories with evidence of non-English natural-language content. When building the dataset, we found that language distribution differs across READMEs, issues and pull requests: Korean is the most common non-English language in issue text, but only the fifth-most common in READMEs. Portuguese tops the non-English README list with more than 3 million repositories.

The dataset is now available on GitHub under CC0-1.0. It follows through on a commitment we made in 2025, as part of Microsoft’s European Digital Commitments, to make multilingual data more accessible, including to open source AI developers.

What’s in the dataset

The GitHub Multilingual Repositories Dataset is intentionally not a dump of repository content. Instead, it is a metadata dataset that helps developers and researchers find repositories where multilingual collaboration may be happening. The dataset covers over 80 million classification rows across more than 40 million repositories. For each public repository, we provide:

Language classifications of the README, the most-commented issue, and the most-commented pull request, with the first 150 characters of each used as the input sample. We exclude texts under 20 characters.

Classifications for each text source, from fastText, gcld3, and lingua-py, each with a confidence score. The dataset only includes classifications with >0.5 confidence.

Repository metadata: creation timestamp, disk usage, stars, forks, primary programming language, SPDX license, issue and pull request counts, and the snapshot date.
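
The sampling and filtering rules listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline GitHub used: the `ClassificationRow` fields and the helper names `make_sample` and `keep_classification` are hypothetical, the dataset's real column names may differ, and the post does not say whether the 20-character minimum is checked before or after truncation (the sketch checks the full text).

```python
from dataclasses import dataclass
from typing import Optional

SAMPLE_CHARS = 150      # first 150 characters are used as the input sample
MIN_CHARS = 20          # texts under 20 characters are excluded
MIN_CONFIDENCE = 0.5    # only classifications with > 0.5 confidence are kept


@dataclass(frozen=True)
class ClassificationRow:
    """Hypothetical shape of one classification row; real columns may differ."""
    repo: str          # e.g. "owner/name"
    source: str        # "readme", "issue", or "pull_request"
    classifier: str    # "fasttext", "gcld3", or "lingua-py"
    language: str      # language code reported by the classifier
    confidence: float  # classifier confidence score


def make_sample(text: str) -> Optional[str]:
    """Return the classifier input sample, or None if the text is too short."""
    if len(text) < MIN_CHARS:
        return None
    return text[:SAMPLE_CHARS]


def keep_classification(row: ClassificationRow) -> bool:
    """Apply the strict > 0.5 confidence cut-off described in the post."""
    return row.confidence > MIN_CONFIDENCE
```

One consequence worth noting: because only rows above the cut-off are published, the absence of a row for a repository does not mean the text was English or empty. It may only mean no classifier was confident enough.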

We deliberately did not collapse the three classifiers into a single label. Different classifiers have different coverage and confidence calibration, especially for lower-resource languages. By exposing all three, we let you decide how strict you want to be. Want a high-precision Greek subset? Require all three classifiers to agree above some confidence threshold. Want broad recall for an exploratory study of Romance languages? One classifier may be enough.
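
The two selection strategies mentioned above (all three classifiers agreeing versus any single classifier) might look like the sketch below. The row layout, the classifier labels, the function names, and the default threshold of 0.9 are all assumptions made for illustration; the post only says "some confidence threshold".

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical row layout: (repo, source, classifier, language, confidence).
Row = Tuple[str, str, str, str, float]

# Assumed labels for the three classifiers named in the post.
CLASSIFIERS = frozenset({"fasttext", "gcld3", "lingua-py"})


def _votes(rows: Iterable[Row], language: str, source: str,
           min_confidence: float) -> Dict[str, Set[str]]:
    """Map each repo to the set of classifiers that reported `language`."""
    votes: Dict[str, Set[str]] = defaultdict(set)
    for repo, src, classifier, lang, confidence in rows:
        if src == source and lang == language and confidence > min_confidence:
            votes[repo].add(classifier)
    return votes


def high_precision(rows: Iterable[Row], language: str,
                   source: str = "readme",
                   min_confidence: float = 0.9) -> Set[str]:
    """Repos where all three classifiers agree above the threshold."""
    votes = _votes(rows, language, source, min_confidence)
    return {repo for repo, who in votes.items() if CLASSIFIERS <= who}


def broad_recall(rows: Iterable[Row], language: str,
                 source: str = "readme",
                 min_confidence: float = 0.5) -> Set[str]:
    """Repos where at least one classifier reports the language."""
    return set(_votes(rows, language, source, min_confidence))
```

Calling `high_precision(rows, "el")` would give a strict Greek subset, while `broad_recall(rows, "pt")` would cast a wide net for Portuguese. The same helper could be run once per language for a Romance-language study.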

What you can build with it

The dataset is designed for the kind of work that’s hard to do with general web text:

Discover repositories likely to contain developer documentation or collaboration in specific languages.

Study how non-English developer communities use issues, pull requests, and READMEs.

Build evaluation sets for AI coding tools, doc generators, or review assistants that need to behave well across languages.

Encourage decision-makers to expand language coverage for new developer tools and AI features using data-backed arguments on the rich multilingual diversity of developers.

Measure representation of European and other underrepresented languages in open source.
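
As one example of the measurement use case, the following sketch ranks non-English languages per text source by the number of distinct repositories, which is the kind of comparison behind the Korean and Portuguese observation earlier in the post. The row layout and the function name `language_ranking` are hypothetical, and counting a repository once when any classifier reports a language is an assumption, not the method GitHub describes.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical row layout: (repo, source, classifier, language, confidence).
Row = Tuple[str, str, str, str, float]


def language_ranking(rows: Iterable[Row],
                     exclude: Tuple[str, ...] = ("en",)
                     ) -> Dict[str, List[Tuple[str, int]]]:
    """Rank languages per text source by number of distinct repositories."""
    seen = set()  # (repo, source, language) triples already counted
    counts: Dict[str, Counter] = defaultdict(Counter)
    for repo, source, _classifier, language, _confidence in rows:
        key = (repo, source, language)
        if language in exclude or key in seen:
            continue
        seen.add(key)
        counts[source][language] += 1
    # most_common() sorts each source's languages by descending repo count.
    return {source: counter.most_common() for source, counter in counts.items()}
```

Deduplicating on the (repo, source, language) triple keeps a repository from being counted three times when all classifiers agree. A stricter study could filter the rows by agreement or confidence first.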

Some caveats

Language identification is hard, especially in software repositories. Repository text is often short. It may include badges, templates, installation commands, code snippets, usernames, or mixed-language content. A 150-character sample may not represent the whole repository. Classifiers also vary in coverage and calibration, especially for lower-resource languages.

That is why the dataset should not be treated as a ground-truth benchmark for language identification. Instead, it is designed as a transparent discovery tool. Users can inspect classifications, confidence scores, and sources, then choose the precision and recall tradeoffs that fit their own research or development workflow.
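
To make the precision and recall tradeoff concrete, one could tabulate how many repositories survive at different confidence thresholds and agreement levels. The sketch below assumes a hypothetical row layout and function name; it reports candidate-set sizes only and says nothing about true precision or recall, which would need labeled data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

# Hypothetical row layout: (repo, source, classifier, language, confidence).
Row = Tuple[str, str, str, str, float]


def tradeoff_table(rows: Iterable[Row], language: str, source: str,
                   thresholds: Sequence[float]) -> Dict[float, Dict[int, int]]:
    """For each threshold, count repos with at least k agreeing classifiers."""
    rows = list(rows)  # allow several passes even if given a generator
    table: Dict[float, Dict[int, int]] = {}
    for threshold in thresholds:
        votes = defaultdict(set)  # repo -> classifiers reporting `language`
        for repo, src, classifier, lang, confidence in rows:
            if src == source and lang == language and confidence > threshold:
                votes[repo].add(classifier)
        table[threshold] = {
            k: sum(1 for who in votes.values() if len(who) >= k)
            for k in (1, 2, 3)
        }
    return table
```

Reading the table row by row shows how quickly the candidate set shrinks as the threshold or the required agreement rises, which can help pick a setting before any manual review.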

The dataset also should not be used to infer sensitive attributes about repository owners, contributors, or communities. The signals are repository-level metadata, not person-level attributes.

Why open multilingual data matters

Today, many European languages remain underrepresented in the online text used to build and evaluate AI systems. That creates a risk that AI tools work well for some developers, languages, and communities, while leaving others behind. Open data can help close that gap. We built this dataset because developer content is different from general web text. READMEs, issues, and pull requests contain the language of software collaboration: installation instructions, bug reports, feature requests, review comments, and community norms. That context can help build AI systems that better understand how developers actually work.

By making multilingual developer-content signals easier to find and analyze, this dataset gives researchers, open source developers, and model builders another tool for studying language representation in software development. It can help identify gaps, support better evaluation, and inform more inclusive AI tools for developers across Europe and beyond. It also reflects a broader principle: Building AI for developers should include the communities, languages, and workflows developers actually use.

What’s next

We’ll be discussing the dataset, and the broader importance of open data for multilingual AI, at the Open Innovation Dialogue Hub in Strasbourg on June 16. The event is co-organized by the Microsoft Open Innovation Center, the Council of Europe, and GitHub, and will bring together policymakers, researchers, cultural institutions, and open innovation leaders to discuss AI, linguistic diversity, cultural heritage, and open data.

Multilingual AI needs multilingual developer communities. We hope this dataset helps more people study, support, and build for them. By releasing it under CC0-1.0 on GitHub, we’re inviting researchers, open source maintainers, and model builders to use it, critique it, extend it, and build evaluation sets and tools on top of it.

If you do something interesting with it, we’d love to hear about it.

The post Accelerating researchers and developers building multilingual AI with a new open dataset appeared first on The GitHub Blog.
