GitHub Blog · May 9, 2026 00:00 · approx. 21 min read

How researchers are using GitHub Innovation Graph data to reveal the “digital complexity” of nations

#OpenSource #EconomicComplexity #DataAnalysis #MacroeconomicForecasting

TL;DR

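Research using GitHub Innovation Graph data proposes “digital complexity,” which conventional economic indicators fail to capture, as a new measure that helps explain national GDP, inequality, and emissions.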

AI deep analysis · May 9, 2026 01:05

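Key points

Quantifying digital complexity

Applying the Economic Complexity Index (ECI) to GitHub Innovation Graph data (the geographic distribution of developers and the languages they use) yields a country-level measure of the complexity of software production.

Overcoming the blind spot of conventional indicators

The approach makes software visible as an economic element that data on physical exports and patents overlook (the economy’s “digital dark matter”), giving a more complete picture of national economies.

Improved explanatory power

The new indicator helps explain macroeconomic characteristics such as GDP per capita, inequality, and emissions, and adds information that trade and patent data partly miss.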

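The novelty of the software Economic Complexity Index (ECI)

The software ECI, based on GitHub Innovation Graph data, surfaces information that conventional measures such as trade and patents do not capture, and it helps explain variation in GDP per capita and income inequality.

The “principle of relatedness” in software

As in the physical economy, countries tend to diversify into fields related to their existing technology stacks; moves into new software specializations are not random.

From programming languages to “software bundles”

Rather than treating languages individually, the researchers computed cosine similarity from co-occurrence within repositories and applied hierarchical clustering to group 150 languages into 59 coherent technology stacks (software bundles).

How the Economic Complexity Index (ECI) is computed

Revealed comparative advantage is computed from each country’s distribution of developers, and countries that specialize in many non-ubiquitous bundles score higher. The relatedness analysis measures how close bundles are, based on the tendency of countries that are good at one technology stack (bundle) to also be good at another, and then tests whether countries are more likely to enter bundles adjacent to their existing specializations.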

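Impact analysis

By bringing open source software production into the analysis of national economies, this research could change how macroeconomic analysis is done. As digitalization advances, it offers policymakers and investors a new tool for uncovering economic structure that physical trade data alone cannot show.

Editorial comment

This is a notable case of GitHub’s technical data being used as an indicator of the “complexity” of national economies. The analysis brings out aspects of the digital economy that conventional trade statistics did not show.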

One of our goals for the GitHub Innovation Graph was to facilitate research on the economic impact of open source software and developer collaboration. In a paper recently published by Research Policy, four researchers used Innovation Graph data to do just that. I’m happy to share an interview with these researchers, along with our Q4 2025 data release.

The Research Policy paper examines whether the geography of open-source software production on GitHub can reveal the “digital complexity” of nations, and whether that complexity predicts GDP, inequality, and emissions in ways that traditional economic data misses.

Meet the four researchers:

Sándor Juhász is a research fellow at the Corvinus University of Budapest. His work focuses on economic geography, knowledge networks, and how spatial structures shape innovation.

Johannes Wachs is an Associate Professor at Corvinus University of Budapest, Director of the Center for Collective Learning at the Corvinus Institute of Advanced Study, and a researcher at the Complexity Science Hub in Vienna. His work sits at the intersection of computational social science and economic geography, with a particular focus on open-source software communities.

Jermain Kaminski is an Assistant Professor at the School of Business and Economics at Maastricht University. His research specializes in entrepreneurship, strategy, and causal machine learning, with a focus on how data-driven methods can improve decision-making and innovation. He is a cofounder of the Causal Data Science Meeting.

César A. Hidalgo is a professor at the Toulouse School of Economics and Corvinus University of Budapest, and he is the Director of the Center for Collective Learning. He is also the creator of the Observatory of Economic Complexity and cofounder of DataWheel.

Research Q&A

Kevin: Thanks so much for chatting, everyone! Could you give a quick high-level summary of the paper for our readers here?

Sándor: For the last fifteen years or so, economists have been measuring the complexity of national economies by looking at what physical products countries export, what patents they file, and what research they publish. These measures turn out to be remarkably good at predicting which countries will grow, which have high inequality, amongst many other macroeconomic features. But they all have a massive blind spot: software.

Jermain: Code doesn’t go through customs. It crosses borders through “git push”, cloud services, and package managers. So all that productive knowledge was essentially invisible, what some colleagues have called the “digital dark matter” of the economy. We decided to fix that using the GitHub Innovation Graph, which tracks how many developers in each economy push code in each programming language, based on IP addresses. We applied the Economic Complexity Index (ECI) to this data. The bottom line is that software ECI surfaces new information that trade flows, patents, and research data partly leave on the table. In particular, software ECI helps explain variation in GDP per capita and income inequality even after you control for all the traditional measures.

Johannes: We also found that countries don’t jump randomly between software specializations. They diversify into technology stacks that are related to what they already do, just like countries in the physical economy tend to move into products similar to what they already export. This is known as the “principle of relatedness,” and it holds for software too.

Kevin: Interesting! Could you provide an overview of the methods you used in your analysis?

Johannes: Sure. As mentioned, the core data comes from the GitHub Innovation Graph, which gives us quarterly counts of developers pushing code by economy and programming language for 163 economies and 150 languages from 2020 to 2023. But individual programming languages aren’t really the right unit: most real software uses bundles of languages together. A web app might combine HTML, CSS, and JavaScript; a data science project uses Python and Jupyter Notebook; systems programming pairs C with Assembly.

Sándor: So we built a separate dataset by querying the GitHub GraphQL API for all repositories active in 2024 to find which languages co-occur within the same repos. We computed cosine similarity between languages based on weighted co-occurrence, with a normalization scheme so that polyglot repos with twenty languages don’t dominate the signal, and then applied hierarchical clustering to group the 150 languages into 59 “software bundles.” Each bundle represents a coherent technology stack.
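The bundling step Sándor describes can be illustrated with a small, self-contained sketch. It is not the authors’ code. The repository data is made up, the per-repo normalization (each repo’s language weights rescaled to sum to 1) is only one plausible reading of the “normalization scheme” mentioned above, and the average-linkage rule with a `max_distance` cutoff is an assumption chosen for simplicity.

```python
from itertools import combinations
from math import sqrt


def language_vectors(repos):
    """Map each language to {repo_index: share}; shares sum to 1 within a repo,
    so a repo with many languages does not outweigh a repo with few."""
    vectors = {}
    for i, repo in enumerate(repos):
        total = sum(repo.values())
        for lang, weight in repo.items():
            vectors.setdefault(lang, {})[i] = weight / total
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


def cluster(vectors, max_distance):
    """Average-linkage agglomerative clustering on distance = 1 - cosine.
    Merging stops once the closest pair of clusters is farther than max_distance."""
    clusters = [[lang] for lang in sorted(vectors)]

    def dist(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(1 - cosine(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        if dist(clusters[i], clusters[j]) > max_distance:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)


# Hypothetical repositories: language -> weight (for example, bytes of code).
repos = [
    {"HTML": 30, "CSS": 20, "JavaScript": 50},
    {"HTML": 10, "CSS": 10, "JavaScript": 80},
    {"Python": 60, "Jupyter Notebook": 40},
    {"Python": 90, "Jupyter Notebook": 10},
    {"C": 95, "Assembly": 5},
    {"C": 70, "Assembly": 30},
]
vectors = language_vectors(repos)
bundles = cluster(vectors, max_distance=0.5)
print(bundles)
```

On this toy input the three groups of languages never share a repository, so they end up in three separate bundles (web, data science, systems). With real data the cutoff, or the target number of clusters, would have to be chosen far more carefully.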

Jermain: …and from there, it’s the “standard” economic complexity pipeline. We build a country-by-bundle matrix, compute revealed comparative advantage, essentially asking, “does this country have a disproportionate share of developers in this bundle relative to the global average?”, binarize it, and then apply the iterative method to compute the Economic Complexity Index. Countries that specialize in many non-ubiquitous bundles score high, and countries that only specialize in things everyone does score low. For the relatedness analysis, we define proximity between bundles using co-specialization patterns. If countries that are good at bundle A also tend to be good at bundle B, those bundles are close in the software space. Then we test whether countries are more likely to enter bundles that are close to their existing specializations.
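As a rough illustration of the pipeline Jermain outlines, the sketch below computes revealed comparative advantage, binarizes it, runs an iterative “method of reflections” to get a standardized complexity score, and derives bundle proximity and a relatedness density. The three economies and their developer counts are invented. The RCA threshold of 1, the 18 iterations, and the proximity formula (co-specialization count divided by the larger ubiquity, as in the product-space literature) are assumptions; the paper’s exact choices may differ.

```python
from statistics import mean, pstdev


def rca(counts):
    """Revealed comparative advantage: a country's share of developers in a
    bundle divided by that bundle's share of developers worldwide."""
    bundles = sorted({b for row in counts.values() for b in row})
    world = {b: sum(row.get(b, 0) for row in counts.values()) for b in bundles}
    world_total = sum(world.values())
    return {
        country: {b: (row.get(b, 0) / sum(row.values())) / (world[b] / world_total)
                  for b in bundles}
        for country, row in counts.items()
    }


def binarize(rca_values, threshold=1.0):
    """1 where a country is specialized in a bundle (RCA >= threshold), else 0."""
    return {c: {b: int(v >= threshold) for b, v in row.items()}
            for c, row in rca_values.items()}


def eci(m, iterations=18):
    """Method of reflections with an even number of iterations, then z-scored.
    Assumes every country and bundle has at least one specialization and that
    the countries do not all end up with the same value."""
    countries, bundles = list(m), list(next(iter(m.values())))
    diversity = {c: sum(m[c].values()) for c in countries}
    ubiquity = {b: sum(m[c][b] for c in countries) for b in bundles}
    k_c, k_b = dict(diversity), dict(ubiquity)
    for _ in range(iterations):
        next_c = {c: sum(m[c][b] * k_b[b] for b in bundles) / diversity[c]
                  for c in countries}
        next_b = {b: sum(m[c][b] * k_c[c] for c in countries) / ubiquity[b]
                  for b in bundles}
        k_c, k_b = next_c, next_b
    values = list(k_c.values())
    mu, sigma = mean(values), pstdev(values)
    return {c: (k_c[c] - mu) / sigma for c in countries}


def proximity(m):
    """Bundles are close when the same countries tend to specialize in both."""
    countries, bundles = list(m), list(next(iter(m.values())))
    ubiquity = {b: sum(m[c][b] for c in countries) for b in bundles}
    return {p: {q: sum(m[c][p] * m[c][q] for c in countries)
                   / max(ubiquity[p], ubiquity[q])
                for q in bundles if q != p}
            for p in bundles}


def density(m, phi, country, bundle):
    """Share of the proximity around `bundle` that comes from bundles the
    country is already specialized in."""
    total = sum(phi[bundle].values())
    return sum(m[country][q] * w for q, w in phi[bundle].items()) / total


# Hypothetical developer counts per economy and bundle.
counts = {
    "A": {"web": 30, "data": 45, "embedded": 25},
    "B": {"web": 70, "data": 30, "embedded": 0},
    "C": {"web": 95, "data": 5, "embedded": 0},
}
m = binarize(rca(counts))
scores = eci(m)
phi = proximity(m)
print(sorted(scores, key=scores.get, reverse=True))
print(density(m, phi, "C", "data"), density(m, phi, "C", "embedded"))
```

In this toy example economy A, the only one specialized in the rare “embedded” bundle, ranks highest, and economy C, specialized only in the common “web” bundle, ranks lowest. C’s density is higher for “data” than for “embedded”, which is the kind of signal the relatedness analysis tests against observed entries into new bundles.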

Kevin: Nice! Follow-up question: could you provide an “explain it like I’m five” overview of the methods you used in your analysis?

César: Think of countries like kitchens. Some kitchens can cook anything, since they have an abundance of ingredients and tools, from the rarest spices to the best knives. Others are more limited. Maybe they can boil rice and do a few other simple things. Since we cannot look at the kitchens directly, we need to infer their “complexity” based on the dishes they are able to produce. This is what the economic complexity index or ECI allows you to estimate. We can infer what’s going on in the kitchen by seeing if it is a chicken and rice operation, or a place that can produce sophisticated edible foams and souffles. Originally, these methods were applied to trade data, where the dishes coming out of the kitchen were a country’s exports, but in this paper, we applied that to software. A chicken-and-rice country is a Python and JavaScript country. A Michelin-star country is one that can program certified embedded systems for aerospace and defense.

Top 20 economies by software economic complexity

Ranking  Economy         Software ECI
1        Germany         1.739
2        Australia       1.730
3        Canada          1.729
4        Netherlands     1.727
5        France          1.702
6        United States   1.695
7        Poland          1.691
8        United Kingdom  1.687
9        Italy           1.672
10       Sweden          1.620
11       Switzerland     1.620
12       Hong Kong SAR   1.595
13       Norway          1.571
14       Japan           1.552
15       Spain           1.552
16       Russia          1.530
17       Singapore       1.468
18       Taiwan          1.464
19       Belgium         1.448
20       Finland         1.444

Kevin: Thanks, that’s super helpful. I’d be curious about the limitations of your paper and data that you wished you had for further work. What would the ideal datasets look like for you?

Johannes: One major drawback is that we only see public GitHub activity. That means we’re missing proprietary software entirely. Hence, we can’t see closed-source enterprise work, which is huge. So our measure likely underestimates software complexity in countries with a weaker open source software culture.

Sándor: The time window is another constraint. Four years of data (2020–2023) is enough for cross-sectional analysis but too short to credibly test long-run growth predictions, which is what economic complexity measures are really designed for. Economic structures shift over decades, not quarters. We’d love to have twenty years of this data.

Jermain: The dream dataset would combine GitHub-like activity data with information about the projects themselves, not just languages, but frameworks, libraries, and what the software actually does. Considering this dimension would be a natural next step for our project, and it would shed more light on software bundles and use cases. If we knew that a repo was building a fintech application versus a game engine, we could define much finer-grained capability bundles. GitHub Topics gives us a taste of this, and we used it as a robustness check, but it’s still noisy and incomplete.

Kevin: Do you have any predictions for the future? Recommendations for policymakers? Recommendations for developers?

César: Software is an interesting target for industrial policy because it is an industry that depends primarily on highly mobile human capital (software developers). In principle, it provides an opportunity for development that can be incentivized via talent attraction programs. In practice, however, the high mobility of software talent can be a double-edged sword, since that makes it sensitive to consumer protection regulations that make it hard to work with data or worker protection schemes that distribute the risk of innovation to small and medium-sized firms (e.g., laws that on paper protect workers, but that in reality pass on that responsibility to the firms). The countries that figure out how to attract software talent without suffocating it with well-intentioned but poorly designed regulation will pull ahead.

Johannes: For developers, understanding that places are highly specialized in the kind of software they produce is useful when they are looking to relocate. Developers can use the product space representation of software capabilities to know which countries their skillsets are a good match for.

Jermain: Looking ahead, the big question is what generative AI does to this picture. If AI coding assistants lower the barrier to working in new programming languages, does relatedness weaken? Do countries diversify faster? Or does it reinforce existing advantages because the countries with the best AI infrastructure benefit most? We’re working on this, and Johannes and his colleagues have a new paper in Science on tracking the global diffusion of AI-assisted coding on GitHub. I think the answer will reshape how we think about digital complexity within the next five years. One further consideration would be how classifications of software or software bundles would be represented as NAICS or NACE industry codes.

Sándor: I’d add a prediction: I think we’ll see economic complexity indices based on software data become a standard part of the policymaker’s toolkit within the decade, sitting right alongside the trade-based measures. The data is open, it updates quarterly, and it captures something that traditional data genuinely can’t.

Personal Q&A

Kevin: I’d like to change gears a bit to chat more about your personal stories. Johannes, I understand that you have a background in computational social science and network science, which is a bit different from the traditional economics path. Tell us more about your path to research.

Johannes: I actually started in mathematics and then moved into computational social science during my PhD at Central European University in Budapest. I became enchanted by the opportunities that digital data traces present for studying human behavior. I like using network methods because they help us move between the micro level activity and interactions found in such traces and the macro outcomes. I stumbled into open source research in particular when I realized that GitHub data was this incredibly rich, publicly available record of valuable knowledge production that few people were using to study social science questions.

Kevin: Sándor, I see you have a background in economic geography, which is a more traditional route compared to computational social science. What was your path toward working with software data?

Sándor: I received my PhD in economic geography at Utrecht University, in a research community that was already using economic complexity to study regional development. So I was trained in thinking about places (cities, regions, industries) through the lens of networks and capability accumulation.

Kevin: Jermain, it looks like you developed practical technical expertise through some entrepreneurial projects in parallel with academic training.

Jermain: During my PhD at RWTH Aachen, I was a visiting researcher with César at MIT. In that time, I was also working with a colleague on a project called Moviegalaxies.com (open data) and later worked on analyzing text, speech and video data in Kickstarter projects. It was my first multimodal machine learning pipeline. From my network analysis projects, I somehow ended up analyzing passing networks for a larger German soccer team. These days my research is mostly concerned with causality and causal machine learning. In this capacity, I co-founded the Causal Data Science Meeting with my colleague Paul Hünermund.

Kevin: César, do I have it right that you have a background in physics?

César: I started in physics, with a PhD at Notre Dame focused on complex networks. During that time, I realized that network tools could be used to describe the evolution and fate of economies. Eventually, this became a field that we know today as economic complexity, which studies the process of economic development by using tools from physics, economics, and computer science.

Kevin: Finding a niche that you’re passionate about is such a joy, and I’m curious about how you’ve found living in that niche. What’s the day-to-day like for you?

Johannes: Honestly, in research, the day-to-day is a mix of writing code, writing papers, and talking to people, then iterating. Of course, working at a university usually comes with teaching and administration, too. I like that I have a good amount of freedom in what I choose to work on. If a project or direction doesn’t spark joy, I can usually shift my focus. That is a unique thing.

Sándor: I’d add that one of the best parts of this niche is the interdisciplinary community. On any given week I might talk to an economic geographer, a computer scientist, and a physicist about the same research question. That’s unusual and very stimulating.

Kevin: Have things changed since generative AI tooling came along? Have you found generative AI tools to be helpful?

Johannes: Absolutely. We use LLM tools regularly now for things like debugging data pipelines, drafting boilerplate code, and even sanity-checking statistical approaches. It’s particularly useful in a project like this one, where you have a lot of different methods and need to coordinate work in a team. That said, LLMs are much more helpful if you already have a clear idea in mind.

Kevin: Do you have any advice for folks who are starting out in software engineering or research? What tips might you give to a younger version of yourself, say, from 10 years ago?

César: The key is to invest in things that grow or compound. This is easier said than done because there are always distractions and temptations. I’ve seen many scholars spend months or years working on projects just because they don’t want to lose the work that they’ve already put into them. The cost of doing that is not working on other projects that might matter more in ten or twenty years. Building tools that can generate an audience, like The Observatory of Economic Complexity, Data USA, or Pantheon, was challenging, but they have borne fruit for a long time. The same is true about working on a few important papers or completing a book. The question you need to ask when working on a project is whether you honestly believe that the project will be more important a decade from now than it is today. If the answer is yes, that’s probably a good project. Ten years ago, I would have told myself to trust that test more and to walk away from “almost done” projects faster. Sunk costs are the most expensive thing in a research career.

Johannes: I can offer a few suggestions for young researchers. The first is to build a broad question and research agenda to motivate what you do. You have to have a problem you care about so much that even partial or highly specific results about that problem get you excited. Once you have that, in practice I think there is a lot of value in generating your own data. I prefer applying a straightforward method to a bespoke dataset to applying a highly complex method to a dataset everyone knows.

Jermain: My advice echoes César’s: don’t ride a dead horse. In the years after the PhD and into assistant professorship, it’s tempting to keep milking old topics while pivoting to new ones, but this leaves you straddling two worlds and mastering neither. Pick your focus deliberately, narrow enough to build real expertise, broad enough to stay curious, and be willing to let go of past work that no longer aligns, even if it feels wasteful.

Sándor: I’d tell my younger self to collaborate more and earlier. This paper has four authors across five institutions in four countries. That wouldn’t have happened if any of us had stayed in our silos. Go to conferences outside your field, say yes to coffee meetings with people whose work seems tangentially related, and don’t be afraid to cold-email researchers whose work you admire.

Kevin: Are there any learning resources you might recommend to someone interested in learning more about this space?

César: The Observatory of Economic Complexity, for a web experience, and The Infinite Alphabet and the Laws of Knowledge, for a book that puts this in context.

Jermain: If you’re a developer curious about the economics angle, I’d honestly just recommend browsing the Observatory of Economic Complexity and looking up your own country. See what it exports, where it sits in the product space, and then think about how software fits in. It’s a very intuitive way to build the intuition before diving into the math.

Kevin: Thank you, Sándor, Johannes, Jermain, and César! It’s been fascinating to learn about your current work and broader career trajectories. We truly appreciate you taking the time to speak with us and will absolutely keep following your work.

The post How researchers are using GitHub Innovation Graph data to reveal the “digital complexity” of nations appeared first on The GitHub Blog.
