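TLDR AI · May 7, 2026, 09:00 · approx. 14 min

ProgramBench: An Agent Evaluation Benchmark for Recreating Software Without Source Code

#Software Engineering #Agent Evaluation #Benchmarking #Code Generation

TL;DR

ProgramBench is a benchmark in which an AI agent must rebuild a piece of software as a working executable using only its documentation and experimentation, with no access to the source code. It is designed to rigorously evaluate the agent's ability to design software.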

AI Deep Analysis · May 7, 2026, 23:08


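Depth: 40%

Key Points

Evaluation that does not depend on source code

Decompilation and internet access are ruled out, so the benchmark focuses on the ability to rebuild software from documentation and trial and error alone.

A large and diverse task set

The benchmark provides 200 tasks, ranging from terminal utilities to compilers and libraries, and verifies them with more than 248,000 behavioral tests in total.

Implementation inside a sandbox

Agents must design and implement everything from scratch inside a sandboxed environment, with no outside help.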


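Impact Analysis

ProgramBench marks an important turning point in quantifying the "engineering ability" of AI agents: not merely emitting code, but understanding a specification and building a system. It is likely to shift the yardstick for the practical usefulness of AI in software development away from generation speed and context understanding and toward the ability to produce software that actually runs.

Editorial Comment

Unlike existing code-generation benchmarks, this is a groundbreaking evaluation that focuses on inferring behavior from documentation and on design ability.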


Can language models rebuild programs from scratch?

Given only a compiled binary and its documentation, agents must architect and implement a complete codebase that reproduces the original program's behavior.

Resolved: the number of fully solved instances as measured by the hidden behavioral tests. Note that behavioral tests can never cover all possible inputs. The behavioral tests of ProgramBench can be easily extended should any false positives arise.

Almost resolved: instances where the agent's solution solves ≥ 95% of all behavioral tests. See extended results.

| Model | Organization | Agent | Almost resolved |
|---|---|---|---|
| Claude Opus 4.7 | Anthropic | mini-SWE-agent | 3.0% |
| Claude Opus 4.6 | Anthropic | mini-SWE-agent | 2.5% |
| Claude Sonnet 4.6 | Anthropic | mini-SWE-agent | 1.0% |
| GPT 5.4 | OpenAI | mini-SWE-agent | 0.0% |
| Gemini 3.1 Pro | Google | mini-SWE-agent | 0.0% |
| Gemini 3 Flash | Google | mini-SWE-agent | 0.0% |
| Claude Haiku 4.5 | Anthropic | mini-SWE-agent | 0.0% |
| GPT 5.4 mini | OpenAI | mini-SWE-agent | 0.0% |
| GPT 5 mini | OpenAI | mini-SWE-agent | 0.0% |

About ProgramBench

In each task, the agent receives an executable and its documentation and must re-implement that executable. It has no access to *any* of the executable's source code, cannot decompile the executable, and cannot use the internet. There are 200 tasks in total covering different program complexities, ranging from small terminal utilities like jq and ripgrep to massive software projects like the PHP interpreter, FFmpeg, and SQLite.

The agent must choose a language, design the architecture, write all source code, and produce a build script. Every design decision is the model's to make.

Once the agent submits a program, our test suite compares the candidate program's behavior against the original program. A candidate program passes only if all tests for that task pass.

Our test suite is generated via agent-driven fuzzing, and it comprises more than 248,000 total behavioral tests for our 200 tasks.
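The pass criterion described above can be illustrated with a minimal sketch. It assumes that a behavioral test compares standard output, standard error, and exit code for the same arguments and standard input; the source does not state exactly what the real test suite compares, and the names `Observation`, `behaves_the_same`, and `task_resolved` are hypothetical. Small Python one-liners stand in for the reference and candidate executables.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible result of one program run (assumed comparison basis)."""
    stdout: bytes
    stderr: bytes
    returncode: int


def run(cmd: list[str], stdin: bytes = b"", timeout: float = 30.0) -> Observation:
    """Run a command and capture what an outside observer can see."""
    proc = subprocess.run(cmd, input=stdin, capture_output=True, timeout=timeout)
    return Observation(proc.stdout, proc.stderr, proc.returncode)


def behaves_the_same(reference: list[str], candidate: list[str],
                     args: list[str], stdin: bytes) -> bool:
    """One behavioral test: identical input must give an identical observation."""
    return run(reference + args, stdin) == run(candidate + args, stdin)


def task_resolved(reference: list[str], candidate: list[str],
                  tests: list[tuple[list[str], bytes]]) -> bool:
    """A task counts as resolved only if every single test passes."""
    return all(behaves_the_same(reference, candidate, args, stdin)
               for args, stdin in tests)


# Stand-in programs (hypothetical): the "reference" upper-cases its stdin.
REFERENCE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
# A re-implementation with different internals but the same behavior.
GOOD = [sys.executable, "-c",
        "import sys; sys.stdout.write(''.join(c.upper() for c in sys.stdin.read()))"]
# A candidate that agrees on empty input but differs on real text.
BAD = [sys.executable, "-c",
       "import sys; sys.stdout.write(sys.stdin.read().title())"]

TESTS = [([], b""), ([], b"hello world")]

print(task_resolved(REFERENCE, GOOD, TESTS))
print(task_resolved(REFERENCE, BAD, TESTS))
```

The second candidate passes one of the two tests and still fails the task, which is the all-or-nothing behavior the paragraph above describes.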

Can tasks in ProgramBench be fully solved at all?

Yes. The agent can run the given program with any input and observe exactly what it does, so there's nothing hidden that can't be discovered through experimentation. The benchmark is hard, but it's solvable by design: all the reference executables pass our test suites. Read more in our blog post.

Why are ProgramBench scores so low?

Building a program from scratch is a fundamentally challenging task. Agents do currently make partial progress on many tasks (see the extended results for details), but fully passing every test is still out of reach.

Agents truly have to architect. This is in part because unlike other whole-repo generation projects, we give no hints or structure to the agent, meaning that the agent truly has to architect its own solutions (see "How is ProgramBench different?").

No harness tuning. Other recent and concurrent work performed substantial harness tuning for a single task or a handful of tasks. We deliberately avoid this, since headline scores from a tuned harness on a curated handful of tasks can substantially overstate how capable agents really are at building software from scratch. Instead, ProgramBench is evaluated with a single generic harness across the entire task set.

Cleanroom implementation. We take substantial precautions to prevent cheating. Agents run in sandboxed containers without internet access, so they cannot retrieve the original source code or obtain any other form of help.

No decompilation. See "Why and how do you block decompilation?"

We review related work in section 6 of the paper. We also discuss cheating in the FAQ below and in section 4.1.

Is your agent scaffold sufficient to solve all tasks?

Widely adopted baseline. We use mini-SWE-agent because it is both widely adopted as a baseline by other benchmarks (SWE-bench Verified, SWE-bench Multilingual, Terminal-bench) and deliberately minimal in its scaffolding, which reduces confounds between model capability and harness design. Most other agents (like Claude Code, with apparently several hundred thousand lines of code) also change constantly in non-transparent ways, while mini-SWE-agent will allow for apples-to-apples performance comparison of models for the foreseeable future.

Almost no runtime limitations. With very few exceptions, models submit their solutions deliberately rather than exceeding our generous time or step limits, and they never exhaust their context window. Because we do not limit total cost, our runs have cost up to $5k (for Sonnet 4.5).

Varying degrees of difficulty. ProgramBench deliberately includes tasks of varying difficulty, from very short repositories of only a few thousand lines of code to extremely large ones. We therefore believe that the extremely low scores are more a signal of inadequate model capabilities than an indicator that only multi-agent systems can solve our tasks. Nonetheless, we would be excited to be one of the first systematic benchmarks that includes tasks that can only be solved by multi-agent systems.

Kicking off a new scaffold race. We believe that mini-SWE-agent is the right choice of baseline and that it can absolutely solve (some of) the tasks. However, we'd be more than excited if ProgramBench kicks off a new scaffold race! We will be opening submissions soon.

Can agents cheat?

Agents run in sandboxed containers with no internet access, execute-only permissions on the binary, and no access to decompilation tools. In early trials without these restrictions, models found shortcuts like cloning source repositories from GitHub or downloading code through package managers. Read more in our blog post and in section 4.1 of the paper.

Why and how do you block decompilation?

The executable given to the agent has execute permission only, not read permission. That means that any operation other than execution (such as running a decompiler, a disassembler, objdump, strings, or hexdump) will fail.

We do this because we want ProgramBench to answer the question "How well can LMs build programs from scratch?" rather than "How well can LMs patch together bits of decompiled code?"
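The permission setup described above can be illustrated with a minimal sketch, assuming a POSIX file system and a non-root user (root bypasses permission checks, so the read would succeed). The file name and helper functions are hypothetical, and the placeholder file is not a real program; the sketch only shows that mode `--x--x--x` blocks the read access that tools like strings or hexdump depend on.

```python
import os
import stat
import tempfile


def make_execute_only(path: str) -> int:
    """Set mode --x--x--x on path and return the resulting permission bits."""
    os.chmod(path, stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return stat.S_IMODE(os.stat(path).st_mode)


def can_read(path: str) -> bool:
    """Try to open the file for reading, as any inspection tool would need to."""
    try:
        with open(path, "rb") as handle:
            handle.read(1)
        return True
    except PermissionError:
        return False


with tempfile.TemporaryDirectory() as workdir:
    binary = os.path.join(workdir, "reference_binary")  # hypothetical name
    with open(binary, "wb") as handle:
        handle.write(b"placeholder bytes standing in for a compiled program")
    mode = make_execute_only(binary)
    readable = can_read(binary)
    print(oct(mode), readable)
```

Execute-only mode works for compiled binaries because the kernel loads them directly; an interpreted script would additionally need read permission so that its interpreter can open it.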

How is the leaderboard sorted? What's the primary metric?

The primary metric that should be reported for ProgramBench is the number of fully resolved instances. We currently report "almost resolved" (at least 95% of test cases pass) as an additional point of reference while the scores on our primary metric are low. The leaderboard is sorted by fully resolved first, almost resolved second, and average test pass rate third.

For a detailed understanding of model performance, we recommend the plot at the detailed leaderboard. See also: "Have you considered other metrics?"
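The three-level sort order described above can be sketched as a tuple sort key. The `Entry` type, the model names, and all numbers below are hypothetical and chosen only to show how ties are broken; they are not leaderboard results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    resolved: float         # fraction of tasks where every test passes
    almost_resolved: float  # fraction of tasks where at least 95% of tests pass
    mean_pass_rate: float   # average per-task test pass rate


def leaderboard_order(entries: list[Entry]) -> list[Entry]:
    """Sort by resolved, then almost resolved, then mean pass rate (all descending)."""
    return sorted(
        entries,
        key=lambda e: (e.resolved, e.almost_resolved, e.mean_pass_rate),
        reverse=True,
    )


# Illustrative values only.
entries = [
    Entry("model-a", resolved=0.000, almost_resolved=0.03, mean_pass_rate=0.40),
    Entry("model-b", resolved=0.000, almost_resolved=0.03, mean_pass_rate=0.55),
    Entry("model-c", resolved=0.005, almost_resolved=0.00, mean_pass_rate=0.20),
    Entry("model-d", resolved=0.000, almost_resolved=0.01, mean_pass_rate=0.70),
]

ranked = [entry.model for entry in leaderboard_order(entries)]
print(ranked)
```

In this sketch model-c ranks first despite the lowest mean pass rate, because any fully resolved instance outranks the secondary metrics, and model-d ranks last despite the highest mean pass rate.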

How do I submit to the leaderboard?

A public submission portal is coming soon. Follow the authors (John, Kilian) for updates.

Why do you not allow internet?

We have extensively studied different inference settings, including allowing internet access. We find that allowing internet access leads to an abundance of cheating, which requires an LM-as-a-judge to flag and disqualify solutions. This makes the benchmark less reliable, especially because defining exactly what cheating means in the context of obtaining source code online is not as clear-cut as it might seem.

However, except for instances that contained cheating, we did not observe a dramatic improvement of scores when allowing internet.

Find our ablations in section 4.1 of the paper and John's explanation.

Have you considered other metrics? E.g., average number of tests passed?

Yes, we've decided on our current resolved metric after a lot of thought. Our initial question was "Can LMs build programs from scratch?", and the most relevant metric is the fraction of programs that can be fully built. Reporting an average test pass rate would be extremely misleading, because every instance includes very simple tests (such as checking for the existence of flags, checking what happens if you call the executable with --help etc.).

We've also thought about using a more relaxed metric like "almost resolved". However, relaxing to ≥ 95% of tests solved, or even 99% of tests solved, is also problematic. First, for some of our tasks, we have almost 15k tests. Even 1% of that is still roughly 150 tests. Second, even a single failed test can indicate severe issues with a program. Therefore "almost resolved" only serves as additional orientation until the main "resolved" metric has enough signal to differentiate all models.

However, all auxiliary metrics are still useful for diagnosing and improving models and scaffolds! They're just not the right metric as a benchmark. Check the extended results for more information. See also: "How is the leaderboard sorted?"
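The arithmetic behind this answer can be made concrete with a small sketch. All task sizes and pass counts below are hypothetical; the first task assumes exactly 15,000 tests to mirror the "almost 15k" figure, so that a 99% pass rate still leaves 150 failing tests.

```python
def is_resolved(passed: int, total: int) -> bool:
    """Resolved means every test passes."""
    return passed == total


def is_almost_resolved(passed: int, total: int, percent: int = 95) -> bool:
    """Almost resolved means at least `percent` percent of tests pass.

    Integer arithmetic avoids floating-point edge cases at the threshold.
    """
    return passed * 100 >= total * percent


# (passed, total) per task; illustrative numbers only.
results = [(14_850, 15_000), (600, 1_000), (120, 400)]

failing_on_first = results[0][1] - results[0][0]
resolved_fraction = sum(is_resolved(p, t) for p, t in results) / len(results)
almost_fraction = sum(is_almost_resolved(p, t) for p, t in results) / len(results)
mean_pass_rate = sum(p / t for p, t in results) / len(results)

print(failing_on_first)
print(resolved_fraction, almost_fraction, round(mean_pass_rate, 2))
```

In this sketch the mean pass rate is 0.63 even though no task is resolved, which illustrates why an average over tests can look far more favorable than the fraction of programs that were actually rebuilt.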

- [junegunn/fzf](https://programbench.com/task/junegunn__fzf.b56d614/): 🌸 A command-line fuzzy finder. 79,721 stars. Best score: 82%.
- [jesseduffield/lazygit](https://programbench.com/task/jesseduffield__lazygit.1d0db51/): simple terminal UI for git commands. 76,901 stars. Best score: 56%.
- [BurntSushi/ripgrep](https://programbench.com/task/burntsushi__ripgrep.3b7fd44/): ripgrep recursively searches directories for a regex pattern while respecting your gitignore. 62,855 stars. Best score: 80%.
- [FFmpeg/FFmpeg](https://programbench.com/task/ffmpeg__ffmpeg.360a402/): Mirror of https://git.ffmpeg.org/ffmpeg.git. 59,217 stars. Best score: 5%.
- [sharkdp/bat](https://programbench.com/task/sharkdp__bat.f822bd0/): A cat(1) clone with wings. 58,487 stars. Best score: 33%.
- [typst/typst](https://programbench.com/task/typst__typst.88356d0/): A markup-based typesetting system that is powerful and easy to learn. 52,957 stars. Best score: 28%.
- [jgm/pandoc](https://programbench.com/task/jgm__pandoc.5caad90/): Universal markup converter. 43,632 stars. Best score: 14%.
- [sharkdp/fd](https://programbench.com/task/sharkdp__fd.40d8eb3/): A simple, fast and user-friendly alternative to 'find'. 42,668 stars. Best score: 78%.
- [php/php-src](https://programbench.com/task/php__php-src.c891263/): The PHP Interpreter. 40,030 stars. Best score: 5%.
- [duckdb/duckdb](https://programbench.com/task/duckdb__duckdb.bdb65ec/): DuckDB is an analytical in-process SQL database management system. 37,657 stars. cpp. Best score: 12%.
- [ajeetdsouza/zoxide](https://programbench.com/task/ajeetdsouza__zoxide.67ca1bc/): A smarter cd command. Supports all major shells. 35,994 stars. Best score: 76%.
- [jqlang/jq](https://programbench.com/task/jqlang__jq.b33a763/): Command-line JSON processor. 34,541 stars. Best score: 90%.
- [dandavison/delta](https://programbench.com/task/dandavison__delta.acd758f/): A syntax-highlighting pager for git, diff, grep, rg --json, and blame output. 30,445 stars. Best score: 37%.
- [sharkdp/hyperfine](https://programbench.com/task/sharkdp__hyperfine.327d5f4/): A command-line benchmarking tool. 27,960 stars. Best score: 54%.
- [ggreer/the_silver_searcher](https://programbench.com/task/ggreer__the_silver_searcher.a61f178/): A code-searching tool similar to ack, but faster. 27,080 stars. Best score: 59%.
- [facebook/zstd](https://programbench.com/task/facebook__zstd.1168da0/): Zstandard - Fast real-time compression algorithm. 27,013 stars. Best score: 69%.
- [facebookresearch/fastText](https://programbench.com/task/facebookresearch__fasttext.1142dc4/): Library for fast text representation and classification. 26,511 stars. cpp. Best score: 76%.
- [robertdavidgraham/masscan](https://programbench.com/task/robertdavidgraham__masscan.b99d433/): TCP port scanner, spews SYN packets asynchronously, scanning entire Internet in under 5 minutes. 25,544 stars. Best score: 57%.
- [tree-sitter/tree-sitter](https://programbench.com/task/tree-sitter__tree-sitter.5e23cca/): An incremental parsing system for programming tools. 24,953 stars. Best score: 37%.
- [FiloSottile/age](https://programbench.com/task/filosottile__age.706dfc1/): A simple, modern and secure encryption tool (and Go library) with small explicit keys, no config options, and UNIX-style composability. 22,077 stars. Best score: 63%.
- [rust-lang/mdBook](https://programbench.com/task/rust-lang__mdbook.37273ba/): Create book from markdown files. Like Gitbook but implemented in Rust. 21,541 stars. Best score: 55%.
- [jarun/nnn](https://programbench.com/task/jarun__nnn.cb2c535/): n³ The unorthodox terminal file manager. 21,506 stars. Best score: 98%.
- [antonmedv/fx](https://programbench.com/task/antonmedv__fx.86d0d34/): Terminal JSON viewer & processor. 20,433 stars. Best score: 76%.
- [mikefarah/yq](https://programbench.com/task/mikefarah__yq.602586d/): yq is a portable command-line YAML, JSON, XML, CSV, TOML, HCL and properties processor. 15,281 stars. Best score: 39%.
- [Y2Z/monolith](https://programbench.com/task/y2z__monolith.8702e66/): ⬛️ CLI tool and library for saving complete web pages as a single HTML file. 15,024 stars. Best score: 51%.
