TLDR AI · June 30, 2026 09:00 · approx. 2 min read

RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades

#Agentic AI #Software Engineering #Benchmarking #LLM Evaluation

TL;DR

Researchers have released RoadmapBench, a new benchmark for evaluating how well AI agents carry out long-horizon software development tasks that span version upgrades.

AI In-Depth Analysis · July 1, 2026 00:05


Depth: 40%

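Key Points

Evaluating long-horizon, continuous development

While existing benchmarks focus on one-off tasks, this work evaluates complex development processes that unfold over time, such as version upgrades and code refactoring.

Clarifying the limits of agentic development

The benchmark provides a framework for quantitatively measuring whether agents based on large language models (LLMs) can maintain and extend software while staying consistent over long periods.

Improving real-world reliability

By testing sustained autonomy across a real development lifecycle rather than one-off code generation, the work aims to establish a reliability baseline for industrial adoption.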


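Impact Analysis

This news matters because it establishes a standard for evaluating whether AI agents can go beyond one-off code generation and operate autonomously over long periods in real development settings. It gives the research community and industry a common language for tackling "persistence," a new challenge in putting AI into practical use.

Editor's Comment

At a time when one-off code generation ability is what usually gets evaluated, the arrival of a metric for AI reliability in long-running development processes is an important step toward autonomous software engineers.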

Authors: Xinbo Xu, Ruihan Yang, Haiyang Shen, Wendong Xu, Bofei Gao, Ruoyu Wu, Kean Shi, Weichu Xie, Xuanzhong Chen, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang


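Abstract: Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To address this gap, we present RoadmapBench, a benchmark of 115 long-horizon coding tasks grounded in real open-source version upgrades across 17 repositories and 5 programming languages. Each task places the agent on a source-version code snapshot and provides a multi-target roadmap instruction requiring it to implement the functionality introduced in the target version, with a median modification of 3,700 lines across 51 files. We conduct a systematic evaluation on thirteen frontier models and find that even the strongest, Claude-Opus-4.7, resolves only 39.1% of tasks, while the weakest achieves merely 5.2%, in stark contrast to existing bug-fix benchmarks, suggesting that long-horizon software development remains a largely unsolved problem.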
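As a rough sanity check on the reported figures, the resolve rates can be converted back into task counts. This is only a sketch: it assumes each rate is the number of resolved tasks divided by the 115 tasks in the benchmark, which the abstract does not state explicitly, and the helper names are made up for illustration.

```python
# Back-of-the-envelope conversion between resolve rates and task counts.
# Assumption (not stated in the abstract): rate = resolved_tasks / 115.
TOTAL_TASKS = 115


def implied_task_count(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Return the whole number of tasks closest to rate_percent of total."""
    return round(rate_percent / 100 * total)


def rate_from_count(count: int, total: int = TOTAL_TASKS) -> float:
    """Return count / total as a percentage rounded to one decimal place."""
    return round(count / total * 100, 1)


strongest = implied_task_count(39.1)  # rate reported for Claude-Opus-4.7
weakest = implied_task_count(5.2)     # rate reported for the weakest model

print(f"strongest: {strongest} tasks ({rate_from_count(strongest)}%)")
print(f"weakest:   {weakest} tasks ({rate_from_count(weakest)}%)")
```

Under that assumption, the two rates correspond to about 45 and 6 of the 115 tasks, and both counts round back to the percentages given in the abstract.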

Comments: 30 pages, 15 figures

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Cite as: arXiv:2605.15846 [cs.SE]

(or arXiv:2605.15846v2 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2605.15846

arXiv-issued DOI via DataCite

Submission history

From: Xinbo Xu

[v1] Fri, 15 May 2026 11:00:33 UTC (4,936 KB)

[v2] Tue, 19 May 2026 08:10:44 UTC (4,935 KB)

