Hugging Face Blog · July 1, 2026, 03:32 · about 9 min read

ScarfBench: Benchmarking AI Agents on Enterprise Java Framework Migration

#Reasoning #Agent #Hugging Face #Java #Benchmark

TL;DR

A post on the Hugging Face Blog introduces ScarfBench, a dedicated benchmark for evaluating how well AI agents carry out Java framework migration tasks in enterprise settings.

AI deep analysis · July 1, 2026, 04:04


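Key points

ScarfBench: announcement and purpose

ScarfBench is a new benchmark, introduced on the Hugging Face Blog, that measures how accurately and efficiently AI agents can migrate complex enterprise Java applications.

Focus on real-world tasks

Rather than plain code generation, it targets realistic Java framework migration scenarios, including the tangled dependencies and legacy code found in real business systems.

Quantifying AI agent capability

The benchmark makes it possible to evaluate and compare objectively how well current AI agents can autonomously define a problem, plan, and execute (reasoning and planning).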


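Impact analysis

The announcement is an important milestone: it makes visible how generative AI is evolving from simple code completion toward autonomous agents that carry out complex system migrations. Because it lets organizations quantify the feasibility and risk of bringing AI into legacy-system renewal, it could have a significant influence on decisions made by development teams.

Editorial comment

Being able to evaluate "understanding and acting on complex context", the hardest problem in putting AI agents to practical use, on concrete business tasks is genuinely groundbreaking.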


Modernizing enterprise applications is one of the largest and most expensive software engineering activities organizations undertake. Teams migrate applications across frameworks to improve maintainability, cloud readiness, developer productivity, and access to modern capabilities.

Recent advances in coding agents have sparked excitement around AI-assisted modernization. But an important question remains:

Can AI agents reliably modernize real-world enterprise applications?

Existing software engineering benchmarks have demonstrated impressive progress in bug fixing and code generation, but framework migration presents a fundamentally different challenge. Success requires not only translating code, but also preserving behavior, adapting build systems, and navigating runtime dependencies.

To address this gap, we introduce ScarfBench (Self-Contained Application Refactoring Benchmark), an open benchmark for evaluating AI agents on cross-framework migration tasks in Enterprise Java.

ScarfBench focuses on migrations across three major Java ecosystems:

Spring

Jakarta EE

Quarkus

Unlike traditional benchmarks that compare generated code against reference implementations, ScarfBench evaluates whether migrated applications actually build, deploy, and preserve behavior.

Why Migration Is Hard

Framework migration is much more than replacing annotations.

A simple repository migration can require changes across dependency injection, persistence configuration, queries, and framework descriptors. Small mistakes in any of these pieces can prevent successful deployment.
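A minimal Python sketch can show why substituting annotations is not enough. The two mappings below are common Spring-to-Jakarta correspondences picked for illustration; they are an assumption of this example, not ScarfBench's method, and the sample source and function names are hypothetical.

```python
import re

# Illustrative correspondences only (an assumption, not an official or complete mapping).
IMPORT_MAP = {
    "org.springframework.beans.factory.annotation.Autowired": "jakarta.inject.Inject",
    "org.springframework.stereotype.Service": "jakarta.enterprise.context.ApplicationScoped",
}
ANNOTATION_MAP = {"@Autowired": "@Inject", "@Service": "@ApplicationScoped"}

# Hypothetical Spring source used as input.
SPRING_SOURCE = """\
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.data.jpa.repository.JpaRepository;

interface OrderRepository extends JpaRepository<Order, Long> {}

@Service
class OrderService {
    @Autowired
    OrderRepository orders;
}
"""


def naive_rewrite(source: str) -> str:
    """Swap known imports and annotations by plain text substitution."""
    for old, new in IMPORT_MAP.items():
        source = source.replace(old, new)
    for old, new in ANNOTATION_MAP.items():
        # \b keeps @Service from matching a longer annotation name.
        source = re.sub(re.escape(old) + r"\b", new, source)
    return source


def leftover_spring_imports(source: str) -> list[str]:
    """List Spring imports that the substitution did not translate."""
    return re.findall(r"^import (org\.springframework\.[\w.]+);", source, flags=re.M)


migrated = naive_rewrite(SPRING_SOURCE)
print(leftover_spring_imports(migrated))
```

The repository interface is untouched by the rewrite because the mapping does not cover it, so the persistence layer, its configuration, and its queries would still need a real migration.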

Figure: Spring → Jakarta Migration Example

Framework migration requires translating framework semantics, not just source code.

Introducing ScarfBench

ScarfBench provides a systematic way to evaluate AI agents on enterprise Java framework migration tasks.

Applications are required to:

Build successfully.

Deploy correctly.

Pass behavioral validation.

This provides a much more realistic measure of modernization quality.
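One way to picture the three requirements is as ordered gates, where a later stage only runs if the earlier ones passed. The sketch below assumes that reading; the stage names, `StageResult`, and `evaluate` are hypothetical and do not describe the actual ScarfBench harness.

```python
from dataclasses import dataclass
from typing import Callable, Optional

STAGES = ("build", "deploy", "behavior")


@dataclass
class StageResult:
    passed: dict[str, bool]      # stage name -> outcome, only for stages that ran
    failed_at: Optional[str]     # first failing stage, or None if all passed


def evaluate(checks: dict[str, Callable[[], bool]]) -> StageResult:
    """Run the stages in order and stop at the first failure."""
    passed: dict[str, bool] = {}
    for stage in STAGES:
        ok = checks[stage]()
        passed[stage] = ok
        if not ok:
            return StageResult(passed, failed_at=stage)
    return StageResult(passed, failed_at=None)


# Hypothetical outcome: the application compiles but fails to start.
result = evaluate({
    "build": lambda: True,
    "deploy": lambda: False,
    "behavior": lambda: True,  # never runs, because deployment failed first
})
print(result.failed_at, result.passed)
```

In a real setup each callable would wrap a build command, a container start, or a test run; here they are stubs so the control flow stays visible.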

Benchmark at a Glance

Metric                      Value
Applications                34
Framework implementations   102
Migration tasks             204
Lines of code               ~151K
Source and test files       ~2,000
Expert-written tests        1,331

ScarfBench includes both focused migration tasks and whole-application migrations.

Figure: ScarfBench Construction Pipeline

Starting from a JSR-based enterprise Java taxonomy, expert migrations create verified implementations across Spring, Jakarta EE, and Quarkus.
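The headline counts are consistent with a simple reading, which is an inference from the figures rather than something the post states: each of the 34 applications exists in all three frameworks, and each implementation can be migrated to either of the other two.

```python
from itertools import permutations

FRAMEWORKS = ("Spring", "Jakarta EE", "Quarkus")
APPLICATIONS = 34  # from the benchmark summary

# Ordered (source, target) pairs with source != target.
pairs = list(permutations(FRAMEWORKS, 2))

implementations = APPLICATIONS * len(FRAMEWORKS)
migration_tasks = APPLICATIONS * len(pairs)

print(len(pairs), implementations, migration_tasks)  # 6 102 204
```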

How Do Frontier Agents Perform?

We evaluated several state-of-the-art coding agents on ScarfBench.

Despite strong performance on traditional software engineering benchmarks, framework migration remains difficult. Success rates vary considerably across framework pairs, and whole-application migrations remain particularly challenging.

Figure: Current Leaderboard

Even the strongest current agents achieve less than 10% behavioral success, illustrating the gap between generating compilable code and preserving application behavior.

Figure: Compile → Deploy → Test Progression

Compile success consistently exceeds deploy success, which in turn exceeds behavioral success. Build success alone significantly overestimates migration quality.
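The ordering follows naturally if each stage is only counted when every earlier stage passed. A small sketch of that funnel, using made-up outcomes (not ScarfBench data) and a hypothetical `funnel_rates` helper:

```python
def funnel_rates(outcomes):
    """Fraction of runs reaching each stage; a stage counts only if all earlier ones passed."""
    total = len(outcomes)
    compiled = [o for o in outcomes if o["compile"]]
    deployed = [o for o in compiled if o["deploy"]]
    behaved = [o for o in deployed if o["tests"]]
    return {
        "compile": len(compiled) / total,
        "deploy": len(deployed) / total,
        "behavior": len(behaved) / total,
    }


# Hypothetical outcomes for five migration runs.
runs = [
    {"compile": True, "deploy": True, "tests": True},
    {"compile": True, "deploy": True, "tests": False},
    {"compile": True, "deploy": False, "tests": False},
    {"compile": True, "deploy": False, "tests": False},
    {"compile": False, "deploy": False, "tests": False},
]
rates = funnel_rates(runs)
print(rates)  # {'compile': 0.8, 'deploy': 0.4, 'behavior': 0.2}
```

Reporting only the first number would describe these runs as 80% successful, while only one in five preserves behavior.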

Figure: Migration Outcomes by Target Framework

Migration difficulty depends strongly on the target framework, with Jakarta EE proving particularly challenging.

What We Learned About AI Agents for Java Modernization

Beyond measuring success rates, ScarfBench helps us understand how agents behave during modernization.

Can Agents Reliably Tell When a Migration Is Complete?

A migrated application is only useful if it actually builds and runs.

We therefore compared agent-reported outcomes against independent build verification.

Finding: Agents Are Overconfident

Claude Code reported successful builds for 29 out of 30 whole applications.

Only 22 of those applications actually built successfully.

Meanwhile, the single application classified as failed by the agent ultimately built correctly.

This suggests that agent self-assessment should not be treated as a reliable signal of migration completion.

Independent build and test validation remains essential.
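The reported numbers can be turned into a few rates. The arithmetic below uses only the figures quoted above; the variable names are ours.

```python
reported_success = 29      # agent said the build succeeded
verified_of_reported = 22  # of those, builds confirmed independently
reported_failure = 1       # agent said the build failed
verified_of_failed = 1     # that one application did build

total = reported_success + reported_failure
false_success = reported_success - verified_of_reported                      # 7
actually_built = verified_of_reported + verified_of_failed                   # 23
agreement = verified_of_reported + (reported_failure - verified_of_failed)   # 22

print(f"claimed build rate:       {reported_success / total:.1%}")
print(f"verified build rate:      {actually_built / total:.1%}")
print(f"success-claim precision:  {verified_of_reported / reported_success:.1%}")
print(f"agent/verifier agreement: {agreement / total:.1%}")
```

Seven of the 29 success claims were wrong, and the one failure claim was wrong too, so the agent's verdict matched the independent check for 22 of 30 applications.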

How Do Agents Navigate Application Dependencies?

Framework migrations rarely affect a single file or layer.

Changes in configuration, services, databases, and web components often cascade across the application.

Finding: Migration Is Iterative Rather Than Linear

The most frequently visited layers were:

Configuration

Web

Database

Service

Common transitions included:

Configuration ↔ Web

Service ↔ Database

This suggests that migration is an iterative dependency-resolution process rather than a simple source-to-source transformation.
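As a rough illustration of how such layer statistics could be derived, the sketch below counts visits and back-and-forth transitions in a sequence of layers. The trace is invented for the example, and how the authors actually mapped agent actions to layers is not described here.

```python
from collections import Counter

# Hypothetical sequence of layers an agent touched, in order.
trace = ["config", "web", "config", "service", "database",
         "service", "database", "config", "web"]

visits = Counter(trace)

# Count moves between different layers, ignoring direction (A <-> B).
transitions = Counter(
    frozenset((a, b)) for a, b in zip(trace, trace[1:]) if a != b
)

print(visits.most_common())
for pair, n in transitions.most_common():
    print(" <-> ".join(sorted(pair)), n)
```

In this invented trace the config/web and service/database pairs each occur three times, the kind of pattern a purely linear, layer-by-layer migration would not produce.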

Where Do Agents Spend Most of Their Effort?

We used layer revisit frequency as a proxy for migration effort. Layers that required repeated visits typically involved debugging, dependency resolution, or framework adaptation.

Finding: Configuration Dominates Migration Effort

Rather than proceeding linearly, agents repeatedly returned to configuration-related artifacts while resolving framework differences and dependency issues.
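One plausible way to compute revisit frequency, assuming a revisit means returning to a layer after working somewhere else, is sketched below. The definition, the `revisit_counts` helper, and the trace are assumptions for illustration.

```python
from collections import Counter
from itertools import groupby


def revisit_counts(trace):
    """Count returns to a layer after the agent has moved elsewhere."""
    # Collapse consecutive steps in the same layer into a single visit.
    visits = [layer for layer, _ in groupby(trace)]
    counts = Counter(visits)
    # The first visit to a layer is not a revisit.
    return {layer: n - 1 for layer, n in counts.items()}


# Hypothetical trace: two consecutive config steps count as one visit.
trace = ["config", "config", "web", "config", "service",
         "database", "service", "config", "web"]
revisits = revisit_counts(trace)
print(max(revisits, key=revisits.get), revisits)
```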

What Challenges Are Not About Code Transformation?

Not every migration issue originates from source code.

Finding: Environment and Tooling Matter

Agents frequently struggled with environmental issues, including:

Docker cache inconsistencies

Port connectivity problems

Maven wrapper and build tooling issues

These operational concerns often delayed validation even when the source-code migration itself was largely complete.
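Some of these problems can be caught before validation starts. The following is a minimal preflight sketch using only the Python standard library; the function names are hypothetical, the executable check assumes a POSIX system, and it is not part of ScarfBench.

```python
import os
import socket
from pathlib import Path


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


def maven_wrapper_ready(project_dir: str) -> bool:
    """Return True if ./mvnw exists and is executable (POSIX check)."""
    wrapper = Path(project_dir) / "mvnw"
    return wrapper.is_file() and os.access(wrapper, os.X_OK)


def preflight(project_dir: str, port: int) -> list[str]:
    """Collect environment problems to fix before validating a migration."""
    problems = []
    if port_in_use(port):
        problems.append(f"port {port} is already in use")
    if not maven_wrapper_ready(project_dir):
        problems.append("mvnw is missing or not executable")
    return problems
```

Calling `preflight("path/to/app", 8080)` returns an empty list when neither problem is present. Stale Docker layers are harder to detect from outside; rebuilding with `docker build --no-cache` is one common way to rule them out.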

Figure: Failure Mode Distribution

Modernization failures span build systems, deployment environments, dependency injection, databases, endpoints, assertions, and infrastructure.

Key Takeaway

The biggest challenge in framework modernization is not translating Java code.

It is managing the web of dependencies across configuration, infrastructure, and runtime environments.

While frontier agents can automate substantial portions of the migration process, reliable validation and architectural reasoning remain critical for achieving successful outcomes.

ScarfBench helps expose these challenges and provides a standardized way to measure progress toward truly autonomous application modernization.

Explore ScarfBench

ScarfBench is designed as an open resource for researchers and practitioners.

Resources include:

Benchmark dataset

Evaluation infrastructure

Public leaderboard

Documentation

Open-source code

Researchers can compare agent architectures and techniques. Practitioners can use ScarfBench to evaluate modernization solutions before deploying them in production environments.

Website

https://scarfbench.info

Dataset

https://huggingface.co/datasets/ibm-research/ScarfBench

Space

https://huggingface.co/spaces/ibm-research/ScarfBench

GitHub Repository

https://github.com/scarfbench/scarfbench

Leaderboard

https://scarfbench.info/leaderboard

Paper

https://arxiv.org/abs/2605.06754

Framework migration remains one of the largest unsolved problems in AI-assisted software engineering. We hope ScarfBench helps the community measure progress and accelerate the next generation of AI-assisted application modernization.

We invite researchers, practitioners, and framework communities to evaluate their agents, contribute new migration scenarios, and help advance the state of the art.


