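New study finds half of AI-generated code that passes an industry test would be rejected by real developers
A new study by the research organization METR finds that roughly half of the AI-generated code that passes the popular SWE-bench benchmark would be rejected by actual project maintainers.
Key points
The gap between benchmarks and real-world use
Even code that passes SWE-bench, a benchmark widely used to evaluate AI code generation, would go unadopted about half the time in real development settings, according to the study.
Who ran the study and how
The study was conducted by the research organization METR, which assessed the quality of AI-generated code by comparing benchmark results with the judgments of actual project maintainers.
A realistic assessment of AI code quality
The results suggest that output from current AI code generation tools can pass formal tests and still fall short in practical code review and maintainability.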
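Impact analysis
This finding could force a fundamental rethink of how AI code generation tools are evaluated. It exposes the limits of benchmark testing and points to the need for more practical, context-aware evaluation criteria, which is likely to influence how AI-assisted development tools evolve and are adopted.
Editorial comment
An important study that sounds the alarm on how heavily the evaluation of AI code generation tools leans on formal benchmarks. Developing more sophisticated evaluation criteria that account for real-world applicability is an urgent task.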

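According to a new study by the research organization METR, about half of the AI-written code solutions that pass the widely used SWE-bench benchmark would be rejected by actual project maintainers.
This article, "Study finds half of AI-generated code that passes an industry test would be rejected by real developers," first appeared on The Decoder.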
A new study by the research organization METR suggests that the popular coding benchmark SWE-bench Verified significantly overestimates the real-world performance of AI agents. Roughly half of the solutions rated as "passed" would get rejected by actual project maintainers.
SWE-bench Verified is (or was) one of the most important benchmarks for AI-assisted software engineering. It measures whether AI agents can solve real programming problems from open-source projects, with an automated tester checking whether submitted code changes pass the associated tests. Companies like Anthropic and OpenAI regularly cite the results to show off their models' progress.
An investigation by METR now raises serious questions about the benchmark's validity. The research team—Parker Whitfill, Cheryl Wu, Joel Becker, and Nate Rush—had four experienced developers who actively maintain three SWE-bench projects (scikit-learn, Sphinx, and pytest) review a total of 296 AI-generated code contributions. About half of the solutions that passed the automated tester would never get merged into the actual codebase by the project maintainers.
Automated tests only capture part of what makes code good
The study used AI-generated solutions from five models: Claude 3.5 Sonnet (Old), Claude 3.7 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, and GPT-5. The test runs came from Epoch AI's Benchmarking Hub. The maintainers didn't know whether a given solution came from a human or an AI.
To account for noise in human decision-making, the researchers also had the maintainers evaluate original human solutions that had actually shipped in the projects. Only 68 percent of these reference solutions got re-approved for adoption. All results were normalized to this baseline.
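The baseline normalization described above can be sketched in a few lines. The article does not spell out METR's exact procedure, so this assumes the simplest reading: divide each raw maintainer acceptance rate by the 68 percent re-approval rate of the human reference solutions. The function name and the 0.34 example value are hypothetical and are not figures from the study.

```python
# Assumption: "normalized to this baseline" means dividing by the human
# re-approval rate. METR's actual procedure may differ.
HUMAN_BASELINE = 0.68  # share of shipped human solutions re-approved (from the article)


def normalize_to_baseline(raw_rate: float, baseline: float = HUMAN_BASELINE) -> float:
    """Express a raw acceptance rate relative to the human baseline."""
    if not 0.0 < baseline <= 1.0:
        raise ValueError("baseline must be in (0, 1]")
    if not 0.0 <= raw_rate <= 1.0:
        raise ValueError("raw_rate must be in [0, 1]")
    return raw_rate / baseline


# Illustrative value only: a model whose patches maintainers accept 34% of
# the time would score half as well as the human reference solutions.
example_raw_rate = 0.34
example_normalized = normalize_to_baseline(example_raw_rate)
print(f"raw {example_raw_rate:.0%} -> normalized {example_normalized:.0%}")
```

Under this reading, a normalized score of 100 percent means "accepted as often as the human-written reference patches," not "accepted every time."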
On average, the maintainer adoption rate lands about 24 percentage points below the SWE-bench score. METR says this difference is statistically significant. The rate of improvement over time also comes in about 9.6 percentage points lower per year when measured by human assessments, though the researchers themselves note this trend result is statistically weaker.
Pass rates from the automated grader (orange) are consistently and significantly higher than actual acceptance rates by maintainers (blue) across all tested models. | Image: METR
Many rejections come down to basic functional errors, not just style
The maintainers sorted their rejections into three buckets: poor code quality (bad style or failure to follow project standards), damage to existing code, and basic functional errors. According to the METR report, "a meaningful chunk" of rejections came down to basic functional errors, meaning the AI agents didn't actually fix the underlying problem, even though the automated tests passed.
The jump from Claude 3.5 Sonnet to Claude 3.7 Sonnet produced significantly higher pass rates, but it also led to more cases where maintainers flagged basic functional errors. From Claude 3.7 to Claude 4 Opus, the issues shifted from "test failed" to "just poor code quality." Claude 4.5 Sonnet mainly improved code quality. GPT-5 performed significantly worse than the Anthropic models here, according to the study.
Time horizon analysis reveals a sevenfold overestimation
The researchers also ran a time horizon analysis that converts benchmark scores into the human completion time at which a model hits a 50 percent success rate. The gap is stark: Claude 4.5 Sonnet reaches a time horizon of about 50 minutes according to the automated checker, but only about 8 minutes when scored by maintainers. The study notes these estimates are less stable because SWE-bench Verified has a relatively narrow task-duration range, requiring extrapolation to estimate the 50% threshold.
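A rough sketch of how a 50 percent time horizon can be computed, for readers unfamiliar with the idea. It assumes a logistic curve fitted to success versus the logarithm of human completion time. The article does not describe METR's fitting details, and every number below is synthetic, chosen only to show a lenient and a strict grader scoring the same tasks.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fifty_percent_horizon(minutes, outcomes, lr=0.5, steps=5000):
    """Fit P(success) = sigmoid(a + b * (log2(minutes) - mean)) by gradient
    ascent on the log-likelihood, then return the task length in minutes
    at which the fitted success probability is 50%."""
    xs = [math.log2(m) for m in minutes]
    mean_x = sum(xs) / len(xs)
    xs = [x - mean_x for x in xs]  # centering speeds up convergence
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, outcomes):
            err = y - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    # The curve crosses 50% where a + b * x = 0, i.e. x = -a / b (centered).
    return 2 ** (mean_x - a / b)


def expand(successes_by_length, trials=4):
    """Turn {task_minutes: successes out of `trials`} into per-attempt lists."""
    minutes, outcomes = [], []
    for length, wins in successes_by_length.items():
        for i in range(trials):
            minutes.append(length)
            outcomes.append(1 if i < wins else 0)
    return minutes, outcomes


# Synthetic data (not from the study): the same attempts scored two ways.
AUTOMATED = {2: 4, 4: 4, 8: 4, 16: 3, 32: 2, 64: 1, 128: 0}
MAINTAINER = {2: 4, 4: 3, 8: 2, 16: 1, 32: 0, 64: 0, 128: 0}

auto_horizon = fifty_percent_horizon(*expand(AUTOMATED))
maintainer_horizon = fifty_percent_horizon(*expand(MAINTAINER))
print(f"automated grader: {auto_horizon:.1f} min")
print(f"maintainer review: {maintainer_horizon:.1f} min")
```

Because the strict grader rejects more of the longer tasks, its fitted curve crosses 50 percent at a shorter task length. When the real crossing point lies outside the range of task durations in the benchmark, the fit has to extrapolate, which is the instability the study flags.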
What METR describes here is something the field already knows: benchmarks are, at best, a proxy for real-world AI performance. Practical feedback matters more. Both sides of the argument exist; plenty of developers say AI agents genuinely help with coding, while others flag serious concerns about the quality of AI-generated code. The AI coding hype of the last four months or so also centers on newer models like GPT-5.2-Thinking and Claude Opus 4.5. These models weren't part of the test; the study only looked at weaker predecessors.
That said, if the results hold up, the gap is still remarkably large. SWE-bench Verified is considered the gold standard for evaluating AI coding agents. If half of the solutions it marks as "passed" can't survive a real human review, that undermines individual model comparisons and chips away at the foundation on which investors, companies, and developers are building their expectations for AI-assisted software development.
Important findings, but with clear limitations
The researchers are upfront that the study doesn't prove a fundamental capability ceiling for these models. They say it's "plausible" that better prompting strategies and more targeted instructions could narrow the gap between automated evaluation and human judgment.
The comparison also wasn't apples to apples. Human developers can typically respond to feedback and revise their code, but the AI agents only got one shot. The review setup wasn't fully realistic either: the maintainers didn't have automated testing tools available, the problems came from older project states, and testing requirements were deliberately relaxed to give the AI agents a better chance.
So the study isn't about the maximum capability of these models. What it does show is that taking benchmark scores at face value overestimates how useful AI agents actually are without additional tweaks and human feedback. The researchers suspect similar distortions show up in other benchmarks too.
Co-author Joel Becker, a researcher at METR, also puts his own results in perspective on X. His main takeaway isn't that AI agents are fundamentally useless. With AI capabilities doubling every three to six months, even two- to tenfold performance gaps get closed fast.
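Becker's point comes down to one line of arithmetic: if capability doubles every d months, a k-fold gap takes log2(k) × d months to close. The sketch below assumes steady exponential progress and a gap that stays fixed in the meantime. Both are simplifications for illustration, not claims from the study, and the function name is hypothetical.

```python
import math


def months_to_close_gap(gap_factor: float, doubling_months: float) -> float:
    """Months needed to close a gap_factor-fold gap at a constant doubling time."""
    if gap_factor < 1 or doubling_months <= 0:
        raise ValueError("need gap_factor >= 1 and doubling_months > 0")
    return math.log2(gap_factor) * doubling_months


# Best and worst cases from the ranges quoted above.
fastest = months_to_close_gap(2, 3)    # twofold gap, 3-month doubling: 3.0 months
slowest = months_to_close_gap(10, 6)   # tenfold gap, 6-month doubling: about 19.9 months
print(f"fastest: {fastest:.1f} months, slowest: {slowest:.1f} months")
```

Even the pessimistic end of that range is under two years, which is the sense in which such gaps "get closed fast."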