The Decoder · March 8, 2026, 16:15 · approx. 1 min read

Study finds AI agent benchmarks skew toward coding and ignore 92% of the US labor market

#AIAgents #Benchmarks #LaborMarket #Research #EvaluationBias #GeneralPurposeAI

TL;DR

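A large-scale study finds that AI agent development and evaluation are almost entirely dominated by programming tasks, while the other work domains, which make up 92% of the US labor market, are ignored.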

AI deep analysis · March 8, 2026, 17:40


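Key points

Evaluation bias

AI agent benchmarks focus almost entirely on programming tasks, and evaluation of other work domains is severely lacking.

Mismatch with the labor market

This bias is sharply at odds with the fact that work other than programming makes up 92% of the US labor market.

Scale of the study

The study is large in scale, comparing 43 agent benchmarks totaling 72,342 tasks against the US labor market, and it identifies the bias as a trend across the field.

A warning about the direction of development

The findings suggest that AI agent development may not reflect the diverse labor needs of the real world, and they call for a rethink of its direction.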


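Impact analysis

This article highlights that AI agent development is skewed toward one technical domain and fails to capture the diverse labor needs of the real world. It raises the need to revisit the industry's evaluation standards and research direction, and could encourage the development of more general-purpose, practical AI agents.

Editorial comment

An article that sharply points out how AI development is skewed toward evaluating what it "can do" while neglecting application to what society "needs." It urges the industry to break out of a self-satisfied development cycle.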


This article, "AI agent benchmarks obsess over coding while ignoring 92% of the US labor market, study finds," first appeared on The Decoder.


A large-scale study shows that AI agent development focuses almost entirely on programming tasks, ignoring the vast majority of the labor market.

Researchers from Carnegie Mellon University and Stanford University systematically compared 43 agent benchmarks totaling 72,342 tasks against the US labor market. They mapped benchmark tasks to 1,016 real occupations using the US government's O*NET database, which catalogs work activities at multiple levels of detail.

The researchers mapped each benchmark task to both occupational domains and skills in the O*NET taxonomy. | Image: Wang et al.

The study paints a lopsided picture. Current agent development targets the computer and math domain almost exclusively, a field dominated by programming that makes up just 7.6 percent of total US employment.

Highly digitized industries get almost zero benchmark coverage

The analysis turned up several work areas that are heavily digitized but barely show up in existing benchmarks. Management has a digitization rate of 88 percent, yet it represents just 1.4 percent of all benchmark tasks analyzed. Legal work (70 percent digital) accounts for 0.3 percent, and architecture and engineering (71 percent digital) for a mere 0.7 percent.

Computer and math dominates benchmark development with over 8,600 tasks, even though economically bigger domains like management and law employ far more people and generate more capital. | Image: Wang et al.

The researchers argue that AI agents could deliver near-term productivity gains in exactly these areas. But these domains also come with specific technical challenges, including ambiguous goals and results that can only be verified over long stretches of time.
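The digitization and benchmark-share figures quoted above can be put side by side with a few lines of Python. This is only an illustrative sketch: the ratio computed here (digitization rate divided by benchmark share) is an assumption made for this example and is not a metric from the study, and the names `DOMAINS`, `coverage_gap`, and `rank_by_gap` are hypothetical.

```python
# Figures as quoted in the article, in percent:
# (share of work that is digitized, share of all analyzed benchmark tasks)
DOMAINS = {
    "Management": (88.0, 1.4),
    "Legal": (70.0, 0.3),
    "Architecture and engineering": (71.0, 0.7),
}


def coverage_gap(digitized_pct: float, benchmark_pct: float) -> float:
    """Illustrative ratio (an assumption, not from the study):
    points of digitization per point of benchmark share."""
    if benchmark_pct <= 0:
        raise ValueError("benchmark share must be positive")
    return digitized_pct / benchmark_pct


def rank_by_gap(domains: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Sort domains from largest to smallest illustrative gap."""
    scored = [(name, coverage_gap(d, b)) for name, (d, b) in domains.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, gap in rank_by_gap(DOMAINS):
        print(f"{name}: {gap:.0f}x")
```

Under this rough ratio, legal work comes out as the most underrepresented of the three domains, followed by architecture and engineering, then management.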

The study also identifies an economic blind spot. When looking at capital distribution, meaning total income per professional field, the most economically valuable areas like management and law remain underrepresented in benchmarks. At the same time, poorly paid, labor-intensive fields like personal services and care barely get a look either, the researchers write.

Agents cover less than five percent of the skills workers actually need

The imbalance runs just as deep at the individual skill level, according to the study. The researchers built a taxonomy that breaks professional skills into four categories: information intake, mental processes, interaction with others, and work outcomes. In the real world, the required skills spread fairly evenly across all four.

Agent benchmarks, though, zero in on just two: "Getting Information" and "Working with Computers." Together, these cover less than five percent of US employment. The "Interacting with Others" category, which touches a huge share of real-world jobs, barely registers in the benchmarks at all, the researchers found.

A detailed skill breakdown shows "Working with Computers" and "Getting Information" account for the bulk of benchmark tasks, while many high-employment skills like "Communicating with Supervisors" are barely covered. | Image: Wang et al.

The researchers attribute this bias to methodological convenience. Domains where it's easy to write task instructions and check results get disproportionate attention. While this has driven fast progress in niche areas, the team warns it risks steering agent development away from the fields where the social and economic payoff would be biggest.

The researchers single out OpenAI's GDPval benchmark as a positive example. Despite its relatively small size, it covers the widest range of professional domains and skills. OpenAI specifically designed the 2025 benchmark to measure how AI agents affect real knowledge work across different fields.

Agent autonomy falls off a cliff as tasks get harder

To gauge how autonomously AI agents actually operate within the work areas they cover, the researchers developed a quantifiable measure of autonomy. They define it as the maximum task complexity an agent can handle at a given success rate, with complexity measured by the number of steps in a hierarchical workflow.

The researchers translate low-level agent actions like mouse clicks into hierarchical workflow steps to measure task complexity. | Image: Wang et al.
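The autonomy definition above (the maximum task complexity an agent handles at a given success rate) can be sketched in Python. This is a minimal reading of the definition, not the study's implementation: it assumes each run is recorded as a pair of workflow step count and success flag, and it takes the highest complexity level whose own success rate meets the target. The function name `autonomy_level` and the sample data are hypothetical.

```python
from collections import defaultdict


def autonomy_level(results, target_success_rate):
    """Return the highest complexity whose success rate meets the target.

    results: iterable of (complexity, succeeded) pairs, where complexity is
    the number of steps in the task's workflow (an assumed data layout).
    Returns 0 if no complexity level meets the target.
    """
    by_complexity = defaultdict(list)
    for complexity, succeeded in results:
        by_complexity[complexity].append(bool(succeeded))

    qualifying = [
        complexity
        for complexity, outcomes in by_complexity.items()
        if sum(outcomes) / len(outcomes) >= target_success_rate
    ]
    return max(qualifying, default=0)


if __name__ == "__main__":
    # Made-up runs for illustration only: (workflow steps, succeeded)
    runs = [(1, True), (1, True), (2, True), (2, False), (3, False), (3, False)]
    print(autonomy_level(runs, 0.5))
    print(autonomy_level(runs, 0.9))
```

A stricter variant could require every lower complexity level to meet the target as well; the article does not say which reading the researchers use.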

Even in software development, the most heavily represented field, the study shows success rates dropping sharply as tasks get more complex. According to the findings, agents do best on independent activities like mental processes and producing work products but struggle with finding and retrieving information and coordinating with others, even on relatively simple tasks.

The few benchmarks that allow controlled comparisons, like SWE-bench, show the OpenHands framework outperforming SWE-agent and Claude beating GPT, especially on medium-complexity tasks. However, the researchers caution that these trends don't necessarily hold at other complexity levels and call for broader publication of agent trajectories to enable more systematic comparisons.

Agent success rates drop as task complexity rises across all domains and skill categories. In controlled comparisons on SWE-bench, OpenHands outperforms SWE-agent, and Claude outperforms GPT. | Image: Wang et al.

What better benchmarks should look like

Based on their findings, the researchers lay out three design principles for future benchmarks. First, new benchmarks should specifically target underrepresented but highly digitized domains like management and law, or aim for broad coverage across domains and skills.

Second, the team argues that benchmarks need to get more realistic and complex. Their analysis found that many automatically generated benchmarks only capture simplified fragments of real work. By contrast, human-created tasks, like those in the GDPval or TheAgentCompany benchmarks, span diverse domains and skills. When automated generation is needed for scale, the researchers say task creation should reflect realistic domain and skill compositions.

Most benchmarks cover only a handful of work domains per task. GDPval and TheAgentCompany stand out as exceptions, with comparatively broad domain and capability coverage. | Image: Wang et al.

Third, the researchers push for more granular evaluation. Simply measuring whether an agent finished a task misses where exactly it broke down, they argue. Instead, they suggest automatically deriving workflows from human demonstrations to create intermediate checkpoints that give a more detailed picture of agent performance. The study provides a framework and supporting resources to help benchmark designers spot gaps in work coverage, agent developers pinpoint areas for improvement, and users pick the right level of autonomy for their specific tasks.
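The idea of intermediate checkpoints can be illustrated with a short Python sketch. It assumes a workflow has already been reduced to an ordered list of checkpoint names and that we know which ones the agent passed; how the study derives workflows from human demonstrations is not shown here. The function `checkpoint_report` and the checkpoint names are hypothetical.

```python
def checkpoint_report(checkpoints, passed):
    """Summarize where an agent broke down along an ordered workflow.

    checkpoints: ordered list of checkpoint names (assumed to be given).
    passed: set of checkpoint names the agent satisfied.
    """
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")

    # First checkpoint in workflow order that the agent did not satisfy.
    first_failure = next(
        (name for name in checkpoints if name not in passed), None
    )
    # Count every satisfied checkpoint, including any after the first failure.
    completed = sum(1 for name in checkpoints if name in passed)

    return {
        "completed_fraction": completed / len(checkpoints),
        "first_failure": first_failure,
        "task_success": first_failure is None,
    }


if __name__ == "__main__":
    # Hypothetical workflow for illustration only.
    steps = ["locate_file", "apply_edit", "run_tests"]
    print(checkpoint_report(steps, {"locate_file"}))
```

Compared with a single pass/fail flag, a report like this records how far the agent got and which step it failed first.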

These findings line up with real-world usage patterns. A recent Anthropic analysis based on millions of human-agent interactions found that software development accounts for nearly 50 percent of all agent tool calls through the public API, while other industries clock in at just a few percentage points each. Anthropic described the current moment as the "early days of agent adoption."

A late 2025 study by UC Berkeley and partners reached a similar conclusion: in practice, companies mostly use AI agents as simple, tightly controlled tools with few autonomous steps. The biggest hurdle, according to that study, remains system reliability.
