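ARC-AGI-3 offers $2 million for AI that matches untrained humans, but every frontier model scores below 1 percent
The new ARC-AGI-3 benchmark offers a $2 million prize for AI that matches the performance of untrained humans, yet every current frontier model scores below 1 percent, showing that the road to general intelligence in AI remains long.
Key points
A large prize and a harsh reality
ARC-AGI-3 offers $2 million for AI that performs on par with untrained humans, but no current frontier model has reached even 1 percent.
Evaluation in interactive game environments
The benchmark places AI systems in interactive game environments and evaluates them on tasks that humans solve easily.
Stripping away AI's biggest advantages
The benchmark tries to measure genuine general intelligence by removing the biggest advantages of current AI systems, such as large-scale pretraining data.
The limits of frontier models
The fact that no frontier model breaks the 1 percent mark shows that current AI, while strong on specific tasks, still falls short of human-like general problem solving.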
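Impact analysis
The article makes clear that current AI technology excels at specific tasks but has not yet reached human-like general intelligence. Benchmarks like ARC-AGI-3 could redefine the direction of AI research and become important milestones on the way to truly general intelligence.
Editorial comment
At a time when much coverage celebrates AI progress, this article soberly lays out the limits of current technology and is a reminder that the road to AGI remains long. The novelty of the benchmark's design is especially notable.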

This article, "ARC-AGI-3 offers $2 million for AI on par with untrained humans, but all frontier models remain below 1 percent", was first published on The Decoder.
The new ARC-AGI-3 benchmark drops AI systems into interactive game environments that humans solve with ease. No frontier model breaks the 1 percent mark because the benchmark strips away their biggest advantages.
The ARC Prize Foundation has released ARC-AGI-3, a new benchmark that tests AI systems in interactive, turn-based game environments. Unlike its predecessors, which had models derive static patterns from input-output pairs, ARC-AGI-3 requires AI agents to explore environments on their own, form hypotheses, figure out objectives, and execute plans, all without any instructions or hints about the goal. An early version was previewed in summer 2025.
According to the accompanying technical report, all 135 environments were solved by humans with no prior knowledge and no instructions. Every frontier model tested, meanwhile, scored below 1 percent: Gemini 3.1 Pro Preview hit 0.37 percent, GPT 5.4 reached 0.26 percent, Opus 4.6 managed 0.25 percent, and Grok-4.20 scored 0.00 percent.
One important caveat: machines and humans aren't measured on the same scale.
Squared efficiency penalizes brute force, not just wrong answers
ARC-AGI-3 uses a metric called RHAE (Relative Human Action Efficiency). Instead of simply checking whether a model solves a task, it measures how many actions the model needs compared to a human. Only interactions that actually change the game state count as actions; internal computation steps and reasoning chains don't factor in. Because of this metric, scores on ARC-AGI-3 can't be compared with scores on ARC-AGI-1 or ARC-AGI-2.
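The action-counting rule can be illustrated with a minimal Python sketch. It assumes a run is recorded as a list of observed game states (the initial state plus one state after each interaction); the function name `count_actions` and this recording format are hypothetical and not taken from the ARC-AGI-3 tooling.

```python
def count_actions(states):
    """Count interactions that changed the game state.

    `states` is a hypothetical recording: the initial state followed by
    the state observed after each interaction. An interaction is counted
    only if the state after it differs from the state before it.
    """
    return sum(1 for prev, cur in zip(states, states[1:]) if cur != prev)


# Four interactions were attempted, but only two changed the state.
print(count_actions(["a", "a", "b", "b", "c"]))
```

Under this assumption, an agent's internal reasoning never appears in the recording at all, so it cannot affect the count.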
The human baseline is the second-best performer out of ten first-time players per environment. According to the scoring documentation, the top player is deliberately excluded to filter out outlier performances while still maintaining a realistic reference for competent human play. Efficiency is calculated per level using a squared formula: (human actions / AI actions)^2. So if a human needs 10 actions and the AI needs 100, the AI doesn't get 10 percent; it gets just 1 percent. This squared penalty is designed to devalue brute-force strategies. Being faster than the human earns no bonus either, since the per-level score caps at 1.0. Later levels carry more weight because they require deeper understanding. For cost reasons, the team plans to limit the maximum number of attempts for agents to five times the human attempt count.
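The scoring rules described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Foundation's implementation: the function names are hypothetical, "second-best" is assumed to mean the second-lowest action count, and the per-level weights are passed in by the caller because the article does not give their values.

```python
def human_baseline(action_counts):
    """Return the second-lowest action count among the human players.

    Assumption: fewer actions means a better performance, so the
    second-best player is the second entry after sorting ascending.
    """
    if len(action_counts) < 2:
        raise ValueError("need at least two human runs")
    return sorted(action_counts)[1]


def level_score(human_actions, ai_actions):
    """Squared efficiency ratio for one level, capped at 1.0."""
    if ai_actions <= 0:
        raise ValueError("ai_actions must be positive")
    return min(1.0, (human_actions / ai_actions) ** 2)


def weighted_score(level_scores, weights):
    """Weighted mean of per-level scores; the weights are hypothetical."""
    if len(level_scores) != len(weights):
        raise ValueError("one weight per level is required")
    return sum(s * w for s, w in zip(level_scores, weights)) / sum(weights)


# The article's example: 10 human actions vs. 100 AI actions gives about 0.01.
print(level_score(10, 100))
# Beating the human earns no bonus: the score is capped at 1.0.
print(level_score(10, 5))
```

The cap and the square work in opposite directions: an agent gains nothing by being faster than the baseline, but loses quadratically for every multiple of extra actions it spends.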
Custom scaffolding helps on known tasks but proves nothing about general intelligence
The official leaderboard only tests models via API without custom-built scaffolding (harnesses), using an identical system prompt for all of them.
The ARC Prize Foundation explains this choice in the paper: the benchmark is designed to measure the general intelligence of the AI model itself, not the human intelligence that went into building a task-specific system. A truly AGI-capable system shouldn't need external help to tackle new tasks.
Testing with Duke University revealed a striking pattern: Opus 4.6 scored 97.1 percent on a known environment using a hand-crafted harness, but dropped to 0 percent on an unfamiliar one.
This shows that perceiving the game environment and handling the API format aren't the bottleneck; custom-built strategies simply don't transfer to unseen environments. ARC Prize co-founder François Chollet argues on X that true AGI shouldn't need task-specific human guidance, especially when ordinary humans can handle the same tasks without any help.
There is still a separate community leaderboard for harness-driven results, where scores are self-reported. The Foundation explicitly warns against interpreting these results as evidence of AGI progress. However, it expects the best ideas from harness research to eventually make their way into the models themselves, much like chain-of-thought prompting started as an external technique and eventually became a built-in feature in OpenAI's o1.
On X, Chollet pushes back on the objection that the low scores are simply an artifact of missing harnesses and a basic prompt. The G in AGI stands for "general," he argues, and general intelligence doesn't mean being specifically trained for a wide range of tasks. It means facing any new task and solving it independently. If ordinary humans can do that without instructions or tools, there's no reason AGI should need special handholding and hand-crafted prompts.
Chollet sees only two positions here: either you believe AGI is possible, in which case a true AGI system will eventually solve ARC-AGI-3 because normal humans can too, or you believe AI is merely an automation tool that will always need human intervention for every new task.
Recordings of the latest model tests can be viewed on the ARC Prize website.
Previous ARC benchmarks predicted key AI breakthroughs before they happened
The predecessor benchmarks have repeatedly flagged turning points in AI development over the past few years. ARC-AGI-1 was likely the first benchmark to precisely identify the breakthrough of frontier AI reasoning systems like OpenAI's o3 at a time when other benchmarks were already saturated.
ARC-AGI-2 then captured the rapid progress of modern reasoning models and the rise of scaffolding, which has since been deployed in production coding tools like Claude Code and Codex. By now, both benchmarks are effectively saturated, largely thanks to these scaffolding approaches.
ARC-AGI-1 and ARC-AGI-2 are now saturated, but getting there required real breakthroughs in AI capabilities.
ARC-AGI-3 aims to measure the next open gap: agentic intelligence, the ability to navigate completely unfamiliar environments without specific training. The fact that every current frontier model scores below 1 percent shows, according to the creators, just how far AI systems remain from human-like adaptability.
The ARC Prize Foundation has made 25 environments publicly available to play and is running the ARC Prize 2026 on Kaggle with $2 million in prize money.