AI Snake Oil · July 4, 2024, 01:00 · approx. 11 min read

New paper: AI agents that matter

#LLM #AI Agents #Evaluation Metrics #Benchmarking

TL;DR

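Researchers at Princeton University have published a paper arguing that current evaluation practices for AI agents encourage optimizing for benchmarks at the expense of real-world usefulness, and calling for more rigorous definitions and evaluation methods.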

AI deep analysis · May 3, 2026, 05:09

Depth: 40%

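Key points

Challenges in agent evaluation and the term's drift into a buzzword

The term "AI agent" lacks a clear definition and is overused as a marketing term, and current evaluation practices encourage systems that score well on benchmarks without being useful in practice.

Redefining agency as a spectrum

The paper treats being an agent as a spectrum rather than a binary, and identifies three clusters of properties that make a system more agentic: environment and goals, user interface and supervision, and system design.

Recommendations for building useful agents

The authors call for correcting research's tilt toward benchmark performance and for better evaluation standards, in service of the field's goal: assistants like Siri or Alexa that actually work, handling complex tasks and performing reliably.

Better reliability is essential for AI agents to succeed

Even where LLMs are capable enough, an agent carrying out a complex task chains many calls, and the individual failure probabilities compound until the overall system is unusable, so research on reliability is a priority.

Lax evaluation practices are feeding the hype

As in the early days of machine learning research, the absence of rigorous evaluation standards is fueling hype and overoptimism about AI agents.

Why cost control and optimization matter

Because language models are stochastic, simply calling a model multiple times can outperform complex agent architectures, so evaluations must control for cost.

Preventing shortcuts on benchmarks

To keep agents from overfitting to a benchmark by taking shortcuts, evaluations need hold-out samples matched to the level of generality being targeted.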

Key quotes

The North Star of this field is to build assistants like Siri or Alexa and get them to actually work

This state of affairs encourages the development of agents that do well on benchmarks without being useful in practice.

Rather than a binary, it can be seen as a spectrum, sometimes denoted by the term 'agentic'.

If each of those went wrong independently with a probability of, say, just 2%, the overall system would be so unreliable as to be completely useless

research is itself contributing to hype and overoptimism because evaluation practices are not rigorous enough

We show that such simple tricks can outperform complex agent architectures on the HumanEval benchmark, while costing much less.

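Impact analysis

This post is an important warning about benchmark gaming and hype in the AI agent field, and it highlights the need for the industry to move toward evaluation metrics that reflect practical usefulness. Because the recommendations come from researchers at Princeton University, they may help redirect future research and product development toward greater real-world reliability.

Editorial comment

It is a sharp look at whether "agent" has become a mere buzzword and whether these systems are useful in practice. For an industry that swings with every benchmark score, the paper makes the case for rebuilding evaluation standards from the ground up.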

Some of the most exciting applications of large language models involve taking real-world action, such as booking flight tickets or finding and fixing software bugs. AI systems that carry out such tasks are called agents. They use LLMs in combination with other software to use tools such as web search and code terminals.

The North Star of this field is to build assistants like Siri or Alexa and get them to actually work — handle complex tasks, accurately interpret users’ requests, and perform reliably. But this is far from a reality, and even the research direction is fairly new. To stimulate the development of agents and measure their effectiveness, researchers have created benchmark datasets. But as we’ve said before, LLM evaluation is a minefield, and it turns out that agent evaluation has a bunch of additional pitfalls that affect today’s benchmarks and evaluation practices. This state of affairs encourages the development of agents that do well on benchmarks without being useful in practice.

We have released a new paper that identifies the challenges in evaluating agents and proposes ways to address them. Read the paper here. The authors are Sayash Kapoor, Benedikt Ströbl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan, all at Princeton University.

In this post, we offer thoughts on the definition of AI agents, why we are cautiously optimistic about the future of AI agent research, whether AI agents are more hype or substance, and give a brief overview of the paper.

What does the term agent mean? Is it just a buzzword?

The term agent has been used by AI researchers without a formal definition.¹ This has led to its being hijacked as a marketing term, and has generated a bit of pushback against its use. But the term isn’t meaningless. Many researchers have tried to formalize the community's intuitive understanding of what constitutes an agent in the context of language-model-based systems [1, 2, 3, 4, 5]. Rather than a binary, it can be seen as a spectrum, sometimes denoted by the term 'agentic'.

The five recent definitions of AI agents cited above are all distinct but with strong similarities to each other. Rather than propose a new definition, we identified three clusters of properties that cause an AI system to be considered more agentic according to existing definitions:

Environment and goals. The more complex the environment, the more AI systems operating in that environment are agentic. Complex environments are those that have a range of tasks and domains, multiple stakeholders, a long time horizon to take action, and unexpected changes. Further, systems that pursue complex goals without being instructed on how to pursue the goal are more agentic.

User interface and supervision. AI systems that can be instructed in natural language and act autonomously on the user's behalf are more agentic. In particular, systems that require less user supervision are more agentic. For example, chatbots cannot take real-world action, but adding plugins to chatbots (such as Zapier for ChatGPT) allows them to take some actions on behalf of users.

System design. Systems that use tools (like web search or code terminal) or planning (like reflecting on previous outputs or decomposing goals into subgoals) are more agentic. Systems whose control flow is driven by an LLM, rather than LLMs being invoked by a static program, are more agentic.

Do agents even work?

While some agents such as ChatGPT’s code interpreter / data analysis mode have been useful, more ambitious agent-based products so far have failed. The two main product launches based on AI agents have been the Rabbit R1 and Humane AI pin. These devices promised to eliminate or reduce phone dependence, but turned out to be too slow and unreliable. Devin, an “AI software engineer”, was announced with great hype 4 months ago, but has been panned in a video review and remains in waitlist-only mode. It is clear that if AI agents are to be useful in real-world products, they have a long way to go.

So are AI agents all hype? It’s too early to tell. We think there are research challenges to be solved before we can expect agents such as the ones above to work well enough to be widely adopted. The only way to find out is through more research, so we do think research on AI agents is worthwhile.

One major research challenge is reliability — LLMs are already capable enough to do many tasks that people want an assistant to handle, but not reliable enough that they can be successful products. To appreciate why, think of a flight-booking agent that needs to make dozens of calls to LLMs. If each of those went wrong independently with a probability of, say, just 2%, the overall system would be so unreliable as to be completely useless (this partly explains some of the product failures we’ve seen). So research on improving reliability might have many new applications even if the underlying language models don’t improve. And if scaling runs out, agents are the most natural direction for further progress in AI.
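The compounding effect described above can be checked with a few lines of arithmetic. The sketch below assumes, as the post does, that each call fails independently with probability 2%. The specific call counts are illustrative assumptions, since the post only says "dozens," and `chain_success_probability` is a hypothetical helper name.

```python
def chain_success_probability(per_call_failure: float, num_calls: int) -> float:
    """Probability that all of num_calls independent calls succeed."""
    return (1.0 - per_call_failure) ** num_calls


# Illustrative call counts (assumed, not taken from the post).
for num_calls in (12, 24, 36, 48):
    p = chain_success_probability(0.02, num_calls)
    print(f"{num_calls:>2} calls at 2% failure each: {p:.1%} end-to-end success")
```

Under these assumptions, an agent that needs three dozen calls completes a task end to end slightly less than half the time.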

Right now, however, research is itself contributing to hype and overoptimism because evaluation practices are not rigorous enough, much like the early days of machine learning research before the common task method took hold. That brings us to our paper.

Contributions of the paper

What changes must the AI community implement to help stimulate the development of AI agents that are useful in the real world, and not just on benchmarks? This is the paper’s central question. We make five recommendations:

Implement cost-controlled evaluations. The language models underlying most AI agents are stochastic. This means simply calling the underlying model multiple times can increase accuracy. We show that such simple tricks can outperform complex agent architectures on the HumanEval benchmark, while costing much less. We argue that all agent evaluation must control for cost. (We originally published this finding here. In the two months since we published this post, Pareto curves and joint optimization of cost and accuracy have become increasingly common in agent evaluations.)

Jointly optimize accuracy and cost. Visualizing evaluation results as a Pareto curve of accuracy and inference cost opens up a new space of agent design: jointly optimizing the two metrics. We show how we can lower cost while maintaining accuracy on HotPotQA by implementing a modification to the DSPy framework.

Distinguish model and downstream benchmarking. Through a case study of NovelQA, we show how benchmarks meant for model evaluation can be misleading when used for downstream evaluation. We argue that downstream evaluation should account for dollar costs, rather than proxies for cost such as the number of model parameters.

Prevent shortcuts in agent benchmarks. We show that many types of overfitting to agent benchmarks are possible. We identify 4 levels of generality of agents and argue that different types of hold-out samples are needed based on the desired level of generality. Without proper hold-outs, agent developers can take shortcuts, even unintentionally. We illustrate this with a case study of the WebArena benchmark.

Improve the standardization and reproducibility of agent benchmarks. We found pervasive shortcomings in the reproducibility of WebArena and HumanEval evaluations. These errors inflate accuracy estimates and lead to overoptimism about agent capabilities.
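The first two recommendations can be illustrated with a small sketch. It is not the paper's code: the agent names, costs, and accuracies are made-up placeholders, `accuracy_with_retries` assumes attempts are independent and that a correct attempt can be recognised (for example by running tests), and `pareto_frontier` is a hypothetical helper that keeps only agents not beaten on both cost and accuracy.

```python
from typing import NamedTuple


class AgentResult(NamedTuple):
    name: str
    cost_usd: float  # hypothetical mean inference cost per task
    accuracy: float  # fraction of benchmark tasks solved


def accuracy_with_retries(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries is correct.

    Assumes a correct try can be identified, e.g. by running tests.
    """
    return 1.0 - (1.0 - p_single) ** attempts


def pareto_frontier(results: list[AgentResult]) -> list[AgentResult]:
    """Keep agents that no cheaper-or-equal-cost agent matches or beats on accuracy."""
    # Cheapest first; at equal cost, the more accurate agent comes first.
    ordered = sorted(results, key=lambda r: (r.cost_usd, -r.accuracy))
    frontier: list[AgentResult] = []
    best_accuracy = float("-inf")
    for result in ordered:
        if result.accuracy > best_accuracy:
            frontier.append(result)
            best_accuracy = result.accuracy
    return frontier


# Placeholder numbers for illustration only; not results from the paper.
results = [
    AgentResult("baseline", 0.05, 0.70),
    AgentResult("retry_x5", 0.25, 0.85),
    AgentResult("complex_agent_a", 1.50, 0.84),
    AgentResult("complex_agent_b", 2.00, 0.88),
]

for agent in pareto_frontier(results):
    print(f"{agent.name}: ${agent.cost_usd:.2f} per task, {agent.accuracy:.0%} accuracy")
```

In this made-up data, `complex_agent_a` drops off the frontier because `retry_x5` is both cheaper and more accurate. That is the kind of comparison an accuracy-only leaderboard hides.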

Concluding thoughts: reasons for cautious optimism

AI agent benchmarking is new and best practices haven't yet been established, making it hard to distinguish genuine advances from hype. We think agents are sufficiently different from models that benchmarking practices need to be rethought. In our paper, we take the first steps toward a principled approach to agent benchmarking. We hope these steps will raise the rigor of AI agent evaluation and provide a firm foundation for progress.

A different strand of our research concerns the reproducibility crisis in ML-based research in scientific fields such as medicine or social science. At some level, our current paper is similar. In ML-based science, our outlook is that things will get worse before they get better. But in AI agents research, we are cautiously optimistic that practices will change quickly. One reason is that there is a stronger culture of sharing code and data alongside published papers, so errors are easier to spot. (This culture shift came about due to concerted efforts in the last five years.) Another reason is that overoptimistic research quickly gets a reality check when products based on misleading evaluations end up flopping. This is going to be an interesting space to watch over the next few years, both in terms of research and product releases.

¹ In traditional AI, agents are defined entities that perceive and act upon their environment, but that definition is less useful in the LLM era: even a thermostat would qualify as an agent under that definition.
