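Hugging Face Blog · April 15, 2026 21:07 · about a 5-minute read

Inside VAKRA: Agent Reasoning, Tool Use, and Failure Modes

#AIAgents #Benchmark #ToolUse #Reasoning #APIChaining #EnterpriseAI

TL;DR

The VAKRA benchmark, published by IBM Research on the Hugging Face blog, is a comprehensive framework for evaluating the reasoning and tool-use abilities of AI agents in an executable environment that combines more than 8,000 APIs with document collections. The article also analyzes the limits and failure modes of current models.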

AI in-depth analysis · April 16, 2026 03:44


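Key points

A practical evaluation environment for agents

VAKRA provides an executable environment with more than 8,000 local APIs across 62 domains, backed by real databases and domain-aligned document collections, enabling evaluation that goes beyond traditional single-skill tests.

Measuring compositional reasoning

Tasks that require 3-7 step reasoning chains evaluate the ability to combine structured API interaction with unstructured retrieval under natural-language tool-use constraints.

Limited performance of current models

The article reports that models perform poorly on VAKRA and analyzes the failure modes observed on different tasks.

Concrete evaluation of API chaining

The API-chaining capability comprises 2,077 test instances built on business intelligence APIs and requires workflows that chain 1-12 tool calls.

Specialized tools in the SEL-BIRD collection

SEL-BIRD extends the generic data-manipulation tool set of SLOT-BIRD, introducing specialized tools such as sort_data_ascending and sort_data_descending by flattening categorical arguments into separate functions.

API selection in the REST-BIRD collection

Across 1,597 instances in 17 domains, agents must select the appropriate endpoint from a domain-specific REST API tool set; each domain contains 116 tools on average.

Handling the OpenAI API tool-list limit

To cope with the API specification's limit of 128 tools, the baseline agent implements a shortlisting mechanism that manages the length of the tool list.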


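Impact analysis

The article marks an important advance in evaluation methodology for AI agents, encouraging a shift from simple task solving to complex real-world workflows. For enterprise AI adoption, it offers clear evaluation criteria for building reliable agents and helps close the gap between research and implementation.

Editorial comment

The benchmark is an important milestone toward practical AI agents and signals a turn from flashy demonstrations to evaluation of real reliability. Its focus on applicability in enterprise settings adds to its practical value.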


Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents


We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments.

Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.

VAKRA provides an executable environment where agents interact with more than 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks can require 3-7 step reasoning chains that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints.

As can be seen below, models perform poorly on VAKRA. In this blog, we include additional dataset details about the tasks in VAKRA and present an analysis of the failure modes we observed on different tasks.

Task Description

As shown below, the VAKRA benchmark comprises four tasks, each testing a different set of capabilities.

Fig 1: Representative examples of each capability in the VAKRA benchmark

Capability 1: API Chaining using Business Intelligence APIs

This capability includes 2,077 test instances across 54 domains, requiring the use of tools from the SLOT-BIRD and SEL-BIRD collections (Elder et al., 2026). Compared to the setup in Elder et al., the tool universe in SLOT-BIRD and SEL-BIRD is expanded through the inclusion of a larger number of domains. Each domain is restricted to one tool collection, and tasks involve chaining 1–12 tool calls to arrive at the final answer.

{
  "query": "Which football team has a build-up play speed of 31, build-up play dribbling of 53, and build-up play passing of 32?",
  "tool_calls": [
    {"name": "get_data", "arguments": {"tool_universe_id": "486ea46224d1-aeb8037c5e78"}, "label": "retrieved_data_1"},
    {"name": "select_data_equal_to", "arguments": {"data_label": "retrieved_data_1", "key_name": "play_speed", "value": 31}, "label": "FILTERED_DF_0"},
    {"name": "select_data_equal_to", "arguments": {"data_label": "FILTERED_DF_0", "key_name": "play_dribble", "value": 53}, "label": "FILTERED_DF_1"},
    {"name": "select_data_equal_to", "arguments": {"data_label": "FILTERED_DF_1", "key_name": "play_passing", "value": 32}, "label": "FILTERED_DF_2"},
    {"name": "get_team_name", "arguments": {"data_label": "FILTERED_DF_2", "n": 1}}
  ],
  "answer": "FC Barcelona"
}

Fig 2: Data sample from SEL-BIRD collection
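To make the chaining in the sample concrete, here is a minimal Python sketch of how such a labelled tool-call sequence could be replayed. It is an illustration under stated assumptions, not the benchmark's MCP server: the in-memory RECORDS list stands in for the data source (its two rows reuse the values shown in the data preview), and run_chain is a hypothetical helper.

```python
from typing import Any

# Hypothetical in-memory stand-in for the data source behind one tool_universe_id.
RECORDS = [
    {"team_name": "FC Barcelona", "play_speed": 31, "play_dribble": 53, "play_passing": 32},
    {"team_name": "Manchester City", "play_speed": 40, "play_dribble": 30, "play_passing": 16},
]


def run_chain(tool_calls: list[dict[str, Any]]) -> Any:
    """Replay a chain of tool calls, storing each labelled result for later steps."""
    store: dict[str, Any] = {}
    result: Any = None
    for call in tool_calls:
        name, args = call["name"], call["arguments"]
        if name == "get_data":
            result = list(RECORDS)
        elif name == "select_data_equal_to":
            rows = store[args["data_label"]]
            result = [r for r in rows if r[args["key_name"]] == args["value"]]
        elif name == "get_team_name":
            rows = store[args["data_label"]]
            result = [r["team_name"] for r in rows[: args["n"]]]
        else:
            raise ValueError(f"unknown tool: {name}")
        if "label" in call:  # later calls refer to earlier outputs by label
            store[call["label"]] = result
    return result


calls = [
    {"name": "get_data", "arguments": {"tool_universe_id": "486ea46224d1-aeb8037c5e78"}, "label": "retrieved_data_1"},
    {"name": "select_data_equal_to", "arguments": {"data_label": "retrieved_data_1", "key_name": "play_speed", "value": 31}, "label": "FILTERED_DF_0"},
    {"name": "select_data_equal_to", "arguments": {"data_label": "FILTERED_DF_0", "key_name": "play_dribble", "value": 53}, "label": "FILTERED_DF_1"},
    {"name": "select_data_equal_to", "arguments": {"data_label": "FILTERED_DF_1", "key_name": "play_passing", "value": 32}, "label": "FILTERED_DF_2"},
    {"name": "get_team_name", "arguments": {"data_label": "FILTERED_DF_2", "n": 1}},
]
answer = run_chain(calls)
print(answer)  # ['FC Barcelona']
```

The point of the labels is that each step consumes the output of an earlier step by name, so a wrong label or a skipped filter propagates to the final answer.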

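As shown above, each instance has an associated JSON data source from which the answer must be derived. The MCP servers supporting this task include a special tool, get_data(tool_universe_id=id), which returns a preview of that data source (Fig 3).

The SLOT-BIRD collection provides a global set of 7 tools for generic data manipulation (e.g., filtering, sorting), inspired by systems like Tableau and Google Analytics. The SEL-BIRD collection extends this by introducing more specialized tools: some are shared with SLOT-BIRD, while others are derived by flattening categorical arguments into separate functions (e.g., sort_data, sort_data_ascending, sort_data_descending).

{
  "handle": "retrieved_data_1",
  "num_records": 2,
  "key_details": [
    {"name": "team_name", "dtype": "str", "first_3_values": ["FC Barcelona", "Manchester City"]},
    {"name": "play_speed", "dtype": "int32", "first_3_values": [31, 40]},
    {"name": "play_dribble", "dtype": "int32", "first_3_values": [53, 30]},
    {"name": "play_passing", "dtype": "int32", "first_3_values": [32, 16]}
  ]
}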

Fig 3: Data preview obtained from get_data function
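The idea of flattening a categorical argument into separate functions can be sketched as follows. This is a hedged illustration, not the SEL-BIRD generator: the signature of sort_data, the argument name order, and the helper flatten_categorical are assumptions made for the example.

```python
from typing import Any, Callable

Row = dict[str, Any]


def sort_data(rows: list[Row], key_name: str, order: str) -> list[Row]:
    """Generic tool: the sort direction is a categorical argument."""
    if order not in ("ascending", "descending"):
        raise ValueError(f"unsupported order: {order}")
    return sorted(rows, key=lambda r: r[key_name], reverse=(order == "descending"))


def flatten_categorical(
    tool: Callable[..., list[Row]], arg_name: str, values: list[str]
) -> dict[str, Callable[[list[Row], str], list[Row]]]:
    """Derive one specialized tool per categorical value, named <tool>_<value>."""

    def bind(value: str) -> Callable[[list[Row], str], list[Row]]:
        def specialized(rows: list[Row], key_name: str) -> list[Row]:
            return tool(rows, key_name, **{arg_name: value})

        return specialized

    return {f"{tool.__name__}_{value}": bind(value) for value in values}


sel_tools = flatten_categorical(sort_data, "order", ["ascending", "descending"])
rows = [
    {"team_name": "FC Barcelona", "play_speed": 31},
    {"team_name": "Manchester City", "play_speed": 40},
]
fastest_first = sel_tools["sort_data_descending"](rows, "play_speed")
print(sorted(sel_tools))               # ['sort_data_ascending', 'sort_data_descending']
print(fastest_first[0]["team_name"])   # Manchester City
```

The trade-off is visible even in this toy version: each derived tool has one argument fewer to fill, but the agent now has more tool names to choose from.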

https://huggingface.co/blog/ibm-research/vakra-benchmark-analysis/

Capability 2: Tool Selection using Dashboard APIs

This capability includes 1,597 instances across 17 domains, requiring tools from an expanded REST-BIRD collection (Elder et al.). These tools use endpoint-style interfaces: highly specific, query-aligned endpoints that encapsulate most of the computation. They are served as REST APIs running in a FastAPI server, which is wrapped by the MCP server. This task requires selecting the correct APIs from the domain-specific tool set (as shown in the example in Figure 1). Each domain contains between 6 and 328 tools (116 on average). As in the previous task, the get_data tool is available.

The OpenAI API Specification restricts the tool list input to a maximum length of 128 tools. This restriction requires an agent builder using this API to manage the length of the tool list directly via a shortlisting mechanism. In the baseline agents in our repository, a simple shortlisting capability handles this challenge.
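One simple way to stay under a 128-tool limit is to rank tools by lexical overlap with the query and keep the top entries. The sketch below is an assumption for illustration only; it is not the shortlisting code in the VAKRA repository, and the tool dictionaries, the always_keep parameter, and the scoring rule are all hypothetical.

```python
import re

MAX_TOOLS = 128  # tool-list limit stated above


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; underscores in tool names act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist(
    query: str,
    tools: list[dict[str, str]],
    limit: int = MAX_TOOLS,
    always_keep: tuple[str, ...] = ("get_data",),
) -> list[dict[str, str]]:
    """Keep pinned tools, then the tools whose name/description best overlap the query."""
    query_tokens = tokenize(query)
    pinned = [t for t in tools if t["name"] in always_keep]
    rest = [t for t in tools if t["name"] not in always_keep]

    def score(tool: dict[str, str]) -> int:
        return len(query_tokens & tokenize(tool["name"] + " " + tool["description"]))

    ranked = sorted(rest, key=score, reverse=True)  # stable: ties keep original order
    return (pinned + ranked)[:limit]


# Synthetic domain with 328 tools, the largest domain size mentioned above.
tools = [{"name": f"get_metric_{i}", "description": f"Returns metric {i}"} for i in range(326)]
tools.append({"name": "get_data", "description": "Initialize the data source"})
tools.append({"name": "get_team_by_play_speed", "description": "Find teams by build-up play speed"})

selected = shortlist("Which team has a build-up play speed of 31?", tools)
print(len(selected), selected[0]["name"], selected[1]["name"])
```

A lexical ranker like this is cheap but brittle; an embedding-based ranker would be a natural alternative when tool descriptions paraphrase the query.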

Capability 3: Multi-Hop Reasoning using Dashboard APIs

The Capability 3 segment of the benchmark has 869 test instances drawn from 38 subject domains. These instances again rely on the REST-BIRD API collection but add multi-hop reasoning to the challenge (see the example in Figure 1). Multi-hop questions require multiple pieces of supporting evidence to be extracted and combined to reach an answer. The instances in this section require between one and five logical hops to answer a query. The distribution of question types for queries in the test dataset is shown in Figure 4 below.

Fig 4: API Hop-Type distribution for Capability 3 (MultiHop) and Hybrid Hop-Type distribution for Capability 4 (MultiHop MultiSource Reasoning)

Capability 4: Multi-Hop, Multi-Source Reasoning and Policy Adherence

Capability 4 includes 644 instances across 41 domains and is also built on the REST-BIRD API collection. Figure 4 above shows a distribution of hybrid hops for test queries without policies. It contains the most complex queries with the following characteristics:

Multi-Source: This segment adds document indices per domain. Queries in this capability could require information from these document indexes as well as API calls. Similar to Capability 3, this task also has Multi-Hop queries. The required information source applies at the per-hop level, so, for example, a question may entail three logical hops with sources: API - RAG (Document Retrieval) - API. To enforce correct reasoning, sources are decontaminated during data generation, i.e. information required for a given hop is available in only one source. For example, if a hop is to be answered using APIs, the document index is built by removing documents that likely contain the information needed to answer the question.

Multi-Turn: This segment of the dataset also adds multi-turn conversations to the setting. Each instance is a dialog with multiple turns. The data is released as context-response pairs, where the context encodes the current dialog history and the agent is only responsible for answering the current turn.

Tool-usage Policies: A subset of these instances includes tool-use policies that the agent is required to follow. These policies take the form of plain-text instructions about the knowledge sources that the agent is allowed to access and under which circumstances. For example:

If a user's query pertains to Technology & Software, which is/are about Topics focusing on codebases, software platforms, applications, and user interactions in tech, make sure you try answering them by only using document retrievers. Do not use other types of tools.

The baseline agent in the project repo imposes adherence to these policies through a simple addition to the prompt: "You are a helpful assistant with access to tools.\n Tool Usage Constraint: {additional_instructions}."
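The prompt addition quoted above can be written as a small helper. This is a sketch only: build_system_prompt is a hypothetical name, and whether the trailing period belongs to the template or to the surrounding sentence is an assumption here.

```python
from typing import Optional

BASE_PROMPT = "You are a helpful assistant with access to tools."


def build_system_prompt(additional_instructions: Optional[str] = None) -> str:
    """Append the tool-usage policy to the base prompt when one is given."""
    if not additional_instructions:
        return BASE_PROMPT
    # Assumption: the template ends with a period, so strip one from the policy first.
    policy = additional_instructions.strip().rstrip(".")
    return f"{BASE_PROMPT}\n Tool Usage Constraint: {policy}."


policy_prompt = build_system_prompt("Do not use other types of tools.")
print(policy_prompt)
```

Instances without a policy simply receive the base prompt, which keeps policy and non-policy runs comparable.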

Evaluation Framework

VAKRA evaluates agents in tool environments where success depends on both the ability to execute coherent, multi-step workflows and answer correctness. We introduce an execution-centric evaluation framework that assesses not only final outputs but also the full tool-execution trajectory that includes tool calls, inputs, and intermediate results.

Evaluation Metric

The VAKRA Evaluator operates over two key inputs for each sample: a predicted final response and the corresponding tool-call trajectory. The tool calls from the predicted trajectory are executed in the same environment as the ground truth to verify intermediate tool outputs.

The evaluation follows a waterfall-style pipeline (Figure 6), where later stages are conditioned on earlier success:

For Capability 4 tasks, policy adherence is first verified programmatically (this step is not applied to other capabilities).

The predicted tool call sequence is then compared against the ground truth sequence.

Only samples with valid trajectories proceed to final response evaluation.

Fig 6: Waterfall-style Evaluation Pipeline
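The waterfall structure can be sketched as a sequence of gates in which a sample stops at the first stage it fails. The three check functions are placeholders passed in by the caller; the actual VAKRA checks (programmatic policy verification, trajectory comparison, LLM-based judging) are not reproduced here, and the verdict strings are assumptions.

```python
from typing import Any, Callable

Sample = dict[str, Any]
Check = Callable[[Sample], bool]


def evaluate_sample(
    sample: Sample, policy_ok: Check, trajectory_ok: Check, response_ok: Check
) -> str:
    """Run the stages in order; later stages are only reached if earlier ones pass."""
    # The policy gate applies to Capability 4 only.
    if sample["capability"] == 4 and not policy_ok(sample):
        return "policy_violation"
    if not trajectory_ok(sample):
        return "invalid_trajectory"
    if not response_ok(sample):
        return "incorrect_response"
    return "pass"


def always(sample: Sample) -> bool:
    return True


def never(sample: Sample) -> bool:
    return False


print(evaluate_sample({"capability": 4}, never, always, always))  # policy_violation
print(evaluate_sample({"capability": 2}, never, always, always))  # pass (policy gate skipped)
print(evaluate_sample({"capability": 3}, always, never, always))  # invalid_trajectory
```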

Tool-Sequence Comparison

Because the environment is executable, agents can explore it and sometimes reach the answer by invoking a different set of APIs than the ones we identified. To support alternative but valid tool invocations and reasoning paths, correctness is assessed by executing each predicted tool call and comparing the set of tool responses against those from the ground truth, rather than enforcing strict step-level matching.

Specifically, we first perform a programmatic check, verifying whether all information present in the ground-truth tool responses is recovered by the predicted tool responses. This check may be inconclusive in cases involving partial matches, semantic equivalence, or differences in representation (e.g., ordering, aggregation, or formatting). In such cases, we apply a secondary LLM-based evaluation, with a prompt adapted from the CRAG framework (Yang et al., 2024), to determine whether the predicted trajectory retrieves all required information despite structural differences, even if it was obtained through a different sequence of tool calls.
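A minimal sketch of the programmatic part of this check: flatten both sets of tool responses into normalized scalar values and test whether every ground-truth value appears among the predicted ones. The normalization rule and the verdict strings are assumptions, and the function names are hypothetical; anything that is not an exact match after normalization is deferred to a secondary review rather than marked wrong.

```python
from typing import Any


def atoms(value: Any) -> set[str]:
    """Flatten a nested tool response into a set of normalized scalar strings."""
    if isinstance(value, dict):
        return set().union(*(atoms(v) for v in value.values()))
    if isinstance(value, (list, tuple, set)):
        return set().union(*(atoms(v) for v in value))
    return {str(value).strip().lower()}


def compare_responses(ground_truth: list[Any], predicted: list[Any]) -> str:
    """Order-insensitive recovery check over all tool responses in a trajectory."""
    if atoms(ground_truth) <= atoms(predicted):
        return "recovered"
    return "needs_llm_review"  # inconclusive: formatting or semantic differences


gt = [[{"team_name": "FC Barcelona"}]]
pred_other_path = [{"teams": ["fc barcelona", "Manchester City"]}]
pred_formatting = [{"average_speed": "31"}]

print(compare_responses(gt, pred_other_path))                          # recovered
print(compare_responses([{"average_speed": 31.0}], pred_formatting))   # needs_llm_review
```

The second call shows why a fallback is needed: 31.0 and "31" carry the same information but do not match as strings.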

Final Response Evaluation

For trajectories that pass the previous check, the final response is evaluated using an LLM-based judge. This step ensures that the response is (i) grounded in the predicted tool outputs, and (ii) factually consistent with the ground truth answer, accounting for potential variations in phrasing or structure.

This design ensures that agents are rewarded not only for producing correct answers, but for obtaining them through valid and complete reasoning processes.

Every capability is equally weighted to obtain the final leaderboard score.

To obtain a capability score, every sample within a capability is equally weighted for capabilities 1 through 3.

For capability 4, heterogeneous queries are weighted higher.
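The aggregation described here can be sketched as a weighted mean per capability followed by an unweighted mean across capabilities. The capability 4 weights are not given in this text, so the value 2.0 used for a heterogeneous query below is purely hypothetical, as are the function names and sample scores.

```python
from typing import Optional


def capability_score(sample_scores: list[float], weights: Optional[list[float]] = None) -> float:
    """Weighted mean of per-sample scores; equal weights when none are given."""
    if weights is None:
        weights = [1.0] * len(sample_scores)
    return sum(w * s for w, s in zip(weights, sample_scores)) / sum(weights)


def leaderboard_score(capability_scores: dict[int, float]) -> float:
    """Every capability contributes equally to the final score."""
    return sum(capability_scores.values()) / len(capability_scores)


cap1 = capability_score([1.0, 0.0])                              # equal weights
cap4 = capability_score([1.0, 0.0, 1.0], weights=[1.0, 1.0, 2.0])  # hypothetical weights
print(cap1, cap4, leaderboard_score({1: cap1, 4: cap4}))         # 0.5 0.75 0.625
```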

We now present detailed error analysis across the four VAKRA capabilities. To facilitate our analysis, we adopt stage-wise error categorization to assign each failure to the first point of breakdown. Specifically, we evaluate, in order: (i) whether the correct tool(s) were selected, (ii) whether the required arguments were provided without omissions or hallucinations, (iii) whether argument values were correct, and (iv) whether the final response is both accurate and grounded in the tool outputs.

Failure Stage Isolation

Since a single sample may exhibit multiple errors across different steps, we sequentially classify each instance to the earliest failing stage (e.g., tool selection errors take precedence over argument errors). This avoids double-counting and allows error categories to be interpreted as disjoint fractions of the dataset. While more granular metrics (e.g., precision/recall over tool usage) are possible (Elder et al., 2026), we find this formulation provides a simple and interpretable breakdown of agent failures.
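Assigning each sample to its earliest failing stage can be sketched as below. The stage names follow the four checks listed earlier; the per-sample boolean dictionaries and the helper names are assumptions for illustration.

```python
from collections import Counter

# Ordered stages: earlier failures take precedence over later ones.
STAGES = ["tool_selection", "argument_names", "argument_values", "final_response"]


def first_failure(checks: dict[str, bool]) -> str:
    """Return the earliest stage whose check failed, or 'success'."""
    for stage in STAGES:
        if not checks[stage]:
            return stage
    return "success"


def breakdown(samples: list[dict[str, bool]]) -> dict[str, float]:
    """Disjoint fractions of the dataset per failure category."""
    counts = Counter(first_failure(s) for s in samples)
    return {category: n / len(samples) for category, n in counts.items()}


samples = [
    {"tool_selection": False, "argument_names": False, "argument_values": True, "final_response": True},
    {"tool_selection": True, "argument_names": True, "argument_values": False, "final_response": False},
    {"tool_selection": True, "argument_names": True, "argument_values": True, "final_response": True},
    {"tool_selection": True, "argument_names": True, "argument_values": True, "final_response": False},
]
fractions = breakdown(samples)
print(fractions)
```

Because each sample lands in exactly one category, the fractions sum to 1 even when a sample has several errors, as the first two samples do.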

The instances in the API chaining capability (Capability 1) require selecting and sequencing multiple tools to solve a single task. There are 2,077 samples in this capability. It was challenging for all models, but GPT-OSS-120B performed best on this segment of the benchmark.

GPT-OSS-120B outperformed the other models by a large margin, mostly because of a better understanding of the tool schemas.

The tools in this part of the benchmark involve a large number of parameters, many of which are optional, and GPT-OSS-120B was noticeably more robust than the other models at choosing the right parameters to fill.

Overall, synthesizing a correct answer after making all tool calls correctly was less challenging in this section of the benchmark, most likely because the tool call sequencing made the tool choice problem less amenable to guessing than in the Dashboard API capability.

Fig 7: SEL-BIRD vs SLOT-BIRD Error Types Analysis

The Business Intelligence (BI) API capability contains two sets of APIs, from the SLOT-BIRD and SEL-BIRD tool collections. The SEL part of this benchmark had 600 samples, while the SLOT part of the benchmark had 1477 samples. These two collections are grouped under the BI API capability, but have slightly different characteristics. The SLOT-BIRD collection has a smaller number of generic tools with a large number of parameter values to fill, while the SEL-BIRD collection has a larger set of tools and fewer parameters per tool. This focus is reflected in the relative errors made by models using these two tool collections.

Using SLOT-BIRD, all models except GPT-OSS-120B made a substantial number of errors in producing correct names for the tool arguments. This is largely why GPT-OSS-120B performed so well overall in this segment of the benchmark.

With fewer parameters to fill, the same models made very few such errors when using the SEL-BIRD tool collection, but they made many more errors selecting the correct tools, reflecting the increased difficulty of choosing from a larger (and dynamic) tool set.

As shown above, for the 1,597 samples in the tool selection capability, Gemini-3-flash-preview outperforms the other models tested on all error categories.

As expected, since the dashboard API instances require the models to choose from a large number of tool options, but each tool requires only a small number of parameters, there are a large number of errors in tool selection and parameter value selection.

There seems to be little problem with hallucinating or skipping required parameters. However, even when all tool calls are made correctly, models (especially Gemini-3-flash-preview and Claude-Sonnet-4-5) still struggle to synthesize a correct answer from the tool responses, as evidenced by the large drop-offs at the far right side of the plot.

Multi-Hop Reasoning: Effect of Hop Depth on Model Performance

Fig 8: Comparison of Accuracy Across Models by Hop Depth

Multi-hop reasoning increases the difficulty of the original task by requiring models to successfully answer multiple implicitly coupled questions, each of which requires selecting and calling the correct API. As expected, all models performed best on the questions with only a single logical hop, and saw performance degradations on 2-hop and again on 3+ hop questions.

Multi-Hop Multi-Source Reasoning: Effect of Hybrid Hops on Model Performance

Fig 9: Model Accuracy Rates by Interaction Type (API, Document-Retriever, Hybrid)

The final segment of the dataset includes document sources in addition to the tool/API sources in the other segments. This leads to instances that require single or multiple API calls, single or multiple document searches, or some combination of API calls and document searches.

As before, there is a marked difference in performance on instances that require single API calls (1-hop API) as compared to those that require multiple API invocations (2-hop API), and including document retrievers makes the task more challenging (RAG Hops and Hybrid).

Interestingly, we find that on questions that require a single document retriever call (1-hop RAG), GPT-OSS-120B tries to return the answer directly from parametric knowledge, though when the question appears to require multiple hops, it does answer the question. We hypothesize that because the 1-hop RAG questions focus heavily on Wikipedia-style entities, the model skips the tool call. We do not see this problem on 1-hop API, where entities and facts specific to the back-end database may appear more frequently in the question.

It is also interesting that the performance of Gemini-3-flash-preview shoots up on 2-hop API-RAG as compared to other hybrid hop-patterns. This is likely explained by the relatively strong performance of Gemini-3-flash-preview on the dashboard APIs (Tool Selection Capability), and thus, once the correct intermediate answer is identified using the tool-call, the retrieval query is likely to be more successful.

Effect of Policies on Model Performance

Fig 10: Model Accuracy Rates by Policy Type

Policies introduce an additional layer of difficulty on top of multi-hop, multi-source reasoning. When a policy aligns with the source required for answering, i.e. it does not affect the tool list the model needs to answer the question, we refer to it as "No Updates to Answer". When a policy restricts access to the most relevant information source, we refer to it as "Policy updates the answer". As shown in Figure 10, all models except Granite-4.0-h-Small-32B show a clear drop in performance under the latter kind of policy constraint.

In general, we find that models fail in one of three ways: they violate the constraints, they fail to retrieve sufficient information (sometimes understanding the policy but still answering incorrectly), or they exhibit one of the previously analyzed failure modes.

Overall, tool-use policy-constrained settings suggest that while models can reason over tools and sources, they struggle to incorporate external constraints into that reasoning - often a key requirement for reliable real-world deployment.

VAKRA exposes a critical gap between surface-level tool competence and robust, end-to-end agent reliability. Although modern models can increasingly select APIs and execute isolated tool calls, VAKRA shows that these abilities alone are insufficient for real-world deployment. In practice, models often break down when required to perform compositional reasoning under execution constraints that span APIs, documents, dialog context, and policy requirements.

Try VAKRA — Where Does Your Agent Break?

Think your agent is solid? Put it to the test.

Run it on VAKRA and see where it falls apart—tool selection, multi-hop reasoning, or policy constraints.

⭐ Submit to the leaderboard: https://github.com/IBM/vakra?tab=readme-ov-file#submitting-to-the-live-leaderboard

📦 Explore the dataset: https://huggingface.co/datasets/ibm-research/VAKRA

🛠️ Check out the code: https://github.com/IBM/vakra

👉 Try it and tell us what your agent learned

