Hugging Face Blog · February 19, 2026, 01:15 · about 3 min read

IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

#AIAgents #Benchmarks #LLMReliability #EnterpriseAI

TL;DR

IBM and UC Berkeley developed two tools, IT-Bench and MAST, and established a method for analyzing and pinpointing why enterprise AI agents fail.

AI Deep Analysis · February 24, 2026, 07:40


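Key points

IBM and UC Berkeley developed and applied a method (MAST) for diagnosing why enterprise AI agents fail, together with a benchmark (IT-Bench).

The analysis found that failure patterns differ between a frontier model (Gemini) and a large OSS model (GPT-OSS-120B): the former mostly suffers from insufficient verification, the latter from cascading reasoning failures.

As practical guidance for agent design, three principles were proposed: "externalize verification," "manage termination conditions externally," and "handle ambiguity explicitly."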


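Impact analysis

This research offers a methodology for structurally diagnosing why agents fail, rather than just a performance score, and so contributes to production-grade reliability. It moves AI adoption in high-risk domains such as enterprise IT automation toward a form that is debuggable and open to intervention.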

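Editorial comment

A study that explains "why AI gets things wrong" in a form engineers can act on. It lays out concrete design guidelines for practitioners and is a must-read for development teams in the field.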

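IBM and the University of California, Berkeley (UC Berkeley) conducted a joint study that diagnoses in detail how agentic LLM systems fail in real-world operation of enterprise IT automation. The study targets tasks that involve long tool-use loops, such as incident triage, log and metrics queries, and Kubernetes operations.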

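Conventional benchmarks collapse performance into a single number, so they reveal whether an agent failed but not why it failed. This is the "black box" problem. To address it, the research team applied MAST (Multi-Agent System Failure Taxonomy), a method for diagnosing agent reliability. By using MAST to analyze execution traces from ITBench, an industry-standard benchmark for IT automation, they converted raw data into structured "failure signatures" and identified what broke and how. Specifically, they annotated and analyzed 310 execution traces from ITBench SRE tasks across three distinct model classes: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B.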

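The main findings of the analysis are as follows:

Marked differences in failure patterns between model classes:

Frontier models (such as Gemini-3-Flash): failures are relatively simple (2.6 failure modes per trace on average) and tend to concentrate in specific bottlenecks such as verification.
Large open models (such as GPT-OSS-120B): these suffer from cascading failures (5.3 failure modes per trace on average). A small reasoning error early in the run contaminates the context, and hallucinations then snowball.

The largest failure factor shared by all models: the strongest predictor of failure was FM-3.3 (Incorrect Verification). Agents consistently "declare" task success without checking the actual outcome (the ground truth).

A weakness specific to one model: Kimi-K2 has trouble recognizing when a task is complete. "Premature Termination" spiked by 46% and "Unaware of Termination Conditions" by 43%. The model tended either to give up just before solving the problem or to loop indefinitely.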

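The analysis yields important lessons for building practical agents:

For frontier models, externalize verification: do not let the LLM grade its own work. Require hard, tool-based evidence before the agent is allowed to exit.
Put termination decisions and loop control outside the model: termination-related issues (MAST category FM-1.5) are a common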

Original text

IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

Ayhan Sebin, Saurabh Jha, Rohan Arora, Daby Sow, Mert Cemri, Melissa Pan, Ion Stoica

ITBench HF Space · ITBench HF Dataset · MAST HF Dataset · ITBench GitHub · MAST GitHub

IBM Research and UC Berkeley collaborated to study how agentic LLM systems break in real-world IT automation, for tasks involving incident triage, logs/metrics queries, and Kubernetes actions in long-horizon tool loops.

Benchmarks typically reduce performance to a single number, telling you whether an agent failed but never why. To solve this black-box problem, we applied MAST (Multi-Agent System Failure Taxonomy), an emerging practice for diagnosing agentic reliability. By leveraging MAST to analyze ITBench, the industry benchmark for SRE, Security, and FinOps automation, we turned raw execution traces into structured failure signatures, revealing exactly what broke and how to fix it. We annotated 310 ITBench SRE traces across three distinct model classes: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B.

Frontier models like Gemini-3-Flash fail cleanly (2.6 failure modes/trace), typically hitting isolated bottlenecks like verification. Large open models like GPT-OSS-120B suffer from cascading failure modes (5.3 failure modes/trace). A single reasoning mismatch early in the run poisons the context, leading to compounding hallucinations.

Across all models, the strongest predictor of failure is FM-3.3 (Incorrect Verification). Agents consistently "declare victory" without checking ground truth.

Kimi-K2 struggles to recognize when a task is done. It exhibits a massive spike in Premature Termination (+46%) and Unaware of Termination Conditions (+43%), often quitting just before solving the problem or looping indefinitely.
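The FM-3.3 pattern described above, where an agent declares success without checking ground truth, can be guarded against by refusing to accept the agent's own claim until an independent check passes. The following is a minimal sketch under stated assumptions, not the authors' implementation: `AgentResult`, `finalize`, and the `check_ground_truth` callback are hypothetical names, and the check is assumed to query the real system (for example, a health probe) rather than the model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentResult:
    """Hypothetical container for what the agent reports about its own run."""
    claimed_success: bool
    summary: str


def finalize(result: AgentResult, check_ground_truth: Callable[[], bool]) -> str:
    """Accept a success claim only if an independent check confirms it.

    `check_ground_truth` is an assumed callback that inspects the real
    environment; it must not ask the LLM for its opinion.
    """
    if not result.claimed_success:
        return "failed"
    # The model said it succeeded; trust the tool evidence, not the claim.
    return "verified" if check_ground_truth() else "unverified_claim"
```

For example, `finalize(AgentResult(True, "restarted pod"), lambda: False)` returns `"unverified_claim"`, which a harness could treat as a failure or send back to the agent for another attempt.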

Takeaways from our analysis when building agents:

For Frontier Models like Gemini: Externalize Verification. Never let the LLM grade its own homework. Require hard tool evidence before exit.

Put termination + loop control outside the model: Termination issues are common killers (FM-1.5). Add explicit stop conditions + loop detectors for repeated tool calls/actions or implement Finite State Machines.

Force clarify-or-read-only when inputs are ambiguous: Clarification failures (FM-2.2) are a major failure driver for smaller models. Make ambiguity a first-class branch in your agent graph.
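The second and third takeaways above can be sketched as a small guard that lives outside the model. This is an illustrative sketch only: `LoopGuard`, `Decision`, `route`, and the default thresholds are hypothetical and not taken from the article, and tool arguments are assumed to be JSON-serializable.

```python
import json
from collections import Counter
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    STOP_BUDGET = "stop_budget"  # explicit stop condition: step budget used up
    STOP_LOOP = "stop_loop"      # the same tool call was repeated too often


class LoopGuard:
    """Tracks tool calls outside the model and decides when to stop."""

    def __init__(self, max_steps: int = 20, max_repeats: int = 3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()

    def observe(self, tool: str, args: dict) -> Decision:
        self.steps += 1
        # Canonicalize the arguments so identical calls map to the same key.
        key = (tool, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        if self.seen[key] >= self.max_repeats:
            return Decision.STOP_LOOP
        if self.steps >= self.max_steps:
            return Decision.STOP_BUDGET
        return Decision.CONTINUE


def route(request_is_ambiguous: bool) -> str:
    """Make ambiguity an explicit branch: ask for clarification or stay read-only."""
    return "clarify_or_read_only" if request_is_ambiguous else "full_access"
```

The agent loop would call `observe` before executing each tool call and end the run on anything other than `Decision.CONTINUE`. How ambiguity is detected is left open here; `route` only shows that the branch is chosen by the harness rather than by the model.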

If you’re building agents for enterprise IT workflows, this is the kind of evaluation you want: not just “did it pass?”, but “what broke, where, and what intervention is most leverageable?”

The "Black Box" Problem of Agent Benchmarks

Benchmarks like ITBench are becoming the standard for measuring agentic performance in high-stakes IT automation tasks. In ITBench, agents act as Site Reliability Engineers (SREs) or Security Analysts tasked with diagnosing Kubernetes outages, patching vulnerabilities, or managing cloud costs in production environments.

These benchmarks use success rate as the main metric to evaluate agents. However, this metric is insufficient for engineering robust systems. Knowing that an agentic system achieves a 14% success rate on ITBench tells us that it failed, but not why: Did it fail because it forgot the context? Because it hallucinated a command? Or because it simply did not terminate?

Without a comprehensive approach to diagnose these failures, developers are left guessing, often resorting to blind prompting tweaks that solve one problem only to create another.

As a new standard for analyzing the failure modes of complex agentic systems, we developed MAST (Multi-Agent System Failure Taxonomy). MAST brings more insight and opens up the opaque evaluation of these benchmarks. Derived from a rigorous analysis of over 1,600 traces across seven different frameworks, MAST provides a standardized taxonomy for agent failures.

MAST converts unstructured execution logs into structured "failure vectors" based on 14 distinct patterns across three key categories:

FC1: System Design Issues (The "Skeleton") Failures here stem from the agent's architecture and role definition.

Examples: FM-1.3 Step Repetition (looping), FM-1.4 Loss of Conversation History (memory leaks), FM-1.5 Unaware of Termination (failing to stop).

FC2: Inter-Agent Misalignment (The "Communication") Failures arising during runtime from how agents talk to each other or the environment.

Examples: FM-2.2 Fail to Ask for Clarification (assuming instead of asking), FM-2.3 Task Derailment (going off-topic).

FC3: Task Verification (The "Quality Control") Failures in quality assurance of the agents' output.

Examples: FM-3.1 Premature Termination (giving up too soon), FM-3.3 Incorrect Verification (hallucinating success).
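One way to picture a "failure vector" is as a binary indicator per failure mode for each annotated trace. The sketch below is an assumption about the representation, not MAST's actual data format: it covers only the seven failure modes named above (MAST defines 14), and `failure_vector`, `category`, and `modes_per_trace` are hypothetical helper names.

```python
# Only the failure modes named in the text; the full taxonomy has 14.
NAMED_MODES = {
    "FM-1.3": "Step Repetition",
    "FM-1.4": "Loss of Conversation History",
    "FM-1.5": "Unaware of Termination",
    "FM-2.2": "Fail to Ask for Clarification",
    "FM-2.3": "Task Derailment",
    "FM-3.1": "Premature Termination",
    "FM-3.3": "Incorrect Verification",
}


def category(mode_id: str) -> str:
    """Map a failure mode ID to its category, e.g. "FM-3.3" to "FC3"."""
    return "FC" + mode_id.split("-")[1].split(".")[0]


def failure_vector(observed: set) -> dict:
    """Turn the set of modes annotated on one trace into a 0/1 vector."""
    unknown = observed - set(NAMED_MODES)
    if unknown:
        raise ValueError(f"unknown failure modes: {sorted(unknown)}")
    return {mode: int(mode in observed) for mode in NAMED_MODES}


def modes_per_trace(vectors: list) -> float:
    """Average number of failure modes per trace."""
    return sum(sum(v.values()) for v in vectors) / len(vectors)
```

With vectors in this shape, a figure such as "failure modes/trace" is simply the mean of the row sums, and per-category counts follow from grouping the columns with `category`.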

The Experiment: Diagnosing ITBench Agents

We stress-test the idea of using MAST to make agent evaluations actionable and gain insights on the failure modes by applying it to ITBen
