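A New Framework for Evaluating Voice Agents (EVA)
An article published on Hugging Face introduces EVA, described as the first framework to jointly evaluate the accuracy and conversational experience of voice agents, and releases an airline dataset along with benchmark results for 20 systems.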
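Key points
A new framework for evaluating voice agents
EVA is the first end-to-end evaluation framework to score accuracy (EVA-A) and conversational experience (EVA-X) together, using a realistic bot-to-bot conversation architecture.
An accuracy-experience tradeoff
Benchmark results for 20 cascade and audio-native systems show a consistent tradeoff: agents that perform well on task completion tend to deliver worse user experiences, and vice versa.
Airline dataset released
An initial dataset of 50 scenarios covering flight rebooking, cancellation handling, vouchers, and more has been released; it is the first in a planned series of domains.
A practical evaluation approach
Whereas existing frameworks evaluate task success or conversational dynamics separately, EVA scores both together, surfacing failures that matter in practice.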
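Limits of existing evaluation methods
Existing methods evaluate speech understanding or conversational dynamics in isolation; the field lacks an integrated framework that assesses agent capability across complete, practical conversational workflows.
EVA's evaluation approach
EVA simulates bot-to-bot spoken conversations and evaluates whether the agent invokes appropriate tools, adheres to task policies, and reaches a verifiable end state.
Why end-to-end evaluation matters
End-to-end evaluation reveals interaction dynamics that are not apparent at the component level, such as how interruptions, recovery, and latency affect conversational flow.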
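Impact analysis
The framework offers an evaluation basis for deploying voice AI in practice and can guide developers in balancing accuracy against user experience. Because it is released as open source, it may help standardize evaluation across the industry and accelerate research.
Editorial comment
The work is valuable for proposing a concrete approach to evaluation, a long-standing challenge in putting voice AI into practice. In particular, its demonstration of the accuracy-experience tradeoff is likely to influence future development priorities.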
A New Framework for Evaluation of Voice Agents (EVA)
Conversational voice agents present a distinct evaluation challenge: they must simultaneously satisfy two objectives — accuracy (completing the user's task correctly and faithfully) and conversational experience (doing so naturally, concisely, and in a way appropriate for spoken interaction). These objectives are deeply intertwined: mishearing a confirmation code renders perfect LLM reasoning meaningless, a wall of options overwhelms a caller who can't skim spoken output, and delayed responses can pass every accuracy check while remaining unusable in practice. Existing frameworks treat these as separate concerns — evaluating task success or conversational dynamics, but not both.
We introduce EVA, an end-to-end evaluation framework for conversational voice agents that evaluates complete, multi-turn spoken conversations using a realistic bot-to-bot architecture. EVA produces two high-level scores, EVA-A (Accuracy) and EVA-X (Experience), and is designed to surface failures along each dimension. EVA is the first to jointly score task success and conversational experience. We release EVA with an initial airline dataset of 50 scenarios covering flight rebooking, cancellation handling, vouchers, and more — the first in a planned series of domains.
We also provide benchmark results for 20 cascade and audio-native systems, such as speech-to-speech models and large audio language models. Our biggest finding is that there is a consistent Accuracy-Experience tradeoff; agents that perform well on task completion tend to deliver worse user experiences, and vice versa.
🌐 Website - Explore the full framework, early results, and a demo.
💻 GitHub - Dive into the code, dataset, and judge prompts.
Background and Motivation
The field currently lacks a framework that evaluates the full quality of voice agent interactions, as most existing efforts assess individual components in isolation. For example, AudioBench, SD-Eval, VoxEval, Kimi-Eval, VoiceBench, and VoxDialogue evaluate core speech understanding capabilities (transcription, paralinguistics, acoustic cues) but remain confined to single-turn, non-interactive settings. EmergentTTS and SHEET, by contrast, assess perceived speech quality using subjective listening tests (e.g., Mean Opinion Score). Beyond speech perception, FD-Bench, Talking Turns, and Full-Duplex-Bench provide deeper analyses of conversational dynamics (interruptions, backchanneling, turn-taking), yet evaluate these in isolation from task-oriented tool use, leaving the relationship between dialogue quality and agentic capability unexamined. More recent efforts, notably VoiceAgentBench and CAVA, take steps toward evaluating the agentic capabilities of commercial voice agent systems, including tool-calling and complex instruction-following. However, these capabilities are not evaluated within the complete conversational workflows that voice agents must navigate in practice: from initial user request through multi-step tool orchestration to final task resolution.
The absence of evaluations that jointly capture accuracy and experience underscores the need for a framework that treats voice agent quality as an integrated whole. This means evaluating not only whether the task succeeded, but also whether the agent communicated accurately, concisely, and naturally throughout, and surfacing how these dimensions trade off against one another in realistic deployment conditions.
End-to-end evaluation reveals interaction dynamics that are not apparent at the component level: whether the agent interrupts users during natural pauses in speech, whether it recovers smoothly when a user corrects a transcription error, or whether high latency disrupts the conversational flow enough to prompt users to repeat themselves or abandon the task entirely.

EVA simulates multi-turn spoken conversations over live audio in which the agent must invoke appropriate tools, adhere to task-specific policies, and reach a deterministically verifiable end state. EVA evaluates voice agents using a bot-to-bot audio architecture composed of five core components:
User Simulator — A conversational AI configured with a specific goal and persona that plays the role of a caller. It operates in audio using high-quality TTS models, ensuring the evaluation captures representative speech-understanding challenges in natural-sounding conversational speech and realistic turn-taking dynamics.
Voice Agent — The voice agent being evaluated, built with Pipecat, an open-source Python framework for real-time voice applications. EVA supports both cascade architectures (STT → LLM → TTS) and audio-native models (S2S or S2T → TTS).
Tool Executor — The engine that provides deterministic, reproducible tool responses via custom Python functions. It dynamically queries and modifies a predefined per-scenario database.
Validators — A set of validation metrics that check that conversations are complete and that the user faithfully reproduced the intended behavior and speech, with no human annotation required. Any conversation that fails in this validation step is regenerated, ensuring that only valid, correctly executed conversations enter evaluation. This stands in contrast to approaches that rely on post-hoc human labeling to identify simulator errors.
Metrics Suite — A suite of metrics evaluates the voice agent using the conversation recording, transcript, and tool call logs.
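The Tool Executor described above can be pictured roughly as follows. This is a minimal sketch, not EVA's actual implementation: the class name `ToolExecutor`, the tool names `get_reservation` and `rebook_flight`, and the database layout are all hypothetical, chosen only to show how deterministic tool functions might query and modify a per-scenario database while logging every call.

```python
import copy
from typing import Any, Callable


class ToolExecutor:
    """Hypothetical executor: runs tool calls against one scenario's database."""

    def __init__(self, scenario_db: dict[str, Any]) -> None:
        # Deep-copy so every conversation starts from the same initial state.
        self.db = copy.deepcopy(scenario_db)
        self.call_log: list[dict[str, Any]] = []
        self._tools: dict[str, Callable[..., dict[str, Any]]] = {
            "get_reservation": self._get_reservation,
            "rebook_flight": self._rebook_flight,
        }

    def _get_reservation(self, confirmation_code: str) -> dict[str, Any]:
        reservation = self.db["reservations"].get(confirmation_code)
        if reservation is None:
            return {"status": "error", "message": "reservation not found"}
        return {"status": "ok", "reservation": copy.deepcopy(reservation)}

    def _rebook_flight(self, confirmation_code: str, new_flight: str) -> dict[str, Any]:
        reservation = self.db["reservations"].get(confirmation_code)
        if reservation is None:
            return {"status": "error", "message": "reservation not found"}
        if new_flight not in self.db["flights"]:
            return {"status": "error", "message": "unknown flight"}
        # Mutate only the flight; other fields (e.g. the seat) are left as they were.
        reservation["flight"] = new_flight
        return {"status": "ok", "reservation": copy.deepcopy(reservation)}

    def execute(self, name: str, **arguments: Any) -> dict[str, Any]:
        """Dispatch one tool call and record it in the call log."""
        if name not in self._tools:
            result = {"status": "error", "message": f"unknown tool: {name}"}
        else:
            result = self._tools[name](**arguments)
        self.call_log.append({"tool": name, "arguments": arguments, "result": result})
        return result
```

Because the same inputs always produce the same outputs and the initial database is copied rather than shared, repeated runs of a scenario start from identical conditions, which is what makes the final state comparable against a ground truth.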
Each test case (scenario) in our framework is an evaluation record, structured to make tests reproducible:
User Goal — What the caller is trying to accomplish. Includes a highly specific user objective with an exact decision tree that guides the user simulator through the conversation, leaving no ambiguity about the intended outcome.
User Persona — How the caller should behave — their speaking style, patience level, and personality traits.
Scenario Database — The backend data the agent's tools will query.
Ground Truth — The expected final state of the scenario database after a successful conversation.
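To make the four parts of an evaluation record concrete, here is one way such a record could be modeled. The field names, the `decision_tree` representation, and the example values are assumptions made for illustration; the actual schema is defined in the EVA repository and may look quite different.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class UserPersona:
    speaking_style: str
    patience: str
    traits: tuple[str, ...] = ()


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    user_goal: str                 # what the caller is trying to accomplish
    decision_tree: dict[str, str]  # simulator reaction for each situation it may face
    persona: UserPersona           # how the caller behaves
    scenario_db: dict[str, Any]    # initial backend state the agent's tools query
    ground_truth: dict[str, Any]   # expected backend state after a successful call


# Illustrative record; every value here is made up.
example = Scenario(
    scenario_id="rebook-001",
    user_goal="Move reservation ABC123 to flight XY200 and keep seat 12A.",
    decision_tree={
        "offered XY200": "accept",
        "offered any other flight": "decline and ask for XY200",
    },
    persona=UserPersona("terse", "low", ("interrupts when kept waiting",)),
    scenario_db={"reservations": {"ABC123": {"flight": "XY100", "seat": "12A"}}},
    ground_truth={"reservations": {"ABC123": {"flight": "XY200", "seat": "12A"}}},
)
```

Keeping the initial database and the ground truth side by side in one record means a test can be replayed and checked without any information outside the record itself.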
We release EVA with a synthetic airline dataset of 50 scenarios, spanning IRROPS rebooking, voluntary itinerary changes, cancellations, same-day standby, and compensation vouchers. Scenarios are designed to test temporal reasoning, policy-following, constraint satisfaction, and named-entity handling.
Evaluation Methodology
EVA evaluates voice agents across two fundamental dimensions, EVA-A for accuracy, and EVA-X for experience. EVA also includes a set of diagnostic metrics. Unlike the primary metrics, these are not used directly to compare or rank models — rather, they offer granular insight into why a model scores the way it does, helping identify and understand specific failure modes (e.g., ASR, speech synthesis, etc.). We report pass@k (the probability that at least one of k runs succeeds) and pass^k (the probability that all k runs succeed) across three trials per scenario (k = 3), capturing both peak performance and behavioral consistency.
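The two reliability measures can be estimated very simply when the number of trials equals k, as it does here with three trials per scenario. The sketch below assumes each trial has already been reduced to a pass/fail boolean; the function names and the sample outcomes are illustrative and not taken from the EVA codebase or its results.

```python
def pass_at_k(results: dict[str, list[bool]]) -> float:
    """Fraction of scenarios in which at least one of the k trials succeeded."""
    return sum(any(trials) for trials in results.values()) / len(results)


def pass_hat_k(results: dict[str, list[bool]]) -> float:
    """Fraction of scenarios in which all k trials succeeded (pass^k)."""
    return sum(all(trials) for trials in results.values()) / len(results)


# Made-up outcomes for four scenarios, three trials each (k = 3).
trial_results = {
    "scenario_1": [True, True, True],
    "scenario_2": [True, False, False],
    "scenario_3": [False, False, False],
    "scenario_4": [True, True, False],
}

print(pass_at_k(trial_results))   # 0.75: three scenarios succeeded at least once
print(pass_hat_k(trial_results))  # 0.25: only one scenario succeeded every time
```

In this toy example the gap between the two numbers comes entirely from scenarios 2 and 4, where the agent can complete the task but does not do so on every attempt.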

EVA uses two evaluation methods: deterministic code-based metrics, which compute scores directly from structured data and are fast; and LLM-as-Judge metrics, which use Large Language Models (LLMs) to assess qualitative aspects of the conversation, or Large Audio Language Models (LALM) to evaluate speech directly. Each judge-based metric uses the model that performs best on a curated evaluation dataset for that specific metric.
EVA-A: Accuracy
Task completion alone is a necessary but insufficient measure of accuracy. An agent can reach the correct end state while fabricating a policy detail, misreading a confirmation code aloud, or hallucinating a flight number mid-conversation. These failures are invisible to a binary pass/fail check but directly harm users. EVA-A therefore measures three dimensions of accuracy:
Task Completion [Deterministic]. Measures whether the agent correctly completed the task by comparing the expected end state of the scenario database against the actual end state after the conversation.
Faithfulness [LLM-as-Judge]. Measures whether the agent's responses were grounded in its instructions, policies, user inputs, and tool call results — flagging fabrications, misrepresentations, policy violations, and hallucinations.
Agent Speech Fidelity [LALM-as-Judge]. Measures whether the speech system faithfully reproduced the intended text in spoken audio, with particular focus on entities critical to get right in a voice context, such as confirmation codes, flight numbers, and dollar amounts. This is the only metric in any end-to-end voice agent benchmark that evaluates the quality of the agent's own spoken output at the audio level.
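The deterministic Task Completion check amounts to comparing two database states. A minimal sketch of such a comparison is shown below, assuming both states are nested dictionaries; the helper names `diff_states` and `task_completed` are hypothetical, and the real metric may normalize or compare state differently.

```python
from typing import Any


def diff_states(expected: Any, actual: Any, path: str = "") -> list[str]:
    """Return human-readable descriptions of where two nested states differ."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        diffs: list[str] = []
        for key in sorted(set(expected) | set(actual), key=str):
            child = f"{path}.{key}" if path else str(key)
            if key not in actual:
                diffs.append(f"{child}: missing in actual")
            elif key not in expected:
                diffs.append(f"{child}: unexpected in actual")
            else:
                diffs.extend(diff_states(expected[key], actual[key], child))
        return diffs
    if expected != actual:
        return [f"{path}: expected {expected!r}, got {actual!r}"]
    return []


def task_completed(expected: dict[str, Any], actual: dict[str, Any]) -> bool:
    """Binary pass/fail: the task counts as complete only if the states match exactly."""
    return not diff_states(expected, actual)


# Illustrative states: the flight was rebooked, but the seat assignment was lost.
expected_state = {"reservations": {"ABC123": {"flight": "XY200", "seat": "12A"}}}
actual_state = {"reservations": {"ABC123": {"flight": "XY200", "seat": None}}}

print(task_completed(expected_state, actual_state))  # False
print(diff_states(expected_state, actual_state))
```

Returning the list of differences alongside the boolean is a design choice that keeps the score binary while still making it easy to see which field caused a failure.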
EVA-X: Experience
Turn-taking timing matters, but it tells only part of the story. An agent can have perfect timing while overwhelming a caller with a wall of spoken options they cannot skim, or repeatedly asking for information already given. These failures degrade the experience without ever involving a mistimed response. EVA-X therefore measures three dimensions of experience:
Conciseness [LLM-as-Judge]. Measures whether the agent's responses were appropriately brief and focused for spoken delivery, since phone users cannot skim, re-read, or scroll back through long responses.
Conversation Progression [LLM-as-Judge]. Measures whether the agent moved the conversation forward effectively — avoiding repetition, retaining context across turns, and driving toward task completion without stalling.
Turn-Taking [LLM-as-Judge]. Measures whether the agent spoke at the right time — neither interrupting the user nor introducing excessive silence after they finish speaking.
We evaluated 20 systems — proprietary and open-source, cascade and audio-native — and find a consistent accuracy-experience tradeoff: agents that perform well on task completion tend to deliver worse user experiences, and vice versa — a tradeoff invisible to benchmarks that score only task completion. No single configuration dominates both axes, confirming that accuracy and experience must be measured jointly.
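The statement that no single configuration dominates both axes can be checked mechanically by computing the set of non-dominated systems. The sketch below uses placeholder system names and invented scores purely to demonstrate the computation; none of the numbers are EVA results, and `accuracy` and `experience` merely stand in for EVA-A and EVA-X.

```python
from typing import NamedTuple


class Score(NamedTuple):
    name: str
    accuracy: float    # stands in for an EVA-A score
    experience: float  # stands in for an EVA-X score


def dominates(a: Score, b: Score) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (
        a.accuracy >= b.accuracy
        and a.experience >= b.experience
        and (a.accuracy > b.accuracy or a.experience > b.experience)
    )


def pareto_front(scores: list[Score]) -> list[Score]:
    """Keep every system that no other system dominates."""
    return [s for s in scores if not any(dominates(other, s) for other in scores)]


# Invented scores for four placeholder systems.
systems = [
    Score("system_a", 0.8, 0.4),
    Score("system_b", 0.6, 0.6),
    Score("system_c", 0.4, 0.8),
    Score("system_d", 0.5, 0.5),
]

front = pareto_front(systems)
print([s.name for s in front])  # ['system_a', 'system_b', 'system_c']
```

A front containing more than one system means that choosing among them requires deciding how much accuracy to give up for experience, which is exactly the information a single task-completion score hides.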
Additionally, we identified named entity transcription as a dominant failure mode. A single misheard character can cascade into an authentication failure and a full conversation breakdown. Also, multi-step workflows break agents in predictable ways. Rebooking a flight while preserving ancillary services — seats, baggage — is the dominant complexity breaker across all configurations. Finally, we observed that additional calibration is needed for real-world use cases. The gap between pass@3 and pass^3 is substantial across all configurations. Even agents that can complete a task often cannot do so consistently, which is critical for real-world success.
View the early results here.
EVA-Bench is designed to provide rigorous, end-to-end evaluation of conversational voice agents, but several limitations are important to acknowledge, across the framework, data, and metrics dimensions:
Framework: The user simulator relies on a single commercial provider whose voice characteristics may systematically favor certain ASR systems, and the bot-to-bot pipeline — including audio format conversions and real-time audio interfaces — may not fully represent production deployments. Also, full reproduction requires commercial API access, and latency measurements will vary across providers and infrastructure.
Data: The current release covers 50 English-language scenarios in a single domain; results may not generalize to other use cases, languages, or accents.
Metrics: LLM-as-judge models carry inherent biases and may favor certain response styles independent of quality, with additional risk of systematic bias when the evaluated and judge models share a provider. While we validate our judges against labeled datasets and report accuracy measurements on our website, these alignment scores do not eliminate systematic bias entirely. Additionally, task completion is measured as binary, which does not capture partial credit and may understate the relative quality of systems that fail gracefully compared with those that fail catastrophically.
On the evaluation side, we plan to add prosodic quality assessment (pronunciation, rhythm, expressiveness) — currently an open problem after finding very low alignment between LALM-as-Judge and human judgments. We also plan robustness testing under noisy conditions, diverse accents, multilingual users, and varied speaker behaviors, alongside affect-aware evaluation of how agents respond to user distress. In terms of data, we are developing additional domain datasets — each with distinct policy structures, named entity profiles, and conversational dynamics — and more complex scenarios involving compound requests, multi-step follow-ups, and longer conversational memory. On the tooling front, we will release a results and error analysis application that automatically identifies errors per metric and model, surfaces representative examples for exploration, and generates structured summaries of each model’s strengths and weaknesses. Finally, we intend to expand the leaderboard continuously to provide an up-to-date assessment of voice agent capabilities across the field.
View more details about limitations and our upcoming roadmap here.
Getting Started
Go to our GitHub to use the framework!
Acknowledgements
Core contributors include Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Hoang Nguyen, Raghav Mehndiratta, and Hari Subramani.
We also thank Lindsay Brin, Akshay Kalkunte, Joseph Marinier, Jishnu Nair, and Aman Tiwari for their careful data review and thoughtful contributions to the framework, and Fanny Riols, Anil Madamala, Sridhar Nemala, and Srinivas Sunkara for their management, leadership, and support throughout. We also extend our thanks to the PAVA and CLAE ServiceNow teams, whose prior work on evaluations and voice agents provided valuable inspiration for this project.
@misc{eva-2026,
  title={A New Framework for Evaluation of Voice Agents (EVA)},
  author={Bogavelli, Tara and Gauthier Melançon, Gabrielle and Stankiewicz, Katrina and Bamgbose, Oluwanifemi and Nguyen, Hoang and Mehndiratta, Raghav and Subramani, Hari},
  year={2026},
  url={https://github.com/ServiceNow/EVA-Bench}
}