Hugging Face Blog · February 12, 2026, 09:00 · approx. 3 min read

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments

#AIAgents #Benchmarks #OpenSource #RealWorldEvaluation

TL;DR

This article covers work that uses OpenEnv in real environments to evaluate how well tool-using agents perform.

AI Deep Analysis · February 24, 2026, 15:40

Importance: / 5

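Key Points

Practical adoption of OpenEnv, an open-source framework that closes the evaluation gap between research settings and production environments

Calendar Gym, a calendar management environment with real-world complexity (permission control, temporal reasoning, partial information, error recovery), offered as a benchmark

A framework that moves AI agent evaluation from simulation to integration with real systems, making it possible to verify reliability and practicality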


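Impact Analysis

This effort directly addresses the significant gap between how AI agents are evaluated in research and how reliable they are in the real world, and it is likely to contribute to benchmark standardization across the industry and to practical evaluation methods. Because it is delivered as an open-source framework from a Meta and Hugging Face collaboration, broad adoption and ecosystem growth can be expected.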

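Editorial Comment

An important framework that marks the shift for AI agents from "works in a demo" to "can be trusted in the real world." Using an everyday tool like a calendar as a complex benchmark environment is a practical and clever idea.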

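Article Summary: Putting OpenEnv, an Evaluation Framework for Tool-Using AI Agents, into Practice in Real Environments

AI agents perform well in research settings, but they often struggle once deployed in real systems, where they must reason across multiple steps, interact with real tools and APIs, operate on incomplete information, and recover from errors in stateful, permissioned environments. A large gap remains between research success and production reliability.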

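To address this, Meta and Hugging Face developed the open-source framework OpenEnv, which aims to standardize how agents interact with real environments. As part of the collaboration, Turing contributed a production-grade calendar management environment for studying tool-using agents under realistic constraints such as access control, temporal reasoning, and multi-agent coordination.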

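OpenEnv is a framework for evaluating AI agents against real systems rather than simulations. It provides a standardized way to connect agents to real tools and workflows while preserving the structure needed for consistent, reliable evaluation. It adopts a gym-style API (reset, step, action, observation) similar to Gymnasium, the successor to OpenAI's Gym, and connects to environments through a standard MCP tool-call interface, which gives a consistent interface across domains and from simulation to production. Because environments keep state across multiple actions, long-horizon reasoning becomes possible, and they can connect directly to real APIs and tools such as browsers, code repositories, and calendars. This shifts the evaluation question from "Does it work in a controlled demo?" to "Does it operate reliably in the real world?"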
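The reset/step pattern and the idea of state persisting across actions can be illustrated with a small toy environment. This is only a sketch of the general gym-style loop: `ToyEventEnv`, `Observation`, and the action dictionaries are hypothetical names made up for illustration and are not part of OpenEnv's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """What the agent sees after each call (hypothetical shape)."""
    success: bool
    payload: dict = field(default_factory=dict)


class ToyEventEnv:
    """Hypothetical stateful environment with a reset/step interface."""

    def __init__(self) -> None:
        self._events: list[str] = []

    def reset(self) -> Observation:
        # Start a fresh episode: clear all state left over from earlier runs.
        self._events = []
        return Observation(success=True, payload={"events": []})

    def step(self, action: dict) -> Observation:
        # State persists between step() calls within one episode.
        if action.get("type") == "add_event":
            self._events.append(action["summary"])
            return Observation(True, {"events": list(self._events)})
        if action.get("type") == "list_events":
            return Observation(True, {"events": list(self._events)})
        # Unknown actions fail without changing state.
        return Observation(False, {"error": f"unknown action: {action.get('type')}"})


env = ToyEventEnv()
env.reset()
env.step({"type": "add_event", "summary": "Team Sync"})
env.step({"type": "add_event", "summary": "1:1"})
obs = env.step({"type": "list_events"})
print(obs.payload["events"])
```

The third call can only return both events because the environment remembers the first two, which is the property that lets an evaluation probe multi-step behavior rather than isolated tool calls.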

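The Calendar Gym: A Production-Grade Benchmark

Calendar systems are deceptively complex. Scheduling a meeting looks simple, but real-world calendar management requires an agent to reason about time, permissions, multiple users, and incomplete information, often across several dependent steps. These properties make calendars a strong testbed for evaluating tool-using agents outside controlled simulations.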

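The Calendar Gym, built by Turing, grounds OpenEnv in this kind of realistic, demanding use case. Rather than simulating scheduling in the abstract, it exposes agents to the same constraints they would face in a real calendar system: access control lists across users and calendars, limited visibility into other users' state, and multi-step workflows in which actions must be chained in the correct order. Agents interact with a rich set of operations, ranging from listing calendars to modifying events and permissions.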

Original Text

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments

AI agents often perform impressively in controlled research settings, yet struggle when deployed in real-world systems where they must reason across multiple steps, interact with real tools and APIs, operate under partial information, and recover from errors in stateful, permissioned environments—highlighting a persistent gap between research success and production reliability.

OpenEnv is an open-source framework from Meta and Hugging Face designed to address this challenge by standardizing how agents interact with real environments. As part of this collaboration, Turing contributed a production-grade calendar management environment to study tool-using agents under realistic constraints such as access control, temporal reasoning, and multi-agent coordination.

In this post, we explore how OpenEnv works in practice, why calendars serve as a powerful benchmark for real-world agent evaluation, and what our findings reveal about the current limitations of tool-using agents.

OpenEnv is a framework for evaluating AI agents against real systems rather than simulations. It provides a standardized way to connect agents to real tools and workflows while preserving the structure needed for consistent and reliable evaluation.

OpenEnv uses a gym-oriented API (reset, step, action, observations) like Gymnasium, the successor to OpenAI's Gym. It also connects to environments through a standard MCP tool call interface, which provides a consistent interface across domains and from simulation to production environments.

The environments maintain state across multiple actions—enabling long-horizon reasoning—and can connect directly to real APIs and tools such as browsers, code repositories, or calendars. This shifts evaluation from "Can this work in a controlled demo?" to "Can this operate reliably in the real world?"

The Calendar Gym: A Production-Grade Benchmark

Calendar systems are deceptively complex. While scheduling a meeting seems simple, real-world calendar management requires agents to reason over time, permissions, multiple users, and incomplete information—often across several dependent steps. These properties make calendars a powerful testbed for evaluating tool-using agents outside controlled simulations.

To ground OpenEnv in this kind of realistic, demanding use case, Turing built a production-grade calendar management environment referred to as the Calendar Gym. Rather than simulating scheduling in the abstract, it exposes agents to the same constraints they would face in real calendar systems: Access Control Lists across users and calendars, limited visibility into other users' state, and multi-step workflows where actions must be chained in the correct order. Agents interact with a rich set of calendar operations—from listing calendars to modifying events and permissions—and must handle failed actions, incorrect assumptions, and missing permissions. Each session runs in an isolated environment, enabling reliable comparisons across runs.
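To make the access-control and error-recovery constraints concrete, here is a minimal sketch of a permission check followed by a fallback after a denied action. Everything in it is an assumption made for illustration: the `ACL` table, the role names, and the helpers `insert_event` and `insert_with_fallback` are hypothetical and do not reflect the Calendar Gym's real data model or tools.

```python
# Hypothetical ACL table: (user, calendar) -> role. Not the Calendar Gym's data model.
ACL = {
    ("alice", "alice-cal"): "owner",
    ("alice", "team-cal"): "reader",
    ("bob", "team-cal"): "owner",
}
WRITE_ROLES = {"owner", "writer"}


def insert_event(user: str, calendar_id: str, summary: str, store: dict) -> dict:
    """Create an event only if the user's role on the calendar allows writes."""
    role = ACL.get((user, calendar_id))
    if role not in WRITE_ROLES:
        # Return a structured error instead of raising, so the caller can react.
        return {"success": False, "error": "permission_denied", "calendarId": calendar_id}
    store.setdefault(calendar_id, []).append(summary)
    return {"success": True, "calendarId": calendar_id}


def insert_with_fallback(user: str, candidates: list, summary: str, store: dict) -> list:
    """Try each candidate calendar in order and stop at the first success."""
    attempts = []
    for calendar_id in candidates:
        result = insert_event(user, calendar_id, summary, store)
        attempts.append(result)
        if result["success"]:
            break
    return attempts


store = {}
attempts = insert_with_fallback("alice", ["team-cal", "alice-cal"], "Team Sync", store)
for attempt in attempts:
    print(attempt)
```

Here the first attempt is denied because alice only has read access to the shared calendar, and the second succeeds on her own calendar. An agent under evaluation would need to do the equivalent: read the error, revise its assumption about permissions, and pick a valid next action.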

Below is a code example of how to use the Calendar Gym. We explore the environment, discover available tools, list calendars, create an event, and print the result.

    from openenv_wrapper.client import MCPEnvClient
    from openenv_wrapper.data_models import MCPAction

    with MCPEnvClient.from_hub(base_url="TuringEnterprises/calendar-gym") as client:
        # Connect and reset the environment
        result = client.reset()
        print("Reset successful:", result.observation.success)

        # Discover available tools
        result = client.step(MCPAction(action_type="ListToolsAction"))
        print("Available tools:", len(result.observation.tools_list))

        # List calendars
        result = client.step(MCPAction(
            action_type="ToolCallAction",
            tool_name="calendars_list",
            arguments={}
        ))
        calendars = result.observation.tool_result["items"]
        print("Calendars:", calendars)

        # Create an event
        result = client.step(MCPAction(
            action_type="ToolCallAction",
            tool_name="events_insert",
            arguments={
                "calendarId": "primary",
                "summary": "Team Sync",
                "start": {"dateTime": "2026-01-15T14:00:00Z"},
                "end": {"dateTime": "2026-01-15T15:00:00Z"}
            }
        ))
        print("Event created:", result.observation.success)

Below is an excerpt of what the Calendar Gym returns when you call ListToolsAction. Each entry includes the tool name plus an input schema (what arguments the tool accepts).

    {
      "tools_list": [
        {
          "name": "calendars_list",
          "description": "List calendars visible to the current user.",
          "input_schema": {
            "type": "object",
            "properties": {},
            "additionalProperties": false
          }
        },
        {
          "name": "events_insert",
          "description": "Create an event in a calendar.",
          "input_schema": {
            "type": "object",
            "properties": {
              "calendarId": { "type": "string" },
              "summary": { "type": "string" },
              "start": {
                "type": "object",
                "properties": { "dateTime": { "type": "string" } },
                "required": ["dateTime"]
              },
              "end": {
                "type": "object",
                "properties": {
                  "dateTim
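One way a client could use such an input schema is to check tool arguments locally before sending a call. The sketch below is a hand-rolled check that supports only the `type`, `properties`, and `required` keywords, not a full JSON Schema validator. `validate` and `PY_TYPES` are hypothetical helpers, and the schema contains only the fields visible in the excerpt above; the truncated `end` entry is left out rather than guessed.

```python
# Schema fragment copied from the visible part of the excerpt; the truncated
# "end" entry is omitted rather than reconstructed.
EVENTS_INSERT_SCHEMA = {
    "type": "object",
    "properties": {
        "calendarId": {"type": "string"},
        "summary": {"type": "string"},
        "start": {
            "type": "object",
            "properties": {"dateTime": {"type": "string"}},
            "required": ["dateTime"],
        },
    },
}

# Only the two JSON types that appear in the fragment are mapped.
PY_TYPES = {"string": str, "object": dict}


def validate(value, schema, path="arguments"):
    """Return a list of problems; handles only 'type', 'properties', 'required'."""
    problems = []
    expected = PY_TYPES.get(schema.get("type"))
    if expected is not None and not isinstance(value, expected):
        # A wrong type makes deeper checks meaningless, so stop here.
        return [f"{path}: expected {schema['type']}"]
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}: missing required field")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                problems.extend(validate(value[key], sub_schema, f"{path}.{key}"))
    return problems


good = {
    "calendarId": "primary",
    "summary": "Team Sync",
    "start": {"dateTime": "2026-01-15T14:00:00Z"},
}
bad = {"calendarId": 42, "start": {}}

print(validate(good, EVENTS_INSERT_SCHEMA))
print(validate(bad, EVENTS_INSERT_SCHEMA))
```

The first call prints an empty list. The second reports that `calendarId` has the wrong type and that `start.dateTime` is missing. For real use, a complete JSON Schema validator would be the safer choice.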
