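Hugging Face Blog · June 13, 2026 00:56 · About an 11-minute read

OLMO-EVAL: An evaluation workbench for the model development loop

#LLM #MLOps #Evaluation #Hugging Face

TL;DR

A post on the Hugging Face Blog announces OLMO-EVAL, an evaluation workbench that supports continuous validation and improvement throughout the model development process.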


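Key points

An evaluation workbench is announced

OLMO-EVAL, a new evaluation tool aimed at making model development more efficient, has been released and announced on the Hugging Face Blog.

Support for continuous validation

The tool is meant to help developers validate model performance continuously and keep the improvement cycle running smoothly.

Optimizing the development loop

It provides infrastructure intended to remove bottlenecks in conventional development workflows and speed up model iteration.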


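Impact analysis

The announcement strengthens standard evaluation practice in AI model development and is a meaningful step toward more systematic quality assurance. Manual validation tends to become a bottleneck when developing large models; with this workbench, an approach resembling continuous integration (CI) could become common in evaluation as well.

Editor's comment

The arrival of a tool dedicated to the evaluation phase of model development is a sign of MLOps maturing. It appears practical and is likely to improve productivity for development teams.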


💻 Code: https://github.com/allenai/olmo-eval

While you're building an LLM, you evaluate it over and over across many interventions. Every adjustment to its data, architecture, or hyperparameters — and every step up in scale — sends you back through the same loop: adding or reconfiguring benchmarks, re-running them on each new model checkpoint, noting the results, and checking whether something that helped in a small experiment still holds up on the full training run.

Most evaluation tools aren't designed for this: they are built either to run established benchmarks on finished models or to run a model through multi-step, tool-using problems in a sandbox. They don't keep up with a model that's constantly changing, nor do they reflect how a model might behave under specific real-world conditions.

Our last project to address this evaluation challenge was OLMES, the Open Language Model Evaluation Standard. Introduced in 2024, it was meant to make LLM benchmark scores easier to compare across releases. The same models were being scored on the same benchmarks in different ways — aspects like prompt formatting and task formulation often varied from paper to paper — so claims about which models performed best often weren't reproducible. OLMES pinned benchmarking choices down in an open, documented standard, and it became the basis for evaluating our open models from Olmo to Tulu.

But a model's final score is only part of the evaluation process—which is why we're releasing olmo-eval, a new workbench that builds on OLMES and extends it across the rest of LLM development. Compared to OLMES, olmo-eval cuts down the work of implementing new evaluations, offers more flexibility in defining where and how they run, and makes it easier to compose individual components into larger workflows. Agentic and multi-turn evaluation is supported as a first-class use case, and stronger analysis tools help you judge whether an intervention actually improved on the baseline or the difference amounts to noise.

How olmo-eval differs from existing tools

*Is a 2.4pp change in performance enough to make a call?*

olmo-eval overlaps in some ways with Harbor, an open framework for evaluating AI agents inside containerized, sandboxed environments. But the two tools differ in their scope. Harbor is aimed mainly at running and publishing agent benchmarks; olmo-eval was built for the everyday work of developing a model—adding and configuring benchmarks, running them across checkpoints, and analyzing the results prompt by prompt instead of as a single overall score.

Harbor runs everything the same way—inside sealed, reproducible containers. Because containers can be resource-intensive, olmo-eval lets you choose how each benchmark runs instead. A benchmark that just needs a model to answer questions can run directly, which is faster and cheaper; a benchmark that needs a locked-down environment — say, one that runs code the model wrote — gets an isolated container setup. The lightweight path is the default, and olmo-eval only opts for the heavy setup when a benchmark actually requires it.
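The "lightweight by default, container only when required" rule can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `BenchmarkSpec`, `SANDBOX_CAPABILITIES`, and `choose_execution_mode` are hypothetical names, not olmo-eval's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BenchmarkSpec:
    """Hypothetical description of what a benchmark needs at runtime."""
    name: str
    required_capabilities: frozenset = field(default_factory=frozenset)


# Assumption: capabilities that, in this sketch, force an isolated container.
SANDBOX_CAPABILITIES = frozenset({"code_execution", "shell", "browser"})


def choose_execution_mode(spec: BenchmarkSpec) -> str:
    """Return "direct" unless the benchmark needs a sandboxed capability."""
    if spec.required_capabilities & SANDBOX_CAPABILITIES:
        return "container"
    return "direct"


qa = BenchmarkSpec("internal_freshqa")
coding = BenchmarkSpec("code_repair", frozenset({"code_execution"}))

print(choose_execution_mode(qa))      # direct
print(choose_execution_mode(coding))  # container
```

The design point is that the benchmark declares what it needs and the runner decides how to satisfy it, so a question-answering benchmark never pays container startup costs.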

Harbor's process for adding a benchmark is built for evals you plan to publish and share publicly, with the extra verification steps that entails. olmo-eval is built for moving quickly while you develop, and how you add a benchmark depends on what the benchmark needs: a short definition for a basic eval, with options to let a model use tools as it works through a benchmark, or — for a benchmark that already has its own code and procedure — a thin wrapper so olmo-eval can run it as is and report the results alongside other benchmark scores in the same format.

Both Harbor and olmo-eval keep benchmarks separate from the runtime policy (how the model is run to produce its answers) so you can change one without rewriting the other, but olmo-eval is designed for greater modularity. In olmo-eval, the model being evaluated, the tools it can use, the containerized environment, and any helper models – like an LLM-as-a-judge – are all swappable components. You can reuse a tool across many harnesses, or plug a grading model into one benchmark without perturbing the others, and adjust small settings (e.g., the exact wording of the prompt) without extensive effort.
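To make the idea of reusing one tool across many harnesses concrete, here is a small sketch of a decorator-backed registry. It is not olmo-eval's implementation: `TOOL_REGISTRY`, this `tool` decorator, the `HARNESSES` mapping, and both tool functions are hypothetical stand-ins.

```python
from typing import Callable, Dict

# Hypothetical global registry; olmo-eval's real @tool decorator may differ.
TOOL_REGISTRY: Dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Register a function under `name` so any harness can look it up."""
    def decorator(fn):
        if name in TOOL_REGISTRY:
            raise ValueError(f"tool {name!r} is already registered")
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator


@tool("calculator")
def calculator(expression: str) -> str:
    """Add '+'-separated integers, e.g. '2+3+4'."""
    return str(sum(int(part) for part in expression.split("+")))


@tool("echo")
def echo(text: str) -> str:
    """Return the input unchanged."""
    return text


# Two hypothetical harness presets that refer to the same tool by name.
HARNESSES = {
    "math_agent": ["calculator"],
    "debug_agent": ["calculator", "echo"],
}


def tools_for(harness: str) -> Dict[str, Callable[..., str]]:
    """Resolve a harness's tool names to the registered functions."""
    return {name: TOOL_REGISTRY[name] for name in HARNESSES[harness]}


print(tools_for("math_agent")["calculator"]("2+3+4"))  # 9
```

Because harnesses hold tool names rather than tool code, swapping or adding a tool for one harness leaves the others untouched.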

Harbor reports an overall score for each model. olmo-eval reports those scores too, each with a standard error and a minimum detectable effect (the smallest difference that can be reliably distinguished from noise). But the more useful view lines the same questions up across two model checkpoints and compares them one by one, with all else held fixed. This helps you to see whether a tiny change in an overall average might indicate a real improvement or simply noise.
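The quantities mentioned above can be worked through with textbook formulas. The sketch below assumes a binomial standard error for accuracy, a two-sided normal-approximation minimum detectable effect (MDE) for two independent runs, and a paired per-question difference; olmo-eval's exact definitions may differ, and the toy data is made up.

```python
import math
from statistics import NormalDist


def standard_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)


def minimum_detectable_effect(p: float, n: int, alpha: float = 0.05,
                              power: float = 0.80) -> float:
    """Approximate MDE when comparing two independent runs of n questions."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * math.sqrt(2.0 * p * (1.0 - p) / n)


def paired_difference(old: list, new: list) -> tuple:
    """Mean per-question change and its standard error (same question order)."""
    if len(old) != len(new):
        raise ValueError("both checkpoints must answer the same questions")
    n = len(old)
    diffs = [b - a for a, b in zip(old, new)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean, math.sqrt(var / n)


# Accuracy of 70% on 1,000 questions (illustrative numbers).
print(f"{standard_error(0.70, 1000):.4f}")             # 0.0145
print(f"{minimum_detectable_effect(0.70, 1000):.4f}")  # 0.0574

# Toy per-question correctness (1 = correct) for two checkpoints.
old = [1, 1, 0, 0, 1, 0, 1, 1]
new = [1, 1, 1, 0, 1, 1, 1, 0]
mean, se = paired_difference(old, new)
print(mean)  # 0.125
```

Under these assumptions, a 2.4pp gap between two unpaired runs of 1,000 questions sits well below the roughly 5.7pp MDE. Pairing the same questions usually tightens the estimate, because questions that both checkpoints answer the same way contribute no noise.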

| If you're looking for... | olmo-eval offers |
| --- | --- |
| Authoring a multi-example benchmark | Task subclass with a DataSource, metrics, and scoring surface |
| Wrapping an existing agent-style benchmark with its own runner | ExternalEval or SandboxedExternalEval; the benchmark keeps its loop and scoring, and results land in olmo-eval's schema |
| Swapping the runtime under a fixed benchmark | --harness and harness presets; the harness carries provider, tools, scaffold, sandboxes, and auxiliary providers |
| Parallel container execution | Sandbox instances for parallel executors with capability-based routing, Docker or Modal modes |
| Tool definitions reusable across tasks and harnesses | @tool decorator with optional global registry |
| Multi-turn execution loops | Scaffolds, e.g., openai_agents, selected per harness, not baked into the task definition |

An integrated evaluation stack

olmo-eval is composed of four components that are useful on their own but designed to work together to tighten the experimental LLM development loop:

A task/suite/harness abstraction that decouples benchmark logic from runtime policy. A task is how you define a benchmark in olmo-eval—what's being evaluated. A suite groups tasks into a set you run together, and a harness controls how each task is run. This separation lets the same task run as a standard baseline or with tools and scaffolding, without changing what it measures.

A sandbox and capability-routing layer, including an asynchronous sandbox planner. This supports evaluations where a model's response depends on the actions it takes using tools, like writing and running code or browsing the web. The point is to evaluate the model's real tool use: when a benchmark calls for tools, olmo-eval runs those tools and feeds the results back to the model.

A normalized experiment schema that records every run, its configuration, and the results in the same structured format. This makes it possible to group related experiments, compare checkpoints over time, and avoid the inconsistencies that often accumulate in long-running model development workflows.

A results viewer for pairwise model comparison: lining two models or checkpoints up question by question surfaces small but real performance changes that an overall average can hide.
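As a rough illustration of why a single normalized record format makes pairwise comparison easy, the sketch below stores one flat record per question and lists the questions whose outcome changed between two checkpoints. The `RunRecord` fields, the `"default"` harness name, and the checkpoint names are hypothetical; olmo-eval's actual schema is not reproduced here.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical flat record; field names are illustrative only."""
    experiment: str
    checkpoint: str
    task: str
    harness: str
    question_id: str
    correct: bool


records = [
    RunRecord("lr_sweep", "step1000", "internal_freshqa:zero", "default", "q1", True),
    RunRecord("lr_sweep", "step1000", "internal_freshqa:zero", "default", "q2", False),
    RunRecord("lr_sweep", "step2000", "internal_freshqa:zero", "default", "q1", True),
    RunRecord("lr_sweep", "step2000", "internal_freshqa:zero", "default", "q2", True),
]


def changed_questions(rows, old: str, new: str) -> list:
    """Return ids of questions whose correctness differs between checkpoints."""
    by_checkpoint = defaultdict(dict)
    for row in rows:
        by_checkpoint[row.checkpoint][row.question_id] = row.correct
    shared = by_checkpoint[old].keys() & by_checkpoint[new].keys()
    return sorted(q for q in shared
                  if by_checkpoint[old][q] != by_checkpoint[new][q])


# Every record serializes to the same JSON shape, whatever produced it.
print(json.dumps(asdict(records[0])))
print(changed_questions(records, "step1000", "step2000"))  # ['q2']
```

Because every run lands in the same shape, grouping by experiment or diffing two checkpoints is a dictionary lookup rather than a per-benchmark parsing job.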

In most model evaluation setups, adding a benchmark is a sizeable integration project. In olmo-eval, all that’s needed is a task—tasks define the benchmark dataset, how evaluation requests are built, and how model answers are scored (all code in Python):

```python
from olmo_eval.common.formatters import ChatFormatter
from olmo_eval.common.metrics import AccuracyMetric
from olmo_eval.common.scorers import ExactMatchScorer
from olmo_eval.common.types import Instance, SamplingParams
from olmo_eval.data import DataLoader, DataSource
from olmo_eval.evals.tasks.common import Task, register, register_variant

@register("internal_freshqa")
class InternalFreshQA(Task):
    # Where the benchmark data lives
    data_source = DataSource(path="s3://evals/internal/freshqa.jsonl", split="test")
    # How evaluation requests are built
    formatter = ChatFormatter()
    sampling_params = SamplingParams(temperature=0.0)
    # How model answers are scored
    metrics = (AccuracyMetric(scorer=ExactMatchScorer),)

    @property
    def instances(self):
        loader = DataLoader()
        for idx, doc in enumerate(loader.load(self.config.get_data_source())):
            yield Instance(
                question=doc["question"],
                gold_answer=doc["answer"],
                # Fall back to a positional id when the record has none
                metadata={"id": doc.get("id", f"freshqa_{idx}")},
            )
```

Variants express changes in evaluation policy without duplicating the benchmark:

```python
register_variant("internal_freshqa", "3shot", num_fewshot=3, fewshot_seed=1234)
register_variant("internal_freshqa", "zero", num_fewshot=0)
```

Suites group benchmarks into standard sets you run together:

```python
from olmo_eval.evals.suites import Suite, register

register(Suite(
    name="base_qa_few_shot",
    tasks=(
        "sciq:mc:3shot",
        "arc_challenge:mc:3shot",
        "internal_freshqa:mc:3shot",
    ),
))
```

And because runtime policy lives in the harness rather than the task definition, the same benchmark can easily be rerun under a different execution setup:

```bash
# Baseline
olmo-eval run -m my-instruct-checkpoint -t internal_freshqa:zero

# Same task, same scoring, search/tool runtime enabled
olmo-eval run -m my-instruct-checkpoint -t internal_freshqa:zero --harness search_agent
```

Reproducible evaluation made open

Use olmo-eval when evaluation is part of ongoing model development rather than a one-off run—when you need to run the same benchmarks repeatedly across checkpoints under reproducible conditions and compare interventions at both the aggregate and per-question level.
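One way to picture "the same benchmarks repeatedly across checkpoints" is as a command matrix. The sketch below only builds and prints command lines; it uses just the `run`, `-m`, `-t`, and `--harness` arguments shown earlier, and the checkpoint names and the `build_commands` helper are hypothetical.

```python
import shlex
from itertools import product


def build_commands(checkpoints, tasks, harnesses):
    """Yield one argv list per (checkpoint, task, harness) combination."""
    for checkpoint, task, harness in product(checkpoints, tasks, harnesses):
        argv = ["olmo-eval", "run", "-m", checkpoint, "-t", task]
        if harness is not None:  # None stands for the default runtime
            argv += ["--harness", harness]
        yield argv


commands = list(build_commands(
    checkpoints=["ckpt-step1000", "ckpt-step2000"],  # hypothetical names
    tasks=["internal_freshqa:zero"],
    harnesses=[None, "search_agent"],
))

# Print the commands instead of running them, so the sketch has no side effects.
for argv in commands:
    print(shlex.join(argv))
```

Holding the task list fixed while varying only the checkpoint or the harness is what keeps the resulting scores comparable.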

If your recurring question is “How does this checkpoint differ from the last one, and where exactly did it improve or regress?”, that’s the workflow olmo-eval is built for.

Reproducible evaluation should keep pace with how models are built—not only how they're scored once they're finished. olmo-eval carries the OLMES standard into active model development, and we're releasing it openly so the community can build on it.
