Allen AI (AI2) · June 12, 2026, 17:00 · About a 10-minute read

OLMO-EVAL: An evaluation workbench for the model development loop

#LLM #Open Source #Evaluation Benchmarks #Allen AI #Model Development

TL;DR

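Allen AI has released olmo-eval, an evaluation workbench that supports the entire LLM development process, offering a new standard for continuous model improvement and reproducible analysis.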

AI deep analysis · June 13, 2026, 02:03

Importance (5-point scale)

Depth: 40%

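Key points

An evaluation workbench built for the development loop

Existing tools lean toward evaluating finished models or testing in sandboxed environments. olmo-eval is instead designed for continuous evaluation and analysis during model development, for example after data changes or hyperparameter adjustments.

Extending the OLMES standard with more flexibility

It builds on the existing open standard OLMES while allowing customization of where and how benchmarks run, and it supports agentic and multi-turn evaluation as first-class use cases.

Moving beyond container dependence and optimizing cost

Rather than being tied to a fully containerized environment as in Harbor, it lets simple tasks such as question answering run on a lightweight path, which improves resource efficiency.

Stronger analysis of statistical significance

Beyond simple score comparison, it includes analysis tools for judging whether an intervention produced a real improvement or merely noise.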


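Impact analysis

This release could significantly influence the standardization and efficiency of evaluation in LLM development, and in particular could encourage the open-source model community to build reproducible experimental setups. It signals a shift from evaluating finished models toward evaluating continuous improvement, which contributes to higher-quality model development cycles.

Editorial comment

A tool that evaluates a model mid-development and lets you analyze where to improve right away is important infrastructure for speeding up LLM research. The features that help separate noise from real improvement are especially valuable for reducing mistaken judgments in practice.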

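While you are building an LLM, you evaluate it again and again across many interventions. Every adjustment to its data, architecture, or hyperparameters, and every step up in scale, sends you back through the same loop: adding or reconfiguring benchmarks, re-running them on each new model checkpoint, recording the results, and checking whether something that helped in a small experiment still holds up in the full training run. Most evaluation tools are not designed for this. They are either built to run established benchmarks on finished models, or they run a model through multi-step, tool-using problems in a sandbox. They cannot keep up with a model that is constantly changing, nor do they reflect how a model behaves under specific real-world conditions.

Our previous project to address this evaluation challenge was OLMES, the Open Language Model Evaluation Standard. Introduced in 2024, it was meant to make LLM benchmark scores easier to compare across releases. The same models were being scored on the same benchmarks in different ways, with aspects such as prompt formatting and task formulation often varying from paper to paper, so many claims about which model performed best were not reproducible. OLMES pinned these benchmarking choices down in an open, documented standard, and it became the basis for evaluating our open models, from Olmo to Tulu.

But a model's final score is only part of the evaluation process. That is why we are releasing olmo-eval, a new workbench that builds on OLMES and extends it to the rest of LLM development. Compared to OLMES, olmo-eval reduces the work of implementing new evaluations, offers more flexibility in defining where and how they run, and makes it easier to compose individual components into larger workflows. Agentic and multi-turn evaluation is supported as a first-class use case, and stronger analysis tools make it easier to judge whether an intervention actually improved on the baseline or whether the difference is just noise.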

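How olmo-eval differs from existing tools

olmo-eval overlaps in some ways with Harbor, an open framework for evaluating AI agents inside containerized, sandboxed environments. But the two tools differ in scope. Harbor is aimed mainly at running and publishing agent benchmarks, while olmo-eval was built for the everyday work of model development: adding and configuring benchmarks, running them across checkpoints, and analyzing results prompt by prompt rather than as a single overall score.

Harbor runs everything the same way, inside sealed, reproducible containers. Because containers can be resource-intensive, olmo-eval instead lets you choose how each benchmark runs. A benchmark that only needs a model to answer questions can run directly, which is faster and cheaper. A benchmark that needs a locked-down environment (for example, one that executes code the model wrote) gets an isolated container setup. The lightweight path is the default, and olmo-eval opts for the heavy setup only when a benchmark actually requires it.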
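The per-benchmark choice described above can be pictured as a small routing rule. The following is a minimal sketch under stated assumptions, not olmo-eval's actual API: `Runtime`, `BenchmarkSpec`, its two requirement flags, and `choose_runtime` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Runtime(Enum):
    DIRECT = "direct"    # lightweight path: just query the model
    SANDBOX = "sandbox"  # isolated container, e.g. for model-written code


@dataclass(frozen=True)
class BenchmarkSpec:
    # Hypothetical description of what a benchmark needs at run time.
    name: str
    needs_code_execution: bool = False
    needs_network_isolation: bool = False


def choose_runtime(spec: BenchmarkSpec) -> Runtime:
    """Default to the cheap path; escalate only when a requirement demands it."""
    if spec.needs_code_execution or spec.needs_network_isolation:
        return Runtime.SANDBOX
    return Runtime.DIRECT


qa = BenchmarkSpec(name="simple_qa")
coding = BenchmarkSpec(name="code_tasks", needs_code_execution=True)
print(choose_runtime(qa).value)      # direct
print(choose_runtime(coding).value)  # sandbox
```

The design point is that the benchmark declares its requirements and the cheap path is the fallback, so nothing pays for a container unless it asks for one.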

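Harbor's process for adding a benchmark is designed for evals that you plan to publish and share, and it includes the extra verification steps that entails. olmo-eval is built for moving quickly during development, and how you add a benchmark depends on what the benchmark needs. A basic eval takes a short definition, with options to let the model use tools. For a benchmark that already has its own code and procedure, a thin wrapper lets olmo-eval run it as is and report the results in the same format as other benchmark scores.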


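Both Harbor and olmo-eval keep benchmarks separate from the runtime policy (how the model is run to produce its answers), so you can change one without rewriting the other, but olmo-eval is designed for greater modularity. In olmo-eval, the model being evaluated, the tools it can use, the containerized environment, and any helper models, such as an LLM-as-a-judge, are all swappable components. You can reuse a tool across many harnesses, plug a grading model into one benchmark without affecting the others, and adjust small settings, such as the exact wording of a prompt, without extensive effort.

Harbor reports an overall score for each model. olmo-eval reports those scores too, each with a standard error and a minimum detectable effect (the smallest difference that can be reliably distinguished from noise). The more useful view, however, lines up the same questions across two model checkpoints and compares them one by one, with everything else held fixed. This helps you see whether a small change in the overall average reflects a real improvement or just noise.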
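To make the standard error, minimum detectable effect, and question-by-question comparison concrete, here is a minimal sketch using textbook formulas. The article does not say how olmo-eval computes these quantities, so the normal approximation, the default `alpha` and `power` values, and the function names are all assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, stdev


def standard_error(scores):
    """Standard error of the mean of per-question scores."""
    return stdev(scores) / math.sqrt(len(scores))


def paired_comparison(base, new, alpha=0.05, power=0.8):
    """Compare two checkpoints on the same questions, one by one."""
    if len(base) != len(new):
        raise ValueError("both runs must cover the same questions")
    diffs = [n - b for b, n in zip(base, new)]
    delta = mean(diffs)
    se = standard_error(diffs)
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Smallest true difference this sample could reliably detect
    # (two-sided test, normal approximation).
    mde = (z_alpha + z.inv_cdf(power)) * se
    return {
        "delta": delta,
        "se": se,
        "mde": mde,
        "improved": sum(d > 0 for d in diffs),
        "regressed": sum(d < 0 for d in diffs),
        "exceeds_noise": abs(delta) > z_alpha * se,
    }


# Toy per-question correctness (1 = correct) for two checkpoints.
base = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
new = [1, 1, 1, 1, 0, 1, 1, 0, 0, 1]
result = paired_comparison(base, new)
print(result)
```

In this toy run the new checkpoint gains two questions and loses one, a net change of 0.1 in accuracy, which on only ten questions falls well inside the noise band. Pairing on the same questions is what keeps the comparison tight: only the questions whose outcome changed contribute to the variance.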

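If you are looking for... | olmo-eval provides

Authoring a multi-example benchmark | A Task subclass with a DataSource, metrics, and an evaluation scope

Wrapping an existing agentic benchmark that has its own runner | ExternalEval or SandboxedExternalEval; the benchmark keeps its own loop and scoring, and results are stored in the schema

Swapping the runtime under a fixed benchmark | Harnesses and harness presets; a harness carries the provider, tools, scaffold, sandbox, and auxiliary providers

Parallel container execution | Sandbox instances for parallel execution with capability-based routing, in Docker or Modal mode

Defining tools that are reusable across tasks and harnesses | The @tool decorator and an optional global registry

Multi-turn execution loops | Scaffolds such as openai_agents, selected per harness rather than embedded in the task definition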
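The table mentions a @tool decorator with an optional global registry, but the article does not show its signature. As a rough illustration of the general pattern only, the sketch below uses a hypothetical `tool_sketch` decorator and `GLOBAL_TOOLS` dictionary; neither name, nor the `register_globally` parameter, comes from olmo-eval.

```python
from typing import Callable, Dict, Optional

# Hypothetical global registry; the real one may look quite different.
GLOBAL_TOOLS: Dict[str, Callable[..., str]] = {}


def tool_sketch(name: Optional[str] = None, *, register_globally: bool = False):
    """Mark a function as a tool and optionally add it to the global registry."""
    def decorator(fn: Callable[..., str]) -> Callable[..., str]:
        fn.tool_name = name or fn.__name__  # attach metadata to the function
        if register_globally:
            GLOBAL_TOOLS[fn.tool_name] = fn
        return fn
    return decorator


@tool_sketch(register_globally=True)
def word_count(text: str) -> str:
    """Toy tool: count whitespace-separated words."""
    return str(len(text.split()))


@tool_sketch(name="shout")
def to_upper(text: str) -> str:
    """Toy tool that is not placed in the global registry."""
    return text.upper()


# Two hypothetical harnesses reuse the same tool objects without redefining them.
harness_a_tools = {word_count.tool_name: word_count}
harness_b_tools = {word_count.tool_name: word_count, to_upper.tool_name: to_upper}

print(sorted(GLOBAL_TOOLS))            # ['word_count']
print(harness_b_tools["shout"]("ok"))  # OK
```

Because a tool is an ordinary function carrying a little metadata, it can be handed to any number of harnesses, which is the reuse property the table describes.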

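An integrated evaluation stack

olmo-eval is composed of four components that are each useful on their own but designed to work together to tighten the experimental LLM development loop:

A task/suite/harness abstraction that decouples benchmark logic from runtime policy. A task is how you define a benchmark in olmo-eval (what is being evaluated). A suite groups tasks into a set you run together, and a harness controls how each task is run. This separation lets the same task run as a standard baseline or with tools and scaffolding, without changing what it measures.
A sandbox and capability-routing layer, including an asynchronous sandbox planner. This supports evaluations in which a model's response depends on actions it takes using tools, such as writing and running code or browsing the web. The point is to evaluate the model's real tool use: when a benchmark calls for tools, olmo-eval runs those tools and feeds the results back to the model.
A normalized experiment schema that records every run, its configuration, and its results in the same structured format. This makes it possible to group related experiments, compare checkpoints over time, and avoid the inconsistencies that tend to accumulate in long-running model development workflows.
A results viewer for pairwise model comparison. Lining up two models or checkpoints question by question surfaces small but real performance changes that an overall average can hide.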
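The separation between what a task measures and how a harness runs it can be shown with a toy model. This is a sketch of the idea only: `ToyTask`, `ToyHarness`, `run`, and the record fields are hypothetical and far simpler than olmo-eval's real classes and schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ToyTask:
    # What is measured: the questions and the scoring rule.
    name: str
    examples: Tuple[Tuple[str, str], ...]  # (question, gold answer)

    def score(self, prediction: str, gold: str) -> float:
        return float(prediction.strip().lower() == gold.strip().lower())


@dataclass(frozen=True)
class ToyHarness:
    # How the model is run: here, just a function from prompt to answer.
    name: str
    answer: Callable[[str], str]


def run(task: ToyTask, harness: ToyHarness) -> Dict[str, object]:
    """Run one task under one harness and return a uniform record."""
    per_question: List[float] = [
        task.score(harness.answer(q), gold) for q, gold in task.examples
    ]
    return {
        "task": task.name,
        "harness": harness.name,
        "per_question": per_question,
        "accuracy": sum(per_question) / len(per_question),
    }


task = ToyTask(
    "capitals",
    (("Capital of France?", "Paris"), ("Capital of Japan?", "Tokyo")),
)

# Two stand-in "models": one plain, one with a lookup tool available.
lookup = {"Capital of France?": "Paris", "Capital of Japan?": "Tokyo"}
baseline = ToyHarness("baseline", lambda q: "Paris")
with_tool = ToyHarness("lookup_tool", lambda q: lookup.get(q, ""))

suite = [task]  # in this sketch a suite is just a group of tasks
for harness in (baseline, with_tool):
    for t in suite:
        print(run(t, harness))
```

The task object is never modified between the two runs, so any difference in accuracy is attributable to the harness. Every run also produces a record of the same shape, which is what makes later per-question comparison straightforward.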

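In most model evaluation setups, adding a benchmark is a sizeable integration project. In olmo-eval, all that is needed is a *task*. A task defines the benchmark dataset, how evaluation requests are built, and how model answers are scored: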

from olmo_eval.common.formatters import ChatFormatter
from olmo_eval.common.metrics import AccuracyMetric
from olmo_eval.common.scorers import ExactMatchScorer
from olmo_eval.common.types import Instance, SamplingParams
from olmo_eval.data import DataLoader, DataSource
from olmo_eval.evals.tasks.common import Task, register, register_variant

@register("internal_freshqa")
class InternalFreshQA(Task):
    data_source = DataSource(path="s3://evals/internal/freshqa.jsonl", split="test")
    formatter = ChatFormatter()
    sampling_params = SamplingParams(temperature=0.0)
    metrics = (AccuracyMetric(scorer=ExactMatchScorer),)

    @property
    def instances(self):
        loader = DataLoader()
        for idx, doc in enumerate(loader.load(self.config.get_data_source())):
            yield Instance(
                question=doc["question"],
                gold_answer=doc["answer"],
                metadata={"id": doc.get("id", f"freshqa_{idx}")},
            )

Variants express changes in evaluation policy without duplicating the benchmark:

register_variant("internal_freshqa", "3shot", num_fewshot=3, fewshot_seed=1234)
register_variant("internal_freshqa", "zero", num_fewshot=0)
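One way to picture why variants avoid duplication is as named sets of overrides merged onto a task's base configuration. The sketch below is an assumption about the general mechanism, not olmo-eval's implementation: `register_task`, `register_variant_sketch`, `resolve`, and the two dictionaries are hypothetical, and `resolve` handles only the simple `task:variant` form.

```python
from typing import Any, Dict

# Hypothetical in-memory registries; not olmo-eval's real data structures.
_BASE_CONFIGS: Dict[str, Dict[str, Any]] = {}
_VARIANTS: Dict[str, Dict[str, Dict[str, Any]]] = {}


def register_task(name: str, **defaults: Any) -> None:
    """Record a task's default evaluation settings."""
    _BASE_CONFIGS[name] = dict(defaults)
    _VARIANTS.setdefault(name, {})


def register_variant_sketch(task: str, variant: str, **overrides: Any) -> None:
    """Record only the settings that differ from the task's defaults."""
    if task not in _BASE_CONFIGS:
        raise KeyError(f"unknown task: {task}")
    _VARIANTS[task][variant] = dict(overrides)


def resolve(spec: str) -> Dict[str, Any]:
    """Turn 'task' or 'task:variant' into a merged configuration."""
    task, _, variant = spec.partition(":")
    config = dict(_BASE_CONFIGS[task])
    if variant:
        config.update(_VARIANTS[task][variant])
    return config


register_task("internal_freshqa", num_fewshot=0, temperature=0.0)
register_variant_sketch("internal_freshqa", "3shot", num_fewshot=3, fewshot_seed=1234)
register_variant_sketch("internal_freshqa", "zero", num_fewshot=0)

print(resolve("internal_freshqa:3shot"))
print(resolve("internal_freshqa:zero"))
```

Each variant stores just its differences, so the dataset, formatter, and scoring live in one place and every variant is guaranteed to measure the same thing.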

Suites group benchmarks into standard sets that you run together:

from olmo_eval.evals.suites import Suite, register

register(Suite(
    name="base_qa_few_shot",
    tasks=(
        "sciq:mc:3shot",
        "arc_challenge:mc:3shot",
        "internal_freshqa:mc:3shot",
    ),
))

Because runtime policy lives in the harness rather than in the task definition, the same benchmark can easily be rerun under different execution setups.

# Baseline
olmo-eval run -m my-instruct-checkpoint -t internal_freshqa:zero

# Same task, same scoring, search/tool runtime enabled
olmo-eval run -m my-instruct-checkpoint -t internal_freshqa:zero --harness search_agent

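Reproducible evaluation made open

Use olmo-eval when evaluation is part of ongoing model development rather than a one-off run, that is, when you need to run the same benchmarks repeatedly across checkpoints under reproducible conditions and compare interventions at both the aggregate and the per-question level.

If your recurring question is "How does this checkpoint differ from the last one, and where exactly did it improve or regress?", that is the workflow olmo-eval was built for.

Reproducible evaluation should keep pace with how models are built, not only with how they are scored once they are finished. olmo-eval carries the OLMES standard into active model development, and we are releasing it openly so the community can build on it.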


Original text (English)

While you're building an LLM, you evaluate it over and over across many interventions. Every adjustment to its data, architecture, or hyperparameters – and every step up in scale – sends you back through the same loop: adding or reconfiguring benchmarks, re-running them on each new model checkpoint, noting the results, and checking whether something that helped in a small experiment still holds up on the full training run. Most evaluation tools aren't designed for this—they’re either built to run established benchmarks across finished models or run a model through multi-step, tool-using problems in a sandbox. They don’t keep up with a model that's constantly changing, nor do they reflect how a model might behave under specific real-world conditions.

Our last project to address this evaluation challenge was OLMES, the Open Language Model Evaluation Standard. Introduced in 2024, it was meant to make LLM benchmark scores easier to compare across releases. The same models were being scored on the same benchmarks in different ways – aspects like prompt formatting and task formulation often varied from paper to paper – so claims about which models performed best often weren't reproducible. OLMES pinned benchmarking choices down in an open, documented standard, and it became the basis for evaluating our open models from Olmo to Tulu.

But a model's final score is only part of the evaluation process—which is why we're releasing olmo-eval, a new workbench that builds on OLMES and extends it across the rest of LLM development. Compared to OLMES, olmo-eval cuts down the work of implementing new evaluations, offers more flexibility in defining where and how they run, and makes it easier to compose individual components into larger workflows. Agentic and multi-turn evaluation is supported as a first-class use case, and stronger analysis tools help you judge whether an intervention actually improved on the baseline or the difference amounts to noise.

How olmo-eval differs from existing tools

olmo-eval overlaps in some ways with Harbor, an open framework for evaluating AI agents inside containerized, sandboxed environments. But the two tools differ in their scope. Harbor is aimed mainly at running and publishing agent benchmarks; olmo-eval was built for the everyday work of developing a model—adding and configuring benchmarks, running them across checkpoints, and analyzing the results prompt by prompt instead of as a single overall score.

Harbor runs everything the same way—inside sealed, reproducible containers. Because containers can be resource-intensive, olmo-eval lets you choose how each benchmark runs instead. A benchmark that just needs a model to answer questions can run directly, which is faster and cheaper; a benchmark that needs a locked-down environment – say, one that runs code the model wrote – gets an isolated container setup. The lightweight path is the default, and olmo-eval only opts for the heavy setup when a benchmark actually requires it.

Harbor's process for adding a benchmark is built for evals you plan to publish and share publicly, with the extra verification steps that entails. olmo-eval is built for moving quickly while you develop, and how you add a benchmark depends on what the benchmark needs: a short definition for a basic eval, with options to let a model use tools as it works through a benchmark, or – for a benchmark that already has its own code and procedure – a thin wrapper so olmo-eval can run it as is and report the results alongside other benchmark scores in the same format.

Both Harbor and olmo-eval keep benchmarks separate from the runtime policy (how the model is run to produce its answers) so you can change one without rewriting the other, but olmo-eval is designed for greater modularity. In olmo-eval, the model being evaluated, the tools it can use, the containerized environment, and any helper models – like an LLM-as-a-judge – are all swappable components. You can reuse a tool across many harnesses, or plug a grading model into one benchmark without perturbing the others, and adjust small settings (e.g., the exact wording of the prompt) without extensive effort.

Harbor reports an overall score for each model. olmo-eval reports those scores too, each with a standard error and a minimum detectable effect (the smallest difference that can be reliably distinguished from noise). But the more useful view lines the same questions up across two model checkpoints and compares them one by one, with all else held fixed. This helps you to see whether a tiny change in an overall average might indicate a real improvement or simply noise.

An integrated evaluation stack

olmo-eval is composed of four components that are useful on their own but designed to work together to tighten the experimental LLM development loop:

A task/suite/harness abstraction that decouples benchmark logic from runtime policy. A task is how you define a benchmark in olmo-eval—what's being evaluated. A suite groups tasks into a set you run together, and a harness controls how each task is run. This separation lets the same task run as a standard baseline or with tools and scaffolding, without changing what it measures.
A sandbox and capability-routing layer, including an asynchronous sandbox planner. This supports evaluations where a model's response depends on the actions it takes using tools, like writing and running code or browsing the web. The point is to evaluate the model's real tool use: when a benchmark calls for tools, olmo-eval runs those tools and feeds the results back to the model.
A normalized experiment schema that records every run, its configuration, and the results in the same structured format. This makes it possible to group related experiments, compare checkpoints over time, and avoid the inconsistencies that often accumulate in long-running model development workflows.
A results viewer for pairwise model comparison: lining two models or checkpoints up question by question surfaces small but real performance changes that an overall average can hide.

In most model evaluation setups, adding a benchmark is a sizeable integration project. In olmo-eval, all that’s needed is a *task*—tasks define the benchmark dataset, how evaluation requests are built, and how model answers are scored:

code

from olmo_eval.common.formatters import ChatFormatter
from olmo_eval.common.metrics import AccuracyMetric
from olmo_eval.common.scorers import ExactMatchScorer
from olmo_eval.common.types import Instance, SamplingParams
from olmo_eval.data import DataLoader, DataSource
from olmo_eval.evals.tasks.common import Task, register, register_variant

@register("internal_freshqa")
class InternalFreshQA(Task):
    data_source = DataSource(path="s3://evals/internal/freshqa.jsonl", split="test")
    formatter = ChatFormatter()
    sampling_params = SamplingParams(temperature=0.0)
    metrics = (AccuracyMetric(scorer=ExactMatchScorer),)

    @property
    def instances(self):
        loader = DataLoader()
        for idx, doc in enumerate(loader.load(self.config.get_data_source())):
            yield Instance(
                question=doc["question"],
                gold_answer=doc["answer"],
                metadata={"id": doc.get("id", f"freshqa_{idx}")},
            )

Variants express changes in evaluation policy without duplicating the benchmark:

code

register_variant("internal_freshqa", "3shot", num_fewshot=3, fewshot_seed=1234)
register_variant("internal_freshqa", "zero", num_fewshot=0)

Suites group benchmarks into standard sets you run together:

code

from olmo_eval.evals.suites import Suite, register

register(Suite(
    name="base_qa_few_shot",
    tasks=(
        "sciq:mc:3shot",
        "arc_challenge:mc:3shot",
        "internal_freshqa:mc:3shot",
    ),
))

And because runtime policy lives in the harness rather than the task definition, the same benchmark can easily be rerun under different execution setups.

code

# Baseline
olmo-eval run -m my-instruct-checkpoint -t internal_freshqa:zero

# Same task, same scoring, search/tool runtime enabled
olmo-eval run -m my-instruct-checkpoint -t internal_freshqa:zero --harness search_agent

Reproducible evaluation made open

Use olmo-eval when evaluation is part of ongoing model development rather than a one-off run—when you need to run the same benchmarks repeatedly across checkpoints under reproducible conditions and compare interventions at both the aggregate and per-question level.

If your recurring question is "How does this checkpoint differ from the last one, and where exactly did it improve or regress?", that’s the workflow olmo-eval is built for.

Reproducible evaluation should keep pace with how models are built—not only how they're scored once they're finished. olmo-eval carries the OLMES standard into active model development, and we're releasing it openly so the community can build on it.

