TLDR AI · June 2, 2026 09:00 · About a 5-minute read

Opus 4.8 Breaks Through ARC-AGI-3 (1-minute read)

#Reasoning #LLM Benchmark #Cost Efficiency #Anthropic #OpenAI

TL;DR

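LisanBench, a new evaluation benchmark featured in TLDR AI, compares the reasoning ability and constraint adherence of large language models such as Opus 4 and o3 in detail, and highlights the advantage of cost-efficient reasoning models.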

AI In-Depth Analysis · June 11, 2026 22:11


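Key Points

Introduction of the new LisanBench benchmark

A new scoring method is proposed that uses a word-chain game (each next English word must differ from the previous one by exactly one letter) to evaluate an LLM's knowledge, forward planning, constraint adherence, memory, and long-context stamina.

Reasoning-model performance versus cost efficiency

o3 posts the highest score but consumes a very large number of tokens (about 30,000 to 40,000 reasoning tokens per starting word), whereas Opus 4 beat o3 on 3 starting words with only about one third of the reasoning tokens, which points to the value of cost-efficient reasoning.

Non-reasoning models and other vendors compared

Grok-3 and Sonnet 3.5/3.7 rank among the strongest non-reasoning models, while the Gemini models produce by far the longest outputs yet tend to keep going without noticing their mistakes, and score somewhat below their Anthropic and OpenAI counterparts.

Verifiability and scalability

Because it needs no external embedding model, can be verified with dictionary data alone, and is cheap to run (about $50 for 57 models), the design makes large-scale model comparison and difficulty tuning easy.

Flexible difficulty adjustment and scaling

Difficulty can be tuned through the choice of starting words (ranging from words with 72 neighbors down to words with just 1), and accuracy can be improved by adding more starting words or more trials per word.

Multifaceted abilities demanded of LLMs

LisanBench simultaneously tests several advanced abilities that are essential for agentic use, including forward planning, broad vocabulary knowledge, long-context reasoning, and output stamina.

Searching for the Golden Path

For the hardest starting word, "abysmal", the longest chain found so far is only 2. The real winning condition is finding the best route over hundreds of steps while avoiding dead ends.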


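Impact Analysis

This article goes beyond a simple comparison of performance scores and addresses the practical question of how to perform advanced reasoning with fewer resources, which may influence how companies choose LLMs and discuss cost optimization. In particular, Gemini's tendency to keep generating after a mistake and Opus's cost efficiency give developers and technical decision-makers useful criteria for model selection.

Editorial Comment

The post offers an important way to re-evaluate model performance from the standpoint of reasoning cost. In particular, the finding that high scores are achievable without expensive compute could significantly change model-selection criteria in production environments.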

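Introducing LisanBench

LisanBench is a simple, scalable, and precise benchmark designed to evaluate large language models on knowledge, forward planning, constraint adherence, memory and attention, and long-context reasoning and "stamina".

"I see possible futures, all at once. Our enemies are all around us, and in so many futures they prevail. But I do see a way, there is a narrow way through." - Paul Atreides

How it works:

Models are given a starting English word and must generate the longest possible chain of valid English words. Each subsequent word in the chain must:

Differ from the previous word by exactly one letter (Levenshtein distance = 1)
Be a valid English word
Not repeat any previously used word

The benchmark repeats this process across multiple starting words of varying difficulty. A model's final score is the sum, over all starting words, of the length of its longest valid chain from each one.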
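The three rules and the scoring step can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: the function names are hypothetical, and it assumes that "Levenshtein distance = 1" permits a single substitution, insertion, or deletion, that a chain is cut at the first invalid word, and that the starting word itself is not counted in the length.

```python
def within_one_edit(a: str, b: str) -> bool:
    """Return True when the Levenshtein distance between a and b is exactly 1."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    # Lengths differ by one: deleting one character from the longer
    # word must give the shorter word (insertion or deletion).
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def valid_chain_length(start: str, words: list[str], dictionary: set[str]) -> int:
    """Count the words that follow `start` before the first rule violation."""
    seen = {start}
    prev = start
    count = 0
    for word in words:
        if word in seen or word not in dictionary or not within_one_edit(prev, word):
            break  # assumption: the chain ends at the first invalid word
        seen.add(word)
        prev = word
        count += 1
    return count


def total_score(attempts: dict[str, list[list[str]]], dictionary: set[str]) -> int:
    """Sum, over starting words, of the longest valid chain among its trials."""
    return sum(
        max((valid_chain_length(start, trial, dictionary) for trial in trials), default=0)
        for start, trials in attempts.items()
    )
```

Here `attempts` maps each starting word to the list of chains a model produced across its trials, so adding starting words or trials only grows the input and leaves the scoring logic unchanged.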

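Results:

o3 is by far the best model, mainly because it is the only model that manages to escape from parts of the graph with very low connectivity and many dead ends.

(Slight caveat: o3 was by far the most expensive model to run, using roughly 30,000 to 40,000 reasoning tokens per starting word.)

Opus 4 and Sonnet 4, with 16,000 reasoning tokens, also perform extremely well. Opus in particular beat o3 on 3 starting words with only about one third of the reasoning tokens!

Claude 3.7 with thinking takes 4th place, ahead of o1.

The other OpenAI reasoning models all perform well, but size does make a difference! o1 is ahead of o4-mini high and o3-mini.

Gemini models perform a bit worse than their Anthropic and OpenAI counterparts, but they have by far the longest outputs. They are a bit delusional and keep talking; they do not notice when they have made a mistake, and they do not stop.

Strongest non-reasoning models: Grok-3, GPT-4.5, Sonnet 3.5 and 3.7, Opus 4, Sonnet 4, DeepSeek-V3, and Gemini 1.5 Pro.

Grok 3 and Sonnet 3.5 and 3.7 are a surprise!

Inspiration:

LisanBench draws on benchmarks such as AidanBench and SOLO-Bench. Unlike AidanBench, however, it is extremely cost-effective, trivially verifiable, and does not rely on an embedding model. The entire benchmark run cost only about $50 for 57 models.

And unlike SOLO-Bench, it explicitly tests knowledge and applies stronger constraints, which makes it more challenging!

Verification:

Verification uses the words_alpha.txt dictionary from github.com/dwyl/english-w… (about 370,105 words), but for scalability only words in the largest connected component (108,448 words) are used.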
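One way to restrict a word list to its largest connected component is to build the one-edit graph and run a breadth-first search from each unvisited word. The sketch below is an assumption about how this could be done, not the author's code: `build_graph` and `largest_component` are hypothetical names, edges are assumed to cover single substitutions, insertions, and deletions, and the word list is assumed to contain only lowercase letters so that "_" is safe as a wildcard.

```python
from collections import defaultdict, deque


def build_graph(words: set[str]) -> dict[str, set[str]]:
    """Connect every pair of words that are one edit apart."""
    graph: dict[str, set[str]] = {w: set() for w in words}

    # Substitutions: words sharing a pattern such as "c_t" differ in one position.
    buckets: dict[str, list[str]] = defaultdict(list)
    for w in words:
        for i in range(len(w)):
            buckets[w[:i] + "_" + w[i + 1:]].append(w)
    for group in buckets.values():
        for a in group:
            for b in group:
                if a != b:
                    graph[a].add(b)

    # Insertions and deletions: drop one character and look the result up.
    for w in words:
        for i in range(len(w)):
            shorter = w[:i] + w[i + 1:]
            if shorter in words:
                graph[w].add(shorter)
                graph[shorter].add(w)
    return graph


def largest_component(graph: dict[str, set[str]]) -> set[str]:
    """Return the node set of the largest connected component via BFS."""
    visited: set[str] = set()
    best: set[str] = set()
    for start in graph:
        if start in visited:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        visited |= component
        if len(component) > len(best):
            best = component
    return best
```

With a real dictionary, the words would be loaded from the file into a set first; the resulting component is then the only vocabulary accepted when checking chains.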

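Easy scaling, difficulty adjustment, and accuracy improvements:

Scaling and accuracy: simply add more starting words or increase the number of trials per word.

Difficulty: starting words vary widely, from those with 72 neighbors to those with just 1, which effectively separates moderately strong models from elite ones. Difficulty can also be gauged via local connectivity and branching factor.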
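The neighbor counts mentioned above can be computed directly from the dictionary. The following is a rough sketch under stated assumptions: `neighbors` and `difficulty_profile` are hypothetical helpers, a neighbor is any dictionary word one substitution, insertion, or deletion away, words are lowercase ASCII, and "dead-end neighbor" is this example's own proxy for local connectivity (a neighbor whose only neighbor is the starting word), not a metric defined by the benchmark.

```python
import string


def neighbors(word: str, dictionary: set[str]) -> set[str]:
    """All dictionary words exactly one edit away from `word`."""
    candidates: set[str] = set()
    for i in range(len(word) + 1):
        for c in string.ascii_lowercase:
            candidates.add(word[:i] + c + word[i:])  # insertion at position i
        if i < len(word):
            candidates.add(word[:i] + word[i + 1:])  # deletion of position i
            for c in string.ascii_lowercase:
                candidates.add(word[:i] + c + word[i + 1:])  # substitution
    candidates.discard(word)
    return candidates & dictionary


def difficulty_profile(word: str, dictionary: set[str]) -> dict[str, int]:
    """Degree of `word` plus how many of its neighbors lead nowhere else."""
    first_hop = neighbors(word, dictionary)
    dead_ends = {n for n in first_hop if neighbors(n, dictionary) == {word}}
    return {"degree": len(first_hop), "dead_end_neighbors": len(dead_ends)}
```

A starting word with a low degree, or with many dead-end neighbors, leaves a model little room for error on its first move, which is one plausible reading of why such words separate stronger models from weaker ones.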

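Why is it challenging?

LisanBench uniquely stresses:

Forward planning: avoiding dead ends through strategic word choices; models must find the narrow way through
Knowledge: a wide vocabulary is essential
Memory and attention: previously used words must not be repeated
Precision: strict adherence to the Levenshtein distance constraint
Long-context reasoning: coherence and constraint tracking over hundreds of steps
Output stamina: some models break down early during long generations; LisanBench exposes that, which is critical for agentic use cases

The two plots below show that the starting words differ greatly in difficulty. Some sit in low-connectivity regions, some in high-connectivity regions, and others are simply surrounded by dead ends!

Just as Paul Atreides had to navigate the political, cultural, and metaphysical maze of his destiny, LLMs in LisanBench must explore vast word graphs in search of the Golden Path: the longest viable chain without collapse.

We will know the chosen model when it appears.

It will be the one that finds the Golden Path and avoids every dead end. Right now, for the most difficult starting word, "abysmal", the longest chain found is just 2, even though the word belongs to the connected component of more than 100k words. So there is a narrow way through!

More detailed plots and the full leaderboard are below.

This benchmark is worse than AidanBench in one respect: because it operates at the word and character level rather than the sentence and paragraph level, it is affected by tokenization. All else being equal, models with better tokenizers should therefore perform better.

I tested only 10 starting words; running 25 or 50 would probably improve stability.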
Original text

Introducing LisanBench

LisanBench is a simple, scalable, and precise benchmark designed to evaluate large language models on knowledge, forward-planning, constraint adherence, memory and attention, and long context reasoning and "stamina".

"I see possible futures, all at once. Our enemies are all around us, and in so many futures they prevail. But I do see a way, there is a narrow way through." - Paul Atreides

How it works:

Models are given a starting English word and must generate the longest possible sequence of valid English words. Each subsequent word in the chain must:

Differ from the previous word by exactly one letter (Levenshtein distance = 1)

Be a valid English word

Not repeat any previously used word

The benchmark repeats this process across multiple starting words of varying difficulty. A model's final score is the cumulative length of its longest valid chains from the starting words.

Results:

o3 is by far the best model, mainly because it is the only model that manages to escape from parts of the graph with very low connectivity and many dead-ends

(slight caveat: o3 was by far the most expensive one to run and used ~30-40k reasoning tokens per starting word)

Opus 4 and Sonnet 4 with 16k reasoning tokens also perform extremely well, especially Opus, which was able to beat o3 at 3 starting words with only one third of the reasoning tokens!

Claude 3.7 with thinking takes 4th place, ahead of o1

other OpenAI reasoning models all perform well, but size does make a difference! o1 is ahead of o4-mini high and o3-mini

Gemini models perform a bit worse than their Anthropic and OpenAI counterparts, but they have by far the longest outputs - they are a bit delusional and keep yapping; they don't notice and stop when they have made a mistake

strongest non-reasoning models: Grok-3, GPT-4.5, Sonnet 3.5 and 3.7, Opus 4, Sonnet 4, DeepSeek-V3 and Gemini 1.5 Pro

Grok 3, Sonnet 3.5 and 3.7 are a surprise!!

Inspiration:

LisanBench draws from benchmarks like AidanBench and SOLO-Bench. However, unlike AidanBench, it’s extremely cost-effective, trivially verifiable and doesn't rely on an Embedding model - the entire benchmark cost only ~$50 for 57 models.

And unlike SOLO-Bench, it explicitly tests knowledge and applies stronger constraints, which makes it more challenging!

Verification:

Verification uses the words_alpha.txt dictionary from github.com/dwyl/english-w… (~370,105 words), but for scalability, only words from the largest connected component (108,448 words) are used.

Easy Scaling, Difficulty Adjustment & Accuracy improvements:

Scaling and Accuracy: Just add more starting words or increase the number of trials per word.

Difficulty: Starting words vary widely - from those with 72 neighbors to those with just 1 - effectively distinguishing between moderately strong and elite models. Difficulty can also be gauged via local connectivity and branching factor.

Why is it challenging?

LisanBench uniquely stresses:

Forward planning: avoiding dead ends by strategic word choices - models must find the narrow way through

Knowledge: wide vocabulary is essential

Memory and Attention: previously used words must not be repeated

Precision: strict adherence to Levenshtein constraints

Long-context reasoning: coherence and constraint-tracking over hundreds of steps

Output stamina: some models break early during long generations — LisanBench exposes that, which is critical for agentic use cases

The two beautiful plots below show that the starting words are very different in difficulty. Some are in low connectivity regions, some in high-connectivity regions and others are just surrounded by dead-ends!

Just as Paul Atreides had to navigate the political, cultural, and metaphysical maze of his destiny, LLMs in LisanBench must explore vast word graphs, searching for the Golden Path - the longest viable chain without collapse.

We will know the chosen model when it appears.

It will be the one that finds the Golden Path and avoids every dead end. Right now, for the most difficult starting word "abysmal", the longest chain found is just 2, although it is also part of the >100k connected component. So there is a narrow way through!

More plots with full leaderboard below!

It is worse than AidanBench in one regard. Because it operates on a word / character level and not on a sentence / paragraph level, it is affected by tokenization! So models with better tokenizers, all else being equal, should perform better.

And I only tested 10 words; doing 25 or 50 for good measure would probably help with stability.


