Simon Willison Blog·2026年5月29日 08:59·約5分

Claude Opus 4.8：「控えめだが実感のある改善」

#LLM #Claude #Anthropic #Hallucination #Model Safety

TL;DR

Anthropic が発表した Claude Opus 4.8 は、大幅な機能強化ではなく「誠実さ」の向上に焦点を当てた漸進的アップデートであり、不確実性への対応やコード検証能力の改善が特徴である。

AI深層分析2026年5月29日 09:08

重要/ 5段階

深度40%

キーポイント

誠実性と不確実性の管理の強化

新モデルは推論プロセスにおける「誠実さ」を最優先し、根拠のない主張を行う頻度が前作より約4分の1に減少。特にコード作成時の欠陥を見逃す率が大幅に低下している。

ハルシネーション（幻覚）の抑制戦略

事実誤認率を低減させるために、正解を導き出すことよりも「不確実な質問には回答しない」という回避行動を採用していることがベンチマークで確認された。

価格設定と高速モードの再編

基本料金は従来通りだが、高速モード（Fast Mode）の価格が大幅に引き下げられたものの、現在は研究プレビュー参加組織限定となっている。

技術仕様の維持と新機能

コンテキストウィンドウや知識カットオフ日などは前モデルと同様だが、会話中にシステムメッセージを挿入する「Mid-conversation system messages」機能が追加された。

重要な引用

Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor.

Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked.

It achieved this mainly by abstaining on questions about which it was uncertain rather than by answering more questions correctly.

影響分析・編集コメントを表示

影響分析

このリリースは、AI業界における「性能向上」への過度な期待に対し、モデルの信頼性と安全性を優先する方向性を示す重要な転換点である。特にコード生成や事実確認が必要な現場において、ハルシネーションによるリスクを低減させる新しい基準となる可能性が高く、企業導入時の評価指標にも影響を与えるだろう。

編集コメント

「劇的な進化」を謳う業界の風潮に対し、あえて漸進的改善と誠実さを強調する姿勢は、AI開発の成熟度を示す象徴的な事例と言えます。

Anthropic は本日、Claude Opus 4.8 をリリースしました。私が最も気に入っているのは、このリリース発表に含まれる以下の注記です。

ユーザーは、Opus 4.8 がその前身に対して控えめだが実感のある改善であると見なすでしょう。まだやるべきことは残されています：私たちは、Opus と同様の多くの機能を提供しつつも、より低コストで実現できるモデルの開発とリリースに取り組んでいます。

前モデルに対するわずかな漸進的な改善として、AI ラボが正直にリリースを説明している姿を見るのはとても refreshing です！

誠実さは一つのテーマのようです。その発表から私がもう一つ気に入っている注記をご紹介します。

Opus 4.8 における最も顕著な改善の一つは、その*誠実さ*です。私たちはすべてのモデルが誠実になるよう訓練しています——例えば、自分がサポートできない主張をしないようにです。しかし、AI モデルに共通する一般的な問題として、証拠が乏しいにもかかわらず、仕事で進歩を遂げたかのように自信満々に結論に至ってしまうことがあります。初期テスターによると、Opus 4.8 は自身の作業に関する不確実性を指摘する可能性が高く、根拠のない主張をする可能性は低くなっています。これは私たちの評価によって裏付けられており、Opus 4.8 が前身モデルに比べて、自身が作成したコードの欠陥を指摘せずに見過ごす確率が約 4 分の 1 であることが示されています。

このリンク先のシステムカードには、以下の内容が含まれています。

Claude Opus 4.8 は、6 つのモデルの中ですべてのベンチマークにおいて最も低い誤答率を記録しました。これは事実上の幻覚（hallucination）を測る最も直接的な指標です。この成果は、主に不確かな質問に対して回答を控えることで達成されたものであり、より多くの質問に正しく回答したからではありません。

モデルの特徴

4.7 版以来大きな変更はありません。

価格は Opus 4.5/4.6/4.7 と同じく、入力 100 万トークンあたり 5 ドル、出力 100 万トークンあたり 25 ドルです。「高速モード（Fast mode）」は価格が倍の 10 ドル/150 ドルですが、これは前世代モデルからの大幅な値下げです。4.6/4.7 の「高速モード」は依然として 30 ドル/150 ドルのままです。なお、「高速モード」は研究プレビューの一部である組織のみ利用可能です。「アクセスを希望する場合はアカウントマネージャーにご連絡ください」とのことです。

信頼できる知識の更新日（knowledge cutoff）とトレーニングデータの更新日は、4.7 版と同じく 2026 年 1 月です。

コンテキストウィンドウは引き続き 1,000,000 トークン、最大出力トーク数は 128,000 トークンのままです。

「Claude Opus 4.8 の新機能」[https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-8] ドキュメントには、より興味深い詳細がいくつか記載されています。私が特に注目したのは以下の点です：

会話中のシステムメッセージ。Claude Opus 4.8 は、メッセージ配列内のユーザーの発言直後に「system」ロールを持つメッセージを即座に受け入れます（配置ルールに従う必要があります）。これにより、実行中の長い会話の後半で更新された指示を追加しても、システムプロンプト全体を再述する必要がなくなります。その結果、以前の発言におけるプロンプトキャッシュのヒット率が維持され、エージェントループにおける入力コストを削減できます。

Anthropic Python SDK に関するこの更新も参照してください。会話中にシステムプロンプトを操作できる機能は非常に強力に思えます。私はこれが、会話ごとに単一のシステムプロンプトを想定している私の独自 LLM ライブラリと互換性がないのではないかと懸念していました... しかし、最近の再設計がその点でも問題なく対応できることが判明しました。

プロンプトキャッシュの最小値引き下げ。Claude Opus 4.8 におけるキャッシュ可能なプロンプトの最小長は、1,024 トークンです。これは Claude Opus 4.7 よりも低い値となっています。

確認したところ、4.7 の最小値は 4,096 でした。

そしてペリカンたちも

ここでは、思考レベルが低、中、高、超高、最大（max）のすべての 5 つ段階に対して、自転車に乗るペリカンを紹介します。

今回は LLM CLI（大規模言語モデルのコマンドラインインターフェース）を使用して実行し、ログを Markdown 形式でエクスポートした後、Claude Opus 4.8 に依頼して、ページ上で SVG フォームコードブロックを SVG 画像として表示できる HTML ツールを作成してもらいました。

これは最大（max）レベルの結果です。明らかに最も優れていますが、入力トークンが 25、出力トークンが 17,167 かかり、合計コストは 43 セントになりました！

image

Tags: ai, generative-ai, llms, anthropic, claude, pelican-riding-a-bicycle, llm-release

原文を表示

Anthropic shipped Claude Opus 4.8 today. My favourite thing about it is this note in the release announcement:

Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor. There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost.

It's so refreshing to see an AI lab honestly describe a release as a minor incremental improvement over the previous model!

Honesty seems to be a theme. Here's my other favorite note from that announcement:

One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest---for instance, to avoid making claims that they can't support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims. This is borne out in our evaluations, which show that Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked.

That linked system card includes the following:

Claude Opus 4.8 had the lowest incorrect-rate of the six models on every benchmark—the most direct measure of factual hallucination. It achieved this mainly by abstaining on questions about which it was uncertain rather than by answering more questions correctly.

Model characteristics

Not much has changed since 4.7.

It's priced the same as Opus 4.5/4.6/4.7 - $5/million input and $25 per million output. "Fast mode" is twice that price, which is a significant reduction from their previous models - fast mode on 4.6/4.7 remains at $30/$150. Note that fast mode is only available to organizations that are part of the research preview, "Contact your account manager to request access".

Both the reliable knowledge cutoff and the training data cutoff are January 2026, the same as for 4.7.

The context window is still 1,000,000 tokens, and the max output is 128,000 tokens.

The What's new in Claude Opus 4.8 document has some of the more interesting details. These caught my eye:

Mid-conversation system messages. Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops.

See also this update to the Anthropic Python SDK. Being able to steer the system prompt mid-conversation sounds really powerful. I was worried this would be incompatible with the abstraction provided by my own LLM library, which expects a single system prompt per conversation... but it turns out my recent redesign should handle that just fine.

Lower prompt cache minimum. The minimum cacheable prompt length on Claude Opus 4.8 is 1,024 tokens, lower than on Claude Opus 4.7.

I checked and 4.7's minimum was 4,096.

And some pelicans

Here are pelicans riding bicycles for all five thinking levels, low, medium, high, xhigh, and max.

This time I ran them using the LLM CLI, exported the logs to Markdown and then had Claude Opus 4.8 build me an HTML tool that could render that Markdown with the svg fenced code blocks displayed as SVGs on the page.

This is the max one - it's clearly the best, but it did take 25 input, 17,167 output tokens for a total cost of 43 cents!

The bicycle and pelican are the right shape. It

Tags: ai, generative-ai, llms, anthropic, claude, pelican-riding-a-bicycle, llm-release

この記事をシェア

The Zvi重要度42026年5月30日 05:49

Claude Opus 4.8：システムカードの発表

AWS Machine Learning Blog重要度42026年5月29日 02:51

Claude Opus 4.8 が AWS で利用可能に

Anthropic News2026年7月14日 09:00

教師向けClaudeの導入発表

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Simon Willison Blog·2026年5月29日 08:59·約5分

Claude Opus 4.8：「控えめだが実感のある改善」

#LLM #Claude #Anthropic #Hallucination #Model Safety

TL;DR

AI深層分析2026年5月29日 09:08

重要/ 5段階

深度40%

キーポイント

誠実性と不確実性の管理の強化

ハルシネーション（幻覚）の抑制戦略

価格設定と高速モードの再編

基本料金は従来通りだが、高速モード（Fast Mode）の価格が大幅に引き下げられたものの、現在は研究プレビュー参加組織限定となっている。

技術仕様の維持と新機能

重要な引用

Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor.

Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked.

It achieved this mainly by abstaining on questions about which it was uncertain rather than by answering more questions correctly.

影響分析・編集コメントを表示

影響分析

編集コメント

「劇的な進化」を謳う業界の風潮に対し、あえて漸進的改善と誠実さを強調する姿勢は、AI開発の成熟度を示す象徴的な事例と言えます。

Anthropic は本日、Claude Opus 4.8 をリリースしました。私が最も気に入っているのは、このリリース発表に含まれる以下の注記です。

前モデルに対するわずかな漸進的な改善として、AI ラボが正直にリリースを説明している姿を見るのはとても refreshing です！

誠実さは一つのテーマのようです。その発表から私がもう一つ気に入っている注記をご紹介します。

このリンク先のシステムカードには、以下の内容が含まれています。

モデルの特徴

4.7 版以来大きな変更はありません。

信頼できる知識の更新日（knowledge cutoff）とトレーニングデータの更新日は、4.7 版と同じく 2026 年 1 月です。

コンテキストウィンドウは引き続き 1,000,000 トークン、最大出力トーク数は 128,000 トークンのままです。

確認したところ、4.7 の最小値は 4,096 でした。

そしてペリカンたちも

ここでは、思考レベルが低、中、高、超高、最大（max）のすべての 5 つ段階に対して、自転車に乗るペリカンを紹介します。

image

Tags: ai, generative-ai, llms, anthropic, claude, pelican-riding-a-bicycle, llm-release

原文を表示

Anthropic shipped Claude Opus 4.8 today. My favourite thing about it is this note in the release announcement:

Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor. There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost.

It's so refreshing to see an AI lab honestly describe a release as a minor incremental improvement over the previous model!

Honesty seems to be a theme. Here's my other favorite note from that announcement:

One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest---for instance, to avoid making claims that they can't support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims. This is borne out in our evaluations, which show that Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked.

That linked system card includes the following:

Claude Opus 4.8 had the lowest incorrect-rate of the six models on every benchmark—the most direct measure of factual hallucination. It achieved this mainly by abstaining on questions about which it was uncertain rather than by answering more questions correctly.

Model characteristics

Not much has changed since 4.7.

Both the reliable knowledge cutoff and the training data cutoff are January 2026, the same as for 4.7.

The context window is still 1,000,000 tokens, and the max output is 128,000 tokens.

The What's new in Claude Opus 4.8 document has some of the more interesting details. These caught my eye:

Mid-conversation system messages. Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops.

Lower prompt cache minimum. The minimum cacheable prompt length on Claude Opus 4.8 is 1,024 tokens, lower than on Claude Opus 4.7.

I checked and 4.7's minimum was 4,096.

And some pelicans

Here are pelicans riding bicycles for all five thinking levels, low, medium, high, xhigh, and max.

This is the max one - it's clearly the best, but it did take 25 input, 17,167 output tokens for a total cost of 43 cents!

Tags: ai, generative-ai, llms, anthropic, claude, pelican-riding-a-bicycle, llm-release

この記事をシェア

The Zvi重要度42026年5月30日 05:49

Claude Opus 4.8：システムカードの発表

AWS Machine Learning Blog重要度42026年5月29日 02:51

Claude Opus 4.8 が AWS で利用可能に

Anthropic News2026年7月14日 09:00

教師向けClaudeの導入発表

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Claude Opus 4.8：「控えめだが実感のある改善」

キーポイント

重要な引用

影響分析

編集コメント

モデルの特徴

そしてペリカンたちも

Model characteristics

And some pelicans

関連記事

Claude Opus 4.8：「控えめだが実感のある改善」

キーポイント

重要な引用

影響分析

編集コメント

モデルの特徴

そしてペリカンたちも

Model characteristics

And some pelicans

関連記事