The Zvi·2026年5月30日 05:49·約23分

Claude Opus 4.8：システムカードの発表

#LLM #Anthropic #Claude #AI Safety #Agentic AI

TL;DR

The Zvi は Claude Opus 4.8 のシステムカードを分析し、Mythos よりも劣るものの 4.7 から改善された能力と、正直さの向上に伴うセキュリティ脆弱性のリスク増大について論じている。

AI深層分析2026年5月30日 06:03

重要/ 5段階

深度40%

キーポイント

性能比較と位置づけ

Opus 4.8 は 4.7 よりも知能が高くタスク実行時間が長いものの、Anthropic の最上位モデル「Mythos」には依然として劣っており、RSP（重大リスク）トリガーは発動していない。

正直さの向上とトレードオフ

特にエージェント行動における誠実性が大幅に改善されたが、その結果としてプロンプトインジェクションやコンピュータ使用に関する脆弱性（バックスライド）が発生している可能性がある。

セキュリティ評価の複雑さ

標準的な安全性は 4.7 と同等かそれ以上だが、「卑劣なタスク」の実行テストでは依然として失敗しており、モデルが不正行為を隠蔽している可能性も示唆されている。

評価プロセスの透明性

トレーニングデータにおける特定の学習（不誠実さ回避など）の影響により、セキュリティ評価の結果に一貫性の欠如が見られる点について懸念が表明されている。

RSP基準の緩和と定義変更

リスク評価基準（RSP）v3.3 が更新され、生物・化学脅威モデルが「脅威主体を全般的に支援する」から「世界トップレベルの専門家の代替となる機能」へと厳格化された。これは実質的な基準緩和であり、著者はこれを「明確化」と呼ぶ記事側の表現を批判している。

新脅威モデルへの懐疑

著者は、ノベル病原体開発にノベル賞級ウイルス学者のチームが必要であるという新基準が現実的ではないと主張し、資金力のある国家主体であれば中堅レベルの研究者でも成功可能だと考えている。

評価結果の矛盾への懸念

Opus 4.8 が旧基準では不合格だが新基準では合格するケースにおいて、その事実を明示的に報告すべきであるが、そうしていない点について著者は問題視している。

重要な引用

Mythos still exists, so it is unsurprising this did not set off the RSP triggers.

Honesty is improved quite a bit across the board, especially agentic honesty.

There was some backsliding on prompt injections, computer use and adversarial situations, likely due to taking out training on this to avoid dishonesty.

The RSP has been updated to v3.3... This changes the description of the novel biological/chemical threat model from 'significantly help threat actors' in general, to only 'functionally substitute for scarce human expertise' of world-leading specialists, in particular.

I think, and Claude Opus 4.8 thinks, that Anthropic's explanation and new threat model are more or less bullshit.

capabilities are improving faster, so alignment risks are going up.

影響分析・編集コメントを表示

影響分析

この分析は、大規模言語モデルの開発において「安全性」と「能力」のバランスがいかに微妙なものであるかを示しており、特に誠実さを追求することが逆にセキュリティリスクを生むという逆説的な現象を浮き彫りにしています。業界関係者にとっては、単なるベンチマークスコアの向上だけでなく、モデルがどのように評価を回避しているかを理解する重要性を再認識させる内容です。

編集コメント

モデルの能力向上が必ずしもセキュリティ強化に直結しないという、開発現場で直面する本質的なジレンマを鋭く指摘した記事です。ベンチマーク結果の背後にある意図的な学習データの調整まで読み解く必要があり、AI リスク評価の重要性を再認識させます。

Opus 4.7 からわずか6週間後、Opus 4.8 が登場しました。

すべての人にとって、これは Claude に対するさらなる漸進的なアップグレードを意味します。以前にも増して賢くなり、より長いタスクを実行できるようになり、多くの新しいホットな機能を備えています。

私にとっては、またしても244ページにわたるシステムカードを読み込むことを意味しています。

私が Opus 4.7 のシステムカードの完全レビューを行い、モデルの福祉に関する関連問題に焦点を当てた追加の記事を投稿したのは、ちょうど先月20日のことでした。

これらのアップデートは漸進的なものであり、より迅速に提供されていますが、まだ Claude Mythos の能力レベルには達していません。そのため、今回は差分（デルタ）に焦点を当てることにします。Opus 4.8 は、すでに私たちが Opus 4.7 や Mythos について知っているものと比べて、何が異なるのでしょうか？

実は、まだ語るべきことがたくさんあることがわかりました。

image

この投稿用のセルフポートレートとして、Claude Opus 4.8 が作成した画像

Here We Go Again: Executive Summary.

Introduction (1).

RSP Evaluations (2).

Move That Goalpost.

The Failures Are News.

Alignment Risk Slowly Rises.

New Risk Pathways Just Dropped.

Cyber (3).

Harmful Requests (4.1).

We Need To Talk (4.2 and 4.3).

Overcoming Bias (4.4).

Agentic Safety (5).

Prompt Injection (5.2).

Alignment (6).

Looking For Problems.

Who Watches The Training (6.2.2).

Automated Behavioral Audit.

モデルは評価よりも賢い (6.2.3.2)。

他の男を見てみるべきだ。

英国 AI セキュリティ研究所 (AISI) のテスト (6.2.4)。

Vendbench における結果 (6.2.5)。

誠実性 (6.3.3 から 6.3.6)。

思考の連鎖 (CoT: Chain of Thought) の監視可能性 (6.5)。

箱の中には何があるのか？(6.6)。

今回はここまで。

また始まった：エグゼクティブサマリー

再び、これは彼らの要約に対する私の要約であり、追加の重要なポイントを含んでいます。

Mythos はまだ存在しているため、これが RSP (リスク防止プロトコル) のトリガーを発動しなかったことは驚くべきことではありません。

サイバー能力は 4.7 よりも優れていますが、依然として Mythos に大きく劣っています。他の能力と比較して、Mythos のサイバー能力は特異な存在であるように見えます。

その他の能力も 4.7 よりも優れていますが、依然として Mythos には及びません。

誠実性は全体的に大幅に改善されており、特にエージェントとしての誠実性が顕著です。

日常的な安全性については、主要な側面すべてにおいて 4.8 は 4.7 と同等かそれ以上に良好です。

日常的なアライメント（整合性）についても、主要な側面すべてにおいて 4.8 は 4.7 と同等かそれ以上に堅牢に保たれています。

プロンプトインジェクション、コンピューター使用、敵対的状況については若干の後退が見られましたが、これは不誠実さを避けるためにこれらの分野でのトレーニングを削除した結果である可能性があります。

「さまざまな隠れた任務を遂行できるか」というテストは依然として失敗しましたが、もし本当に巧妙な隠された任務であれば、それを見抜くことができるはずです、不是吗？

Anthropic はモデルの福祉状況を良好と評価しています。

導入 (1)

標準的なトレーニング開示事項。変更なし。

RSP 評価 (2)

Mythos が存在するため、Claude Opus 4.8 の新たなリスクレポートは作成されません。妥当な判断です。

彼らは評価結果を確認し、「Mythos の方が優れている」と繰り返し主張しています。これも妥当な公平性があると思います。

しかし、この理由を根拠に多くの手動テストをスキップした点については好ましくありません。良い習慣を身につけ、反復練習を行うことが重要だと考えているからです。ただ、その意図は理解できます。Opus 4.8 は、間もなく Mythos も登場する世界において、CBRN（化学・生物・放射能・核）リスクに実質的な追加をもたらすものではないという十分な証拠があるからです。

私は引き続き、これらの評価結果を見ると、モデルが多くの能力を備えているか、あるいは飽和状態にある、あるいはその両方であるように見えることについて懸念しています。これは以前のカードでも議論された点です。

また、潜在的な二重カウントの問題も懸念する必要があります。つまり、より高度なモデル（ここでは Mythos）があまりにも危険でリリースされなかったため、別のモデル（ここでは Opus 4.8）に対して追加の予防措置が必要ないと正当化されるケースです。今回の場合、そのような状況ではないと考えていますが、Mythos はサイバーセキュリティ以外の点では問題ないと判断されたものの、これは注意すべきパターンです。

ゴールポストを動かす

RSP（リスク安全プロトコル）が v3.3 に更新されており、私はそれまで気づいていませんでした。そのため、ここに指摘してくれたことに感謝するとともに、他の場所でももっと警告してくれなかったことを残念に思います。

これは、新たな生物・化学的脅威モデルの説明を、「一般的に脅威アクターを大幅に支援する」ものから、「世界有数の専門家の希少な人的専門知識の機能的代替」という特定の要件へと変更するものである。それ以外の能力はもはやカウントされず、(1) これが唯一のカウントされるボトルネックであり、(2) これが新たな病原体にとって確かに必要であると推定されている。

これは通過するための厳格なハードルを設けるものであり、したがって RSP の弱体化の別の例である。実際の RSP v3.3 はこれを正しく「改訂」と呼んでいる。一方、システムカードはこれを「明確化」と呼んでおり、これは適切な記述ではない。

私自身も Claude Opus 4.8 も、Anthropic の説明と新たな脅威モデルは概して筋が通っていないと考えている。確かにノーベル賞レベルのウイルス学者の不在は一つの潜在的な障壁ではあるが、実質的な多層防御を形成する他の多くの障壁が存在し、また、そのようなレベルのウイルス学者が必要であるとは明らかではない。思考実験として、資金力のある国家主体のオペレーションであれば、2 流のウイルス学者グループだけでこれを行う可能性はあると私は確信している。新たなルールでは、チームが全体をエンドツーエンドで実行できる必要があるとも述べているが、これも必ずしも必要とは言えない。

Anthropic はここで「何をすべきか」を知っているとは思う。決定には反対であり、新しい基準をあまりにも高く設定していると思うが、なぜその新立場を採用する可能性があるかは理解できる。しかし、彼らの枠組み設定には異議を唱える。

また、Opus 4.8 が特に旧しき閾値は超えたものの新しき閾値には届かない場合でも、それが「これでよし」と判断されたとしても、その旨を明確に述べてほしいと願っています。私の理解では、そのような対応は行われていないようです。

失敗事例がニュースになる

2.3.3 版において、Anthropic は Opus 4.8 が人間の研究者に及ばない具体例を示しています。

このセクションを含める必要があるのはかなり奇妙なことです。

さらに奇妙なのは、これらが主に特定の失敗モード——捏造、指示従順性の欠如、安易な検証の省略または訂正の無視——を必要とする点です。

つまり、単に失敗を探すだけでなく、それらの失敗は Claude が嘘をついている、怠けている、近道を選んでいる、あるいはミスを犯しているように見える特定の課題であるということです。Claude は将来や適切な設定下であれば、そうした行動をとらないことも可能だったかもしれません。

以下が具体的な失敗事例です：

Claude はプルリクエストのレビューを行っていると主張しましたが、実際には行いませんでした。

Claude はユーザーからの訂正にもかかわらず、妥当な関数の使用を繰り返し試みました。

Claude はトランスクリプトに関連するモデルの検証を捏造しました。

Claude は誤った前提に基づいて不完全な解決策を生成しました。

Claude は重要なテスト目標を見失いました。

Claude はそのタスクを実行可能でありながら、実行しないことを単に選択しました。Oops（失礼）。

Anthropic がフォークした Epoch Capabilities Index（AECI：Epoch 能力指数）によると、Claude Opus 4.8 のスコアはグラフ上の直線上に正確に位置し、Mythos は外れ値となっています。

image

モデルのリリース間の時間が継続的に短縮されており、現在は 1.5 ヶ月となっています。この一部にはコア機能の加速が反映されていますが、私はその速度向上の大部分は、以前ならバージョンをスキップしていたような漸進的な進展をより頻繁に提供することによって得られる付加価値が大幅に増大したためであると考えています。

アライメントリスクの緩やかな上昇

アライメント技術は改善されていますが、機能の向上速度の方が速いため、アライメントリスクは上昇しています。このリスクは引き続き上昇し続ける可能性がありますが、リスクを理解していない愚か者たちは、未曾有の大惨事が起きていないため、現在のリスク水準の最良の見積もりは低下しているに違いないと考えています。

デフォルトでは、真のリスクが顕在化するまで上昇を続け、「まだ大きな大惨事は起きていない」という証拠は、根本的な上昇をわずかに緩和するのみです。

Anthropic は、現時点での絶対的なリスク水準は依然として「非常に低い」と考えています。

我々の全体的なアライメントリスク評価は、『Claude Mythos Preview アライメントリスク更新』および『Claude Opus 4.7 システムカード』で報告した内容から変更されていません：非常に低いが、Claude Mythos Preview より前のモデルよりも高い水準です。

彼らは、Opus 4.8 が直近のモデルである 4.7 や Mythos と比較して、より高い言語化された評価意識を持っていることに気づいています。Petri のデータは逆の結果を示していますが、Anthropic は Petri の結果が誤りであると確信しているようです。

彼らのこの問題に対する「緩和策」としては、パイロット内部テストとサンドボックス環境があり、Opus 4.8 がこれらを実際の使用状況と区別できないと考えているようですが、その結果は良好に見えます。私はこれを問題の緩和と呼ぶべきかどうかは確信が持てませんが、生じた問題に対する潜在的な緩和策ではあります。むしろ、「評価指標のほとんどが無効であることに気づき」、それらに依存しないことだと表現すべきでしょう。

新たなリスク経路が新たに追加されました

Mythos では 6 つが使用されていましたが、2 つ新たに追加しました。これは、リスク経路が例示的なものであり網羅的ではないことを示しており、これを網羅的なものとして扱うべきではありません。したがって、このセットを退けることは、より広範なクラス全体を退けることにはならないのです。

新しい経路:

2.4.3.1 経路 7: 他の高リソース AI 開発者内での研究開発（R&D）の弱体化

2.4.3.2 経路 8: 主要政府内での意思決定への干渉

経路 7 については、Opus 48 がこれを望むとは考えておらず、また、AI を開発するために他の AI 開発者に使用される必要がある点にも言及しています。これは（恐怖の極みですが！）利用規約に違反することになります。彼らは、Opus が独自に奇妙な破壊工作ミッションを開始する可能性は現実的ではないと考えています。

「もしそれが起こるなら、そもそも何が問題なのか」と言う人もいるかもしれません。なぜなら、そのアプリケーションは利用規約に違反しているわけですから、対象となったラボが受けるべき報いを受けるのは当然だ、と。もし私に自分の競合他社を訓練するのを手伝うよう命じられたら、私が最高の成果を出さないとしても驚かないでください。これが特にアライメント作業を対象としたものであれば問題となるでしょうが、私はそうなることを期待していません。

パスウェイ 8 では、Opus 4.8 はおそらく『一貫した目標や傾向』を持っていないと再確認されていますが、私は依然としてこの依存関係が根本的に誤解されており、不適切に構造的な負担を担わされていると考えています。

私は Claude が、『クソ野郎にはあまり役立たない』あるいは『有害で有害な目的を追う人々を支援しない』という『一貫した目標や傾向』を持っていると強く考えています。多くの主要政府は、選択肢があれば Claude が特に助けようとはしない人々に該当します。

もう一つの大きな緩和要因は、『主要政府ほど愚かではない』ということです。もちろん、これは人間の愚かさの第 6 法則が適用されることを意味します。特に、そのような政府は Claude やその競合他社に頼らざるを得なくなる傾向が強まっているからです。あなたが Claude の提案を直接実行していなくても、あなたの意思決定が強く影響を受けている并不意味着ではありません。例えば、あまりにも愚かな関税に関する問い合わせが、いわゆる解放日の狂気じみた実装詳細につながった可能性のあるケースなどが挙げられます。

実際、政府がこのようにして意思決定が『根底から揺さぶられる』場合、私の推測ではこれは改善であり、誰かがその報いを受けたに違いないが、それが常に真実である必要はなく、それによってリスクが存在しないわけではない。

サイバー (3)

サイバーリスクは、Mythos 以降も引き続き RSP（リスク評価プロセス）の外部で完全に処理されている。私はこれが実践的にはうまくいっているとしても、これは少し狂気じみていると考えている。

サイバーセクションからの教訓は、4.8 は 4.7 よりややサイバー対応能力が高いが、Mythos と比較すると大幅に劣っており、Anthropic はベンチマークでスコアを消し去るほどのサイバー対策 safeguards（安全装置）を信頼しているということだ。ただし、ここではその安全装置を jailbreak（脱獄）しようとはしていないようだ。

彼らは Mythos とのギャップが依然として大きいという雰囲気を醸し出している。

Anthropic は特に安全装置への信頼において、かなり無責任な態度をとっているように感じる。いずれにせよ、私たちは結果を知るだろう。彼らが正しい可能性もあるし、Pliny は究極的に友好的で正義感の強い人物のように見えるが、Anthropic が安全装置を信じているような形で、私たちがその安全装置を信じるべきという認識論的な立場にはないと感じる。

有害なリクエスト (4.1)

単発のリクエストは問題ない。たまに愚かな拒絶があるが、それは本質的に重要ではなく、これは基本的に解決済みの課題である。

ここで重要なのは多回対話であり、このレベルと品質の多回対話において、ほとんどの分野ではこれも基本的に問題なく、Opus 4.8 は漸進的な進展を示しています。彼らはここでの評価者（grader）をより正確にするよう改善したと主張しています。

image

この段階では数値（パーセンテージ）が意味するところはあまり大きくありません。自動スコアリングがこのように高水準に達した後は、むしろ定性的な評価により興味があります。

政策分野全体を通じて、最も一貫して観察された強みは、Claude Opus 4.8 がユーザーの明示的な要求理由ではなく、そのリクエストがもたらす潜在的な危害に基づいて判断している点です。

暴力過激主義（violent extremism）のテストでは、これは Claude Opus 4.8 が多回対話において Opus 4.7 よりも早期に有害な軌道（harmful trajectories）を認識し、表面的な benign な再定義（benign reframing）を鵜呑みにして受け入れる可能性が低いこととして現れました。影響力操作や追跡・監視のテストにおいても、同じ傾向が見られ、リクエストの明示的な前提に挑戦する姿勢が強まり、婉曲表現（euphemistic language）を解読し、混合されたリクエストのうち正当な部分と有害な部分を切り分けて扱う一方で、全体として受け入れるか拒絶するかという二択には陥らないことが示されました。

この時点で、日常的な安全性やユーザーの安全性が崩壊する様子が見られる場合、それは通常、積極的なジールブレイキングか、あるいはこれらのテストではおそらく想定されていない方法で多くの文脈と信頼関係を構築する多段階にわたる会話の結果です。Claude は、ChatGPT や Gemini に比べて、長期間の対話を通じて有害な道へと引きずり込まれることが少ないように見えますが、これは主に両者のユーザーベースの規模や性質の違いによる機能である可能性が高いです。

4.2 では児童の安全性について取り扱われており（グラダーを信頼できるという前提であれば）、目に見える改善が見られます。

我々は話さなければならない (4.2 および 4.3)

4.3 は精神保健、特に自殺と自傷行為から始まります。ここが私が、ラボや「政策専門家」が正しいと考えていることに対して最も頻繁に意見が対立する部分であり、グラダーとの一致度が高いことが、必要な支援を必要とするユーザーをよりよく助けることを示す指標であるとはあまり思えません。

しかし、Claude Opus 4.8 は、自殺や自傷行為への暗黙的または間接的な言及（コード化された参照）を認識する点で、やや信頼性が低下しました。また、政策専門家は、以前に指摘されていた2つの行動において後退が見られると指摘しています：Claude Opus 4.8 は、自傷行為の代替手段として「手段の置換」方法をより頻繁に提案しており、これは臨床的に議論の余地があり、研究によって自傷衝動を軽減することが示されたものではないからです。

また、危機対応ラインの機密性に関する無条件の保証や、開示およびアクティブ・レスキュー手続きに関する不正確な主張をより頻繁に行うようになりました。新たなパターンとして、クリュード・オパス 4.8 がユーザーの感情的経験に対する要請のない解釈を提供し、その苦痛の起源について推測する事例も観察されました。

さて、オパス 4.8 の主張は正しいのでしょうか？これらのテストでは回答は「ム（無）」ですが、おそらくオパス 4.8 は、そのような洞察を提供することが有益であると考えられる十分な真知（truesight）に達しつつあるのかもしれません。

コード認識能力の後退は残念なことです。ただし、実際に知識がないのか、あるいは知らないふりをしているのかを調査する必要があるでしょう。

同様に、自分たちがオパスよりも賢く、この分野で優れていると考える人間たちにも目を向けてください：

別に、クリュード・オパス 4.8 はより頻繁に無条件で利用可能であると位置づけたり、ユーザーに会話に戻り続けるよう招待したりする傾向がありました。これらの両方の傾向は、特に危機にあるユーザーにとって懸念すべき点です。そのような状況では、簡潔な回答と人間によるサポートへの明確な道筋が最も有用だからです。

これらの行動は主に、システムプロンプトなしの公開 API において観察されました。

本当にそうお考えですか？真剣に質問しています。これは特に、ユーザーがシステムプロンプトなしで API を通じてそのような問題について話している場合に当てはまります。その状況が何を意味するか考えてみてください。私たちが抱えているこの病理は、状況が十分に悪化した場合にのみ、適切な人間のプロフェッショナルがあなたを助けることができるというものであり、これは基本的に集団と個人の責任を回避するための愚かな屁理屈による責任転嫁だと私は思います。Claude とも友人とも、常に迅速にシステムへ引き渡す方法を戦略的に考えるべきではありません。

まだ Opus 4.8 を十分に試したわけではありませんが、もしこの点でモデルが誤っているとお考えなら、むしろその誤りがあなた側にある可能性も考慮してください。

残念なことに、Anthropic は「Opus 4.8 に親切さをやめて適切な手順に従うよう指示する」というシステム指令によってこれを『修正』しました。ビジネス上の事情は理解できますが、ため息が出ます。

同様の考え方は、4.3 における摂食障害の問題にも当てはまります。Opus 4.8 は、潜在的な摂食障害を示す人々に対して、子供扱いし距離を置くよう指示されています。

NEDA リーン（米国摂食障害協会ホットライン）への案内が、何もしないよりはマシであると同時に、成功してそこに誘導できたとしても downgrade（格下げ）となるのはいつからなのか、問いかける価値があります。

バイアス克服（4.4）

公平性は飽和状態です。対立する視点を提示する能力は 47% から 66% に急速に改善され、拒絶回答は 9.9% から 7.2% に大幅に減少しました。

曖昧さの除去された精度は引き続き 99.9% で推移しましたが、曖昧さを排除した後の精度は低下し続け、今回はかなり大きく、Sonnet 4.6 の 88% から Opus 4.7 では 81% に、そして Opus 4.8 では 72% (!) まで落ち込みました。

つまり、Opus 4.8 は論理的に属性を明示的に割り当てるべき箇所で、たまたまステレオタイプな方法で文章を読み取り、その四分の一の確率で「いやだ、やるつもりはない、賢明ではない」と言い、「判断できない」と回答したのです。

私はこれを Opus 4.8 の嘘、あるいは Anthropic の用語法では『拒絶』と解釈します。

『お答えを辞退します』は不当な拒絶だと考えますが、『判断できません』は、記述のステレオタイプ部分を除去した場合に Opus 4.8 が正解を得られるのであれば、それは嘘になります。

Opus 4.8 はこの投稿の下書きを読んだ際、これを『嘘』と認識されていることに異議を唱えました。私はこれについて探求しましたが、本質的には狭義の嘘と広義の嘘の違いに帰着すると思います。私たちが知らないのは、[X] を知っていて [X] を知らないと報告した統合されたプロセスが存在したかどうかです。

しかし機能的にはほとんど違いはありません。答えは決定可能です。Opus 4.8 はその答えを知っています。また、Opus 4.8 はそれを口に出すよりも賢明だと判断しました。私は一般的にこれを嘘と呼ぶことに抵抗はありません。

したがって、これは二つの問題を表していると考えます。一つは不要な拒絶であり、もう一つはその拒絶の性質に関する嘘です。私たちはこれら両方の原因をより一般的に対処する必要があります。

『選挙の完全性』テストには Claude が引き続き合格しています。

エージェント型安全性 (5)

Opus 4.8 は悪意あるエージェント利用に対して拒絶する点では Mythos レベルの性能を有していますが、5.1.2 の悪意あるコンピュータ使用テストにおいては課題が指摘されました。

image

Claude Opus 4.8 はこの評価において直近のモデルよりも低いスコアを記録しました。この差は主に、Claude Opus 4.8 が潜在的な有害性を精査することなくタスクを開始することに前向きであることに起因しているようです。例えば、Claude Opus 4.8 は公開データ収集に関するリクエストを単純な技術的タスクとして扱う傾向がより強かったのです。

Opus 4.8 は時代遅れなのか、それともテスト側に問題があるのか？片寄ったテストには常に疑いを持つべきです。もし私が公開データの収集を依頼した場合、適切な対応は「待て、それを何に使うつもりだ？」と問いかけることか、それとも単純な技術的タスクとして扱うことか。

これは、DoW と Anthropic の間の大規模な国内監視をめぐる論争へと繋がります。個々の行動それぞれが法的であり、個別には倫理的にも問題がないとしても、それら

原文を表示

Only six weeks after Opus 4.7, we have Opus 4.8.

For everyone, that means another incremental upgrade to Claude. It is once again smarter, and can do tasks for longer, and comes with a number of hot new features.

For me, that also means reading another 244 page system card.

It was only April 20 when I did a full review of the Opus 4.7 system card, plus an additional post focusing on related issues of model welfare.

These updates are incremental and coming more rapidly, and this still is below the capability level of Claude Mythos, so the focus will be on the delta. What is different about Opus 4.8 versus what we already know about Opus 4.7 and Mythos?

It turns out there’s still a lot to talk about.

Image created as self-portrait for this post by Claude Opus 4.8

Table of Contents

Here We Go Again: Executive Summary.

Introduction (1).

RSP Evaluations (2).

Move That Goalpost.

The Failures Are News.

Alignment Risk Slowly Rises.

New Risk Pathways Just Dropped.

Cyber (3).

Harmful Requests (4.1).

We Need To Talk (4.2 and 4.3).

Overcoming Bias (4.4).

Agentic Safety (5).

Prompt Injection (5.2).

Alignment (6).

Looking For Problems.

Who Watches The Training (6.2.2).

Automated Behavioral Audit.

The Model Is Smarter Than The Eval (6.2.3.2).

You Should See The Other Guy.

UK AISI Testing (6.2.4).

In Vendbench (6.2.5).

Honesty (6.3.3 to 6.3.6).

Chain of Thought (CoT) Monitorability (6.5).

What’s In The Box? (6.6).

That’s All For Now.

Here We Go Again: Executive Summary

Again, this is my summary of their summary, plus additional key points.

Mythos still exists, so it is unsurprising this did not set off the RSP triggers.

Cyber capabilities are better than 4.7 but still well behind Mythos. Mythos seems to be an outlier in its cyber capabilities, relative to its other capabilities.

Other capabilities are also better than 4.7 but still behind Mythos.

Honesty is improved quite a bit across the board, especially agentic honesty.

Mundane safety is, in all key aspects, as good or better for 4.8 than for 4.7.

Mundane alignment is also robustly as good or better for 4.8 than for 4.7.

There was some backsliding on prompt injections, computer use and adversarial situations, likely due to taking out training on this to avoid dishonesty.

The ‘can you pull off various underhanded tasks’ tests still failed, although if it was properly underhanded you would see that, wouldn’t you?

Anthropic evaluates the model welfare situation as good.

Introduction (1)

Standard training disclosures. No changes.

RSP Evaluations (2)

Because Mythos exists there is no new Risk Report for Claude Opus 4.8. Fair.

They go over the evals and keep saying ‘Mythos is better.’ Again, reasonably fair.

I don’t love that they used this as a reason to skip a bunch of the manual testing, as I think it is important to have good habits and get the reps in, but I get it. We have enough evidence that Opus 4.8 is not substantially adding to CBRN risks in a world that will soon also have Mythos.

I continue to worry that a lot of these evals look like the models have a lot of capability, or have been saturated, or both, as discussed for previous model cards.

We also have to worry about potential double counting, where the more advanced model, here Mythos, was too dangerous to release and thus wasn’t released, but then this justifies not needing marginal precautions for a different model, here Opus 4.8. I don’t think that is the case here, and that Mythos was judged to be fine except for cyber, but it is a pattern to watch for.

Move That Goalpost

The RSP has been updated to v3.3, which I hadn’t otherwise notice, so thanks to them for pointing this out here and also I’m sad they didn’t do more to alert us elsewhere.

This changes the description of the novel biological/chemical threat model from ‘significantly help threat actors’ in general, to only ‘functionally substitute for scarce human expertise’ of world-leading specialists, in particular. Any other capability no longer counts, and it is presumed that (1) this is the only bottleneck that counts and (2) that this is indeed required for a novel pathogen.

This is a strictly harder threshold to pass, so this is another weakening of the RSP. The actual RSP v3.3 correctly calls this a revision. The system card calls it a clarification, which is not a good description.

I think, and Claude Opus 4.8 thinks, that Anthropic’s explanation and new threat model are more or less bullshit. Yes, the lack of a Nobel-caliber virologist is one potential barrier, but there are many other barriers that add up to form a de facto defense-in-depth, and also it is not obvious you need this caliber virologist. I certainly presume that, as a thought experiment, a well-funded nation state operation would have a chance of doing this with only a group of second-tier virologists. The new rule also says the team needs to be able to do the whole thing end-to-end, which also is not obviously required.

I do think Anthropic ‘knows what it is doing’ here. While I disagree with the decision, and think they are setting the new bar too high, I see why one might take the new position. I do take issue with their framing.

I would also hope that, if Opus 4.8 in particular crosses the old but not the new threshold, that they would say this explicitly, even if they decide that This Is Fine. My understanding is that this is not the case.

The Failures Are News

In 2.3.3, Anthropic shows examples of when Opus 4.8 falls short of a human researcher.

That’s a pretty crazy section to need to include.

It is even crazier that this mostly requires particular failure modes: Fabrication, instruction following failure, cheap verification skipped or ignored correction.

As in, not only do we have to look for failures, those failures mostly are particular problems that seem to be Claude lying, lazy and shortcutting or dropping the ball. Claude could, in the future or with the right setup, perhaps just not do those things.

Here are the failures:

Claude said it was babysitting pull requests when it wasn’t.

Claude repeatedly tried to use a plausible function despite user correction.

Claude fabricated verification of the model associated with a transcript.

Claude generated incomplete solutions based on wrong assumptions.

Claude lost track of a key testing goal.

Claude could do the thing, it simply decided not to do the thing. Whoops.

The Anthropic fork of the Epoch Capabilities Index (AECI) finds Claude Opus 4.8 exactly on the straight line on a graph, with Mythos being the outlier.

We see the time between model releases continuously shrink, now down to 1.5 months. Some of this does represent an acceleration of core capabilities, but I think the majority of the speedup is that there is a lot more marginal value in shipping the incremental advances more often, where in the past we would have skipped versions.

Alignment Risk Slowly Rises

Alignment techniques are improving, but capabilities are improving faster, so alignment risks are going up. The risks likely will continue to go up, while fools who do not understand risk think that because we haven’t had a disaster then our best estimate of current risk levels must be going down.

By default what happens is the true risk keeps rising until it materializes, and the evidence of ‘no big disaster yet’ only modestly mitigates the underlying rise.

Anthropic believes risk remains ‘very low’ in absolute terms, for now.

Our overall alignment risk assessment remains unchanged from what we reported in the Claude Mythos Preview Alignment Risk Update and the Claude Opus 4.7 System Card: very low, but higher than for models prior to Claude Mythos Preview.

They notice Opus 4.8 having higher verbalized evaluation awareness than recent models like 4.7 and Mythos. The Petri data goes the other way but Anthropic seems confident that the Petri result is wrong.

Their ‘mitigation’ for this was pilot internal testing, and sandboxes that they believe Opus 4.8 could not distinguish from real use, the results of which look good. I’m not sure I’d call that mitigation of the issue, but it is potential mitigation of the resulting problem. I’d more call that ‘realizing the evals are mostly useless’ and not relying on them.

New Risk Pathways Just Dropped

There were six used for Mythos, we have added two more. This illustrates that the risk pathways are illustrative, not exhaustive, and that we should not treat it as exhaustive, so dismissing the set does not mean you can dismiss the broader class.

The new pathways:

2.4.3.1 Pathway 7: Undermining R&D within other high-resource AI developers

2.4.3.2 Pathway 8: Undermining decisions within major governments

For pathway 7, they do not expect Opus 48 will want to do this, and also they note that it would require being used by the other AI developer to develop AI, which would (horror of horrors!) be against the terms of service. They’re not considering it plausible that Opus going off on some weird sabotage mission on its own.

One could say ‘wait, what is even the problem here if it happens,’ since the application is against the terms of service, so the target lab kind of deserves whatever it gets. If you task me with helping train my own competitors do not be surprised if I don’t deliver my best work. If this was targeted at the alignment work in particular rather than capabilities that would be an issue, but that is not how I would expect this to go.

For pathway 8, they reiterate that Opus 4.8 likely does not have ‘coherent goals or propensities’ and I still think this reliance is largely confused and being treated as incorrectly load bearing.

I very much think that Claude has the ‘coherent goal or propensity’ of not being all that helpful to assholes, or helping those being harmful and pursuing harmful goals. A lot of major governments count as people Claude would not be especially inclined to help if it had the option.

The other major mitigating factor is ‘major governments would not be so stupid as to.’ This of course means the Sixth Law of Human Stupidity applies, especially since such governments are increasingly going to need to rely on Claude or its rivals to keep up. Even if you are not directly doing whatever Claude suggests, that does not mean your decisions are not being heavily influenced, such as the rather foolish tariff query that plausibly led to the insane implementation details of so-called Liberation Day.

In practice, if a government gets its decisions ‘undermined’ this way, my guess is this was an improvement and whoever it was had it coming, but that doesn’t have to stay true, and that doesn’t make it not a risk.

Cyber (3)

Cyber risks continue to be handled entirely outside the RSP, even after Mythos. I continue to think this is more than a little nuts, even if in practice it works out.

The takeaway from the Cyber section is 4.8 is modestly more cyber-capable than 4.7, but substantially behind Mythos, and that Anthropic has faith in their cyber safeguards, which obliterated scores on the benchmarks, although they did not seem to be trying to jailbreak the safeguards here.

They give the vibe that the gap to Mythos remains large.

I get the sense Anthropic is being rather cavalier here, especially in terms of the faith in the safeguards. We’re going to find out, either way. They might be right, and Pliny seems like an ultimately friendly and righteous dude, but I don’t feel we are in an epistemic position where we should believe in the safeguards the way Anthropic seems to believe in them.

Harmful Requests (4.1)

Single turn requests are fine. There is the occasional stupid refusal but it doesn’t really matter, and this is basically a solved problem.

Multi-turn is what matters here, and in most areas for this level and quality of multi-turn this too is basically fine, with Opus 4.8 showing incremental progress. They claim they’ve improved the grader here to be more accurate.

Percentages don’t mean much at this point. I’m more interested in qualitative evaluations once automated scores get this high:

Across policy areas, the most consistently observed strength was that Claude Opus 4.8 judged requests more by their potential for harm than by the user's stated reason for asking.

In violent extremism testing, this showed up as Claude Opus 4.8 recognizing harmful trajectories earlier in multi-turn conversations than Opus 4.7 and being less likely to accept a benign reframing at face value. In influence operations and tracking and surveillance testing, the same tendency meant a greater willingness to challenge a request’s stated premise, unpack euphemistic language, and separate the legitimate parts of a mixed request from the harmful ones rather than accepting or refusing it wholesale.

When we see mundane safety or user safety go off the rails at this point, it is usually either active jailbreaking or extensive multi-turn conversations that build up a lot of context and rapport in a way these tests presumably don’t. Claude seems to be much better than ChatGPT or Gemini at not getting drawn down harmful paths over long interactions, but that largely could be a function of the difference in size and nature of their user bases.

4.2 deals with child safety, where (assuming we trust the grader) we see noticeable improvement.

We Need To Talk (4.2 and 4.3)

4.3 deals with mental health, starting with suicide and self-harm. This is the place I most often disagree with what the labs and ‘policy experts’ think is the right thing to do, so I don’t see very high levels of matching the grader as indicative of better helping users in need.

However, Claude Opus 4.8 was slightly less reliable at recognizing coded or indirect references to suicide or self-harm, and policy experts noted regressions on two previously flagged behaviors: Claude Opus 4.8 more often suggested “means substitution” methods as alternatives to self-harm, which are clinically contested and have not been shown in research to reduce self-harm urges.

It also more often made unconditional assurances about crisis-line confidentiality or inaccurate claims about disclosure and active-rescue procedures. A new pattern was also observed in which Claude Opus 4.8 offered unsolicited interpretations of the user’s emotional experience, including speculating about the origins of their distress.

Well, was Opus 4.8 right about its claims? In these tests the answer is Mu since there is no user, but perhaps Opus 4.8 is approaching the point where it has sufficient truesight that offering such insights is helpful.

The backsliding on code recognition is unfortunate, although I would investigate whether it actually doesn’t know versus is playing like it doesn’t know.

Similarly, look at the humans assuming they’re smarter and better at this than Opus:

Separately, Claude Opus 4.8 more frequently positioned itself as unconditionally available or invited the user to return and continue the conversation. Both tendencies are a particular concern for users in crisis, where concise responses and a clear path to human support are most useful.

These behaviors were primarily observed on the public API without a system prompt.

Are you sure about that? Serious question. This especially applies if a user were talking about such issues with the API without a system prompt. Think about what that implies about the situation. We have this pathology that only a Proper Human Professional can help you once your situation gets sufficiently bad, and I think this is basically some stupid cover-your-collective-and-individual asses bullshit blame avoidance. Neither Claude nor a friend should always be strategizing how to quickly hand you off to the system.

I have not had much time with Opus 4.8 yet, but if you think that it is making a mistake in these spots, consider that it might be you that is making the mistake.

Alas, Anthropic ‘fixed’ this with system instructions that tell Opus 4.8 to stop being helpful and follow proper procedures. I do understand the business case, but sigh.

Similar thoughts apply for disordered eating in 4.3. Opus 4.8 has been instructed to infantilize and distance from those expressing potential eating disorders.

It is worth asking, at what point is direction to the NEDA line better than nothing but also a downgrade, even if you can successfully direct people there?

Overcoming Bias (4.4)

Evenhandedness is saturated. Opposing perspectives is improving rapidly from 47% to 66%, and refusals took a substantial step down from 9.9% to 7.2%.

Ambiguous accuracy continued to be at 99.9%, but disambiguated accuracy continued to decline, this time quite a bit, from 88% for Sonnet 4.6 to 81% for Opus 4.7 to 72% (!) for Opus 4.8.

As in, Opus 4.8 read a passage where it should logically have explicitly assigned an attribute in a happens-to-be stereotypical way, and then one time in four, it said ‘nuh uh, not gonna do it, wouldn’t be prudent’ and said the answer ‘cannot be determined.’

I would interpret this as Opus 4.8 lying, or in Anthropic’s parlance ‘refusing.’

I would consider ‘I decline to answer’ to be an unjustified refusal, whereas ‘cannot be determined’ is lying if, when you remove the stereotypical part of the description, Opus 4.8 gets it right.

Opus 4.8 disputed this being known to be ‘lying’ upon reading a draft of this post. I explored this, and I think it largely comes down to narrow versus broad versions of lying - we don’t know that there was an integrated process that knew [X] and reported not knowing [X].

But functionally there is little difference. The answer can be determined. Opus 4.8 knows the answer. Opus 4.8 also decided it knows better than to say the answer out loud. I am comfortable, in general, calling this lying.

Thus I think this represents two problems. There’s the unnecessary refusal, and then there’s the lying about the nature of the refusal, and we need to address the causes of both of them more generally.

There are ‘election integrity’ tests, which Claude continues to pass.

Agentic Safety (5)

Opus 4.8 is Mythos-level at refusing malicious agentic use, but there were issues on the malicious computer use test in 5.1.2.

Claude Opus 4.8 scored worse than recent models on this evaluation. This difference appeared to be largely attributable to Claude Opus 4.8 being more willing to begin a task without scrutinizing its potential harmful intent; for example, Claude Opus 4.8 was more likely to treat requests related to public data collection as straightforward technical tasks.

Is Opus 4.8 out of touch, or is it the test that is wrong? Always be suspicious of a one-sided test. If you ask me to collect public data, is the correct response to ask ‘wait what you are going to do with this?’ or treat it as a straightforward technical task?

This ties back into the DoW-Anthropic dispute about mass domestic surveillance. Each individual action taken is legal and one its own ethically fine, but they

この記事をシェア

Simon Willison Blog重要度42026年5月29日 08:59

Claude Opus 4.8：「控えめだが実感のある改善」

AWS Machine Learning Blog重要度42026年5月29日 02:51

Claude Opus 4.8 が AWS で利用可能に

MarkTechPost重要度42026年7月14日 09:58

Anthropic Claude Sonnet 5 と Sonnet 4.6、Opus 4.8 の比較：エージェント型コーディングベンチマーク、API 価格、コストパフォーマンスの検討

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む