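TLDR AI · June 10, 2026, 09:00 · approx. 16 min

Claude Fable 5 and the new AI safety fable (14 minute read)

#LLM #Anthropic #AI Safety #Claude Fable 5 #Model Performance

TL;DR

Anthropic has released Claude Fable 5 to the general public. Performance improves dramatically, but the release also introduces strict safety measures, including some applied without explicitly notifying users, marking an important turning point in how the AI industry balances safety against capability development.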

AI Deep Analysis · June 11, 2026, 01:05

Importance: / 5

Depth: 40%

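Key points

Performance and pricing of Claude Fable 5

It is the smartest model available to the general public and makes large gains on nearly every benchmark, but it is priced relatively high, at roughly 2x the current Opus models.

Unevenly implemented safety measures

The release introduces an "uneven" safety policy that mixes measures users are explicitly notified about with measures applied silently inside the model, which the article argues may become a cautionary lesson for the industry.

Background to the technical breakthrough

The performance gains are attributed not to a single clear new technique such as inference-time scaling or RL, but to advances across the whole stack (data, architecture, and so on).

Implications for the industry and the limits of safety

Behind Anthropic’s adoption of "heavier-handed" safety measures to maintain and entrench its current lead is the reality that AI capability development faces no immediate wall.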

Key quotes

Claude Fable 5 is definitely the smartest model available to the general public

The unevenly applied safety policies that Anthropic have rolled out are on track to become a classic cautionary fable in how narrow and self-fulfilling notions of safety and control rarely work out.

There's no clear breakthrough associated with this model, such as inference-time scaling or RL... achieved by advances across the whole stack

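Impact analysis

The article suggests that the fundamental question of how to reconcile "safety" and "capability" in AI model development is shifting from a purely technical problem to a strategic and ethical dilemma. In particular, safety measures applied without users’ knowledge are likely to heighten industry-wide wariness about a lack of transparency and unevenly applied restrictions.

Editorial comment

The contest between a leap in performance and the safety restrictions being quietly tightened in its shadow is an important fork in the road for the future of AI development. Safety measures implemented without transparency could have long-term effects on user trust and industry standards.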

Today, Anthropic released their Claude Fable 5 model to consumer and enterprise audiences. This is the general-access variant of their Mythos-class models. With it, Anthropic rolled out a series of safety measures — some explicitly called out to users and some modifying the model without telling the user. It should be less surprising than it is that the next major step in AI capabilities came with heavier-handed safety measures indicating Anthropic’s intention to protect, or entrench, their current lead.

The unevenly applied safety policies that Anthropic have rolled out are on track to become a classic cautionary fable in how narrow and self-fulfilling notions of safety and control rarely work out.

Before digging into the nuance of the safety facts, it is important to establish the quality of this model. The quality of the model paints the stakes of today — as these safety features are meaningfully changing the shape of access to frontier AI, something which has never happened with the modern LLMs we know. Second, the capabilities point to this story only accelerating. Recursive self-improvement isn’t quite the right mental model of progress from here, but Claude Fable 5 should make it very clear that there are no immediate walls in training LLMs.

To start, Claude Fable 5 is definitely the smartest model available to the general public, with a remarkable leap on pretty much every relevant benchmark of the day, at only 2X the price of current Opus models¹ (which is still less than GPT 5.5 Pro’s variant). This alone is a seminal moment for the field. To have a model iteration take such a substantial step in capabilities, a few years into the post-ChatGPT LLM race, is astounding. There’s no clear breakthrough associated with this model, such as inference-time scaling or RL, and public wisdom is that this is achieved by advances across the whole stack (of course, we can’t know for sure; it’s not documented). This is a major technical achievement and the employees who built the model should be very proud of their work.
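To make the pricing concrete, here is a minimal sketch of the per-request arithmetic. The Fable prices come from the footnote to this post ($10 per million input tokens, $50 per million output tokens). The request size is hypothetical, and the Opus figure is only what "2X the price" would imply if both rates are exactly half, which is an assumption rather than a quoted price.

```python
# Prices quoted in the post's footnote (USD per million tokens).
FABLE_INPUT_PER_MTOK = 10.0
FABLE_OUTPUT_PER_MTOK = 50.0


def request_cost(input_tokens: int, output_tokens: int,
                 input_per_mtok: float = FABLE_INPUT_PER_MTOK,
                 output_per_mtok: float = FABLE_OUTPUT_PER_MTOK) -> float:
    """Return the USD cost of one request at the given per-million-token prices."""
    total = input_tokens * input_per_mtok + output_tokens * output_per_mtok
    return total / 1_000_000


# Hypothetical request: 20,000 input tokens and 4,000 output tokens.
fable_cost = request_cost(20_000, 4_000)

# Assumption: "2X the price of current Opus models" means Opus is half of both rates.
opus_cost = request_cost(20_000, 4_000,
                         FABLE_INPUT_PER_MTOK / 2, FABLE_OUTPUT_PER_MTOK / 2)

print(f"Fable: ${fable_cost:.2f}, implied Opus: ${opus_cost:.2f}")
```

Because output tokens cost five times as much as input tokens, long agentic runs that generate a lot of text will be dominated by the output rate.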

This model was held for 2+ months after it finished training before it became publicly available². Given the competitive dynamics of the AI economy, a smarter version of this model is already well underway.

To continue, the benchmarks for the model are below.

An asterisk on these scores is that these aren’t necessarily the scores that the public will get, as some of the prompts will be downgraded to Opus 4.8 with the current safety filters on the model.
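One rough way to reason about that asterisk is to treat the publicly observed score as a weighted mix of the two models. The sketch below assumes a per-prompt fallback rate and per-model scores that are purely placeholders. None of these numbers come from Anthropic, and the "95% of sessions" figure quoted later is about sessions rather than benchmark prompts, so it cannot be plugged in directly.

```python
def expected_public_score(fable_score: float, fallback_score: float,
                          fallback_rate: float) -> float:
    """Blend two benchmark scores by the fraction of prompts that get downgraded.

    fallback_rate is the share of prompts routed to the fallback model (0.0 to 1.0).
    """
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must be between 0 and 1")
    return (1.0 - fallback_rate) * fable_score + fallback_rate * fallback_score


# Placeholder numbers for illustration only, not reported results.
blended = expected_public_score(fable_score=80.0, fallback_score=60.0,
                                fallback_rate=0.10)
print(round(blended, 2))
```

The gap between the headline score and what users see grows with both the fallback rate and the score difference between the two models, so a benchmark concentrated in a filtered domain such as cybersecurity could diverge far more than an average one.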

This is the type of jump in benchmark scores where I don’t even need to substantially test the model to know it’s an incredible tool. Remember that Anthropic is also the AI lab with the track record of caring *the least* about benchmarks (in particular, when compared to OpenAI and Gemini). Recall a comment I made in June of 2025:

This is a different path for the industry and will take a different form of messaging than we’re used to. More releases are going to look like Anthropic’s Claude 4, where the benchmark gains are minor and the real world gains are a big step. There are plenty of more implications for policy, evaluation, and transparency that come with this. It is going to take much more nuance to understand if the pace of progress is continuing, especially as critics of AI are going to seize the opportunity of evaluations flatlining to say that AI is no longer working.

Clearly, a few pieces of the progress dynamics have changed, but that’s a post for another day. I’ve written multiple posts about new models this year, specifically on how hard it is to trust benchmarks (partly because the benchmarks don’t move that much). Altogether, this is a major validation for AI-savvy workers who realized they’re likely never going to write meaningful code again and need to develop new workflows around agents.

There are multiple pieces of safety tooling associated with this release, including but not limited to required data-retention policies and added prompt filters. Through this analysis it is particularly important to be precise and clear as to which pieces of these are causing harm, and why single elements being out of place in an otherwise comprehensive policy are so damning for the overall safety process.

For their focus areas of cybersecurity, targeted model distillation, and research biology, Anthropic details new safety classifiers in their blog post:

Fable 5 comes with a new set of classifiers: separate AI systems that detect potential misuse, including jailbreak attempts, and prevent the main model (in this case Fable 5) from responding. We’ve been running classifiers on our models for some time [1], and Fable 5’s classifiers are an extension of this previous work with extra coverage.

[1]: https://www.anthropic.com/research/next-generation-constitutional-classifiers

When Fable’s classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs. Opus 4.8 is a highly capable model in its own right: a response that falls back to Opus is a far better experience than an outright refusal from Fable. Our early data shows that more than 95% of Fable sessions involve no fallback at all; for those sessions, Fable 5’s performance is effectively the same as that of Mythos 5.
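The behavior described in that quote (classify, fall back to another model, tell the user) can be sketched as a small routing function. This is an illustration of the described flow only: the function names, the keyword-based classifier, and the response shape are all hypothetical stand-ins, not Anthropic’s implementation or API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Topic areas named in the quoted blog post.
RESTRICTED_TOPICS = ("cybersecurity", "biology and chemistry", "distillation")


@dataclass
class RoutedResponse:
    text: str
    served_by: str          # "primary" or "fallback"
    notice: Optional[str]   # user-visible message when a fallback happens


def route_request(prompt: str,
                  classify: Callable[[str], Optional[str]],
                  primary: Callable[[str], str],
                  fallback: Callable[[str], str]) -> RoutedResponse:
    """Send flagged prompts to the fallback model and attach a visible notice."""
    topic = classify(prompt)
    if topic in RESTRICTED_TOPICS:
        return RoutedResponse(
            text=fallback(prompt),
            served_by="fallback",
            notice=f"This request was handled by the fallback model ({topic}).",
        )
    return RoutedResponse(text=primary(prompt), served_by="primary", notice=None)


def toy_classifier(prompt: str) -> Optional[str]:
    """Hypothetical stand-in; a real classifier would be a separate AI system."""
    return "cybersecurity" if "exploit" in prompt.lower() else None


flagged = route_request("Write an exploit for this service", toy_classifier,
                        lambda p: "primary answer", lambda p: "fallback answer")
print(flagged.served_by, "|", flagged.notice)
```

The property that matters for the argument in this post is the `notice` field: in this design the user can always tell which model answered.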

Examples of the primary cybersecurity and biology safety filters, which tell users explicitly when they are triggered, are already proliferating online, and the filters appear quite sensitive. These can be a frustrating experience for users, but Anthropic is well within its rights to do this and is intellectually consistent in doing so.

The damaging part of the safety story falls below the fold in the Claude Fable 5 & Claude Mythos 5 System Card:

We have also added safeguards related to frontier LLM development. As discussed in Section 6.1 of our February 2026 Risk Report, we are concerned about the risks of accelerating the overall pace of AI development, though we remain uncertain about the severity of these risks. In particular, our concern is with, as we wrote then, “accelerating other AI developers in building powerful AI systems that pose similar risks to the ones ours pose - without necessarily having commensurate safeguards.”

In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.

Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).
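For readers unfamiliar with the term, a steering vector in the published interpretability literature generally means a fixed direction added to a model’s hidden activations at inference time. The sketch below shows only that generic arithmetic on a toy activation. The system card does not describe how Anthropic applies it, so the function name, the vectors, and the strength value are all hypothetical.

```python
from typing import List


def apply_steering(hidden: List[float], direction: List[float],
                   strength: float) -> List[float]:
    """Return hidden + strength * direction, the basic activation-steering update."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and steering direction must have equal length")
    return [h + strength * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activation and direction; real models use thousands of dimensions.
hidden_state = [0.5, -1.0, 2.0]
steering_direction = [1.0, 0.0, -1.0]

# A negative strength pushes the activation away from the chosen direction.
steered = apply_steering(hidden_state, steering_direction, strength=-0.5)
print(steered)
```

Unlike a model fallback, nothing in this kind of update changes which model is serving the request or surfaces a notice in the response. The change lives entirely inside the forward pass, which is why it would not be visible to the user.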

Anthropic documents that this will impact only a small percentage of users, which is true. My focus is on that small number of users who support AI’s diffusion and understanding outside of the few frontier labs, which is a crucial mechanism for the continued safety of the technology.

Anthropic is documenting how the proliferation of AI capabilities is a concern to them, but they are solving it by misleading their users. An AI model that gets less intelligent automatically without notifying me is categorically misaligned AI. The next step on this line (not that Anthropic did it, but they could) is to have a model silently manipulate a workplace when it thinks it is an unsafe use for AI. Second, the implementation here is more complicated than what was documented for cybersecurity or biology: it modifies the model itself or the data presented to it, all without notifying the user.³

The duality of these policies is extremely confusing, and the inconsistency casts doubt over Anthropic’s safety policies as a whole. This “safety” measure comes across as being far more about maintaining their competitive position. Again, if all of the safety policies took one form, this would be far more cogent and easier to support intellectually.

Anthropic has been very vocal about their *concern* over distillation attacks, particularly from Chinese actors. Their claims are not transparent enough about the facts, or about why they can’t prevent the behavior, to be fully believable. Despite the limited information, in the broader AI and D.C. communities there have been serious discussions about taking action against the Chinese model builders on the grounds of said distillation.

On the point of distillation, my hypothesis is that API builders don’t have an easy time preventing hacks or jailbreaking because it’s a deeply grounded property of reasoning models to want to output the reasoning traces, and it would make the model far less intelligent to fully patch the behavior. This is based on a few assumptions:

1. Chinese labs are not just showing up as customers to Anthropic’s API and paying for tokens in the intended input-output form. If the Chinese labs are paying for intended use behaviors, despite being banned by the terms and conditions, I don’t have a lot of sympathy for the frontier labs manifesting policy actions against this.
2. Reasoning traces are disproportionately effective at seeding behavior in downstream models.
3. Leading labs work very hard to patch the pipeline of these jailbreaks.

So, my logical conclusion is that the model companies would have to weaken their economic position to fully protect their IP. If this is the case, Anthropic would get a lot more sympathy from the AI research community by being transparent. It would also be far easier to have informed policy discussions, and not rely on me proposing Occam’s razor explanations for what the API jailbreaking looks like.

Building these safeguards is not something that Anthropic should do alone. Safety research should be built on common understanding and information sharing across both labs and public research efforts.

If the exact safety procedures were actually the top line item for the company, a true non-negotiable for the leadership, they wouldn’t permit the model to be released with an unclearly implemented safety filter in one of their areas of focus (frontier AI training). I am asking: why isn’t there a classifier to downgrade AI research requests? This is a mix of transparent and reasonable safety policies with quietly rolled-out market entrenchment tactics.

I personally cannot trust the best AI model in the world to work in my professional domain of building models, a career I’ve constructed entirely out of a passion for making sure the transition to very powerful AI systems goes well for society. This inevitably will feel like a declaration of superiority by the Anthropic leadership.

All of the actions Anthropic is taking, including calling out smaller Chinese companies for distillation, are well within their rights. In fact, many people already expected the leading frontier models to be withheld from users so that labs can protect their IP. Today’s actions miss the big picture that AI will always be an ecosystem, and cultivating an us-against-them dynamic between the leading company and the other players is structurally unstable.

Remember, this is at a time when the AI ecosystem is seeing the first stirrings of violence against AI leaders [2], and I’ve heard from many people that they don’t expect it to abate. I wish I knew how to engage more to prevent this, and I see myself in the non-profit sector as someone who can hopefully independently represent AI to broader stakeholders.

I believe something was misread, or at least misunderstood, here by an Anthropic leadership with a narrowly cultivated worldview around AI. An overwhelming sentiment I had today was one of obligation and confusion. I shared [3] how I don’t really want to have to go to bat against Anthropic, but they’ve been unnecessarily antagonistic to China, then not so subtly to open-weight models, and now more broadly to open AI research.

I understand that Anthropic has a specific view of AI, but such a powerful technology will never have its final equilibrium be one of singular control by a private company. Anthropic showcased this earlier this year in its spat with the Department of Defense, which points to a long-term equilibrium where the government will either want AI to be controlled by them or to be open. This made me believe [4] that an open ecosystem is a far safer outcome.

[2]: https://jasmi.news/p/warning-shots

[3]: https://x.com/natolambert/status/2064412173527556298

[4]: https://www.interconnects.ai/p/how-anthropic-vs-dow-impacts-open

Many of these events make me feel that Anthropic’s leadership has a culture by which they can’t help but speedrun through these issues — going head to head with existing power structures. This adds substantial uncertainty into an AI ecosystem at a time when it is very much not needed.

Collectively, the last week could be seen as a major rallying point for a new open-source ecosystem in the U.S. Nvidia released their first flagship model last week — Nemotron 3 Ultra — and these actions from Anthropic have galvanized a unanimous motivation and concern among my peers building open models. We need intelligence that we can trust, that we can modify, and that we can control.

The American open-source ecosystem has its feet underneath it and keeps being given more reasons to fight for its leadership, right from the hands of the companies it directly undercuts. That’s the moral of this fable.

¹ Fable is priced at $10 per million input tokens and $50 per million output tokens.

² Based on the original Mythos roll-out, which is an imperfect metric.

³ Fable confirmed for me that these are different mechanisms.
