Claude Fable 5 と新たな AI セーフティ・ファブル(14 分読了)
Anthropic が一般公開向けに「Claude Fable 5」を発表し、性能が劇的に向上する一方で、ユーザーへの明示的告知なしを含む厳格な安全対策を同時に導入したことは、AI 業界における安全性と能力開発のバランスに関する重要な転換点を示している。
キーポイント
Claude Fable 5 の性能と価格
一般公開されているモデルの中で最も賢く、あらゆるベンチマークで大幅な進歩を遂げたが、現在の Opus モデルの約 2 倍という比較的高い価格設定となっている。
不均衡な安全対策の実装
ユーザーに明示的に通知される措置と、モデル内部で無言のうちに適用される措置が混在する「不均一な」安全ポリシーが導入され、これが業界の教訓となる可能性が指摘されている。
技術的ブレークスルーの背景
推論時間のスケーリングや RL などの明確な新手法ではなく、スタック全体(データ、アーキテクチャなど)にわたる進歩によって性能向上が達成されたと分析されている。
業界への示唆と安全性の限界
Anthropic が現在のリードを維持・強化するために「より強硬な」安全対策を採用した背景には、AI 能力開発における壁がすぐには存在しないという現実がある。
影響分析・編集コメントを表示
影響分析
この記事は、AI モデルの開発において「安全性」と「能力」をどう両立させるかという根本的な課題が、単なる技術的問題から戦略的・倫理的なジレンマへと変化していることを示唆しています。特に、ユーザーに知らされずに適用される安全対策の導入は、透明性の欠如や規制の不均衡に対する業界全体の警戒感を高める要因となるでしょう。
編集コメント
性能の飛躍的向上と、その影で静かに強化される安全規制のせめぎ合いが、今後の AI 開発の行方を占う重要な分岐点となっています。透明性を欠く安全対策の実装は、ユーザー信頼や業界標準に長期的な影響を与える可能性があります。
Today, Anthropic released their Claude Fable 5 model to consumer and enterprise audiences. This is the general-access variant of their Mythos-class models. With it, Anthropic rolled out a series of safety measures — some explicitly called out to users and some modifying the model without telling the user. It should be less surprising than it is that the next major step in AI capabilities came with heavier-handed safety measures indicating Anthropic’s intention to protect, or entrench, their current lead.
The unevenly applied safety policies that Anthropic have rolled out are on track to become a classic cautionary fable in how narrow and self-fulfilling notions of safety and control rarely work out.
Before digging into the nuance of the safety facts, it is important to establish the quality of this model. The quality of the model paints the stakes of today — as these safety features are meaningfully changing the shape of access to frontier AI, something which has never happened with the modern LLMs we know. Second, the capabilities point to this story only accelerating. Recursive self-improvement isn’t quite the right mental model of progress from here, but Claude Fable 5 should make it very clear that there are no immediate walls in training LLMs.
To start — Claude Fable 5 is definitely the smartest model available to the general public — a remarkable leap on pretty much every relevant benchmark of the day — at only 2X the price of current Opus models1 (which is still less than GPT 5.5 Pro’s variant). This alone is a seminal moment for the field. To have a model iteration take such a substantial step in capabilities, a few years into the post-ChatGPT LLM race, is astounding. There’s no clear breakthrough associated with this model, such as inference-time scaling or RL, and public wisdom is that this is achieved by advances across the whole stack (of course, we can’t know for sure — it’s not documented). This is a major technical achievement and the employees who built the model should be very proud of their work.
This model was delayed 2+ months after it was done training before it was publicly available2. Given the competitive dynamics of the AI economy, the smarter version of this model is already well underway.
To continue, the benchmarks for the model are below.
An asterisk on these scores is that these aren’t necessarily the scores that the public will get, as some of the prompts will be downgraded to Opus 4.8 with the current safety filters on the model.
This is the type of jump in benchmark scores where I don’t even need to substantially test the model to know it’s an incredible tool. Remember that Anthropic is also the AI lab with the track record of caring *the least* about benchmarks (in particular, when compared to OpenAI and Gemini). Recall a comment I made in June of 2025:
This is a different path for the industry and will take a different form of messaging than we’re used to. More releases are going to look like Anthropic’s Claude 4, where the benchmark gains are minor and the real world gains are a big step. There are plenty of more implications for policy, evaluation, and transparency that come with this. It is going to take much more nuance to understand if the pace of progress is continuing, especially as critics of AI are going to seize the opportunity of evaluations flatlining to say that AI is no longer working.
Clearly, a few pieces of the progress dynamics have changed, but that’s a post for another day. I’ve written multiple posts about new models this year specifically in how it’s hard to trust benchmarks (and partially because the benchmarks don’t move that much). Altogether, this is a major validation for AI-savvy workers who realized they’re likely never going to write meaningful code again and need to develop new workflows around agents.
There are multiple pieces of safety tooling associated with this release, including but not limited to required data-retention policies and added prompt filters. Through this analysis it is particularly important to be precise and clear as to which pieces of these are causing harm, and why single elements being out of place in an otherwise comprehensive policy are so damning for the overall safety process.
For their focus areas of cybersecurity, targeted model distillation, and research biology, Anthropic details new safety classifiers in their blog post:
Fable 5 comes with a new set of classifiers: separate AI systems that detect potential misuse, including jailbreak attempts, and prevent the main model (in this case Fable 5) from responding. We’ve been running classifiers on our models for some time, and Fable 5’s classifiers are an extension of this previous work with extra coverage.When Fable’s classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs. Opus 4.8 is a highly capable model in its own right: a response that falls back to Opus is a far better experience than an outright refusal from Fable. Our early data shows that more than 95% of Fable sessions involve no fallback at all—for those sessions, Fable 5’s performance is effectively the same as that of Mythos 5.
Examples of the primary cybersecurity and biology safety filters — which tell the users explicitly when they’re triggered — are already proliferating online and appear quite sensitive. These can be a frustrating experience for users, but Anthropic is definitely within its power to do this and intellectually consistent for doing so.
The damaging part of the safety story falls under the fold in the Claude Fable 5 & Claude Mythos 5 System Card:
We have also added safeguards related to frontier LLM development. As discussed in Section 6.1 of our February 2026 Risk Report, we are concerned about the risks of accelerating the overall pace of AI development, though we remain uncertain about the severity of these risks. In particular, our concern is with—as we wrote then—“accelerating other AI developers in building powerful AI systems that pose similar risks to the ones ours pose - without necessarily having commensurate safeguards.” In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms. Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).
Anthropic documents on how this will impact a small percentage of users, which is true. I focus on the small amount of users supporting AI’s diffusion and understanding outside of the few frontier labs, as a crucial mechanism for the continued safety of the technology.
Anthropic is documenting how the proliferation of AI capabilities is a concern to them, but they are solving it by misleading their users. An AI model that gets less intelligent automatically without notifying me is categorically misaligned AI. The next step on this line — not that Anthropic did it, but they could — is to have a model silently manipulate a workplace when it thinks it is an unsafe use for AI. Second, the implementation here is more complicated than was documented for cybersecurity or biology — modifying the model itself or the data presented to it, all without notifying the user.3
The duality of these policies is extremely confusing and paints a strong inconsistency that casts doubt over their safety policies. This “safety” measure is presented as being far more about maintaining their competitive position. Again, if all of the safety policies took one form, this would be far more cogent and easier to support intellectually.
Anthropic has been very vocal about their *concern* over distillation attacks from particularly Chinese actors. Their claims are not transparent enough with the facts — or context as to why they can’t prevent the behavior — to be fully believable. Despite the limited information, in the broader AI and DC communities, there have been serious discussions about taking action against the Chinese model builders on the grounds of said distillation.
On the point of distillation, my hypothesis is that API builders don’t have an easy time preventing hacks or jailbreaking because it’s a deeply grounded property of reasoning models to want to output the reasoning traces, and it would make the model far less intelligent to fully patch the behavior. This is based on a few assumptions:
- Chinese labs are not just showing up as customers to Anthropic’s API and paying for tokens in the intended input-output form. If the Chinese labs are paying for intended use behaviors, despite being banned by the terms and conditions, I don’t have a lot of sympathy for the frontier labs manifesting policy actions against this.
- Reasoning traces are disproportionately effective at seeding behavior in downstream models.
- Leading labs work very hard to patch the pipeline of these jailbreaks.
So, my logical conclusion is that the model companies would have to weaken their economic position to fully protect their IP. If this is the case, Anthropic would get a lot more sympathy from the AI research community by being transparent. It would also be far easier to have informed policy discussions, and not rely on me proposing Occam’s razor explanations for what the API jailbreaking looks like.
Building these safeguards is not something that Anthropic should do alone. Safety research should be built on common understanding and information sharing across both labs and public research efforts.
If the exact safety procedures were actually the top line item to the company — a true non-negotiable for the leadership — they wouldn’t permit the model to be released with an unclearly implemented safety filter in one of their areas of focus (frontier AI training). I am asking — why isn’t there a classifier to downgrade AI research requests? This is a mix of transparent and reasonable safety policies with quietly rolled-out market entrenchment tactics.
I personally cannot trust the best AI model in the world to work in my professional domains building models, which I’ve constructed entirely out of a passion for making sure the transition to very powerful AI systems goes well for society. This inevitably will feel like a declaration of superiority by the Anthropic leadership.
All of the actions Anthropic is taking, including calling out smaller Chinese companies for distillation, is well within their right. In fact, many people already expected the leading frontier models to be obviated from users so that labs can protect their IP. Today’s actions miss the big picture that AI will always be an ecosystem, and cultivating an us against them dynamic between the leading company and the other players is structurally unstable.
Remember, this is at a time when the AI ecosystem is seeing the first stirrings of violence against AI leaders — and I’ve heard from many people that they don’t expect it to abate. I wish I knew how to engage more to prevent this, and I see myself in the non-profit sector as someone who can hopefully independently represent AI to broader stakeholders.
I believe there was something misread, or at least misunderstood here, by the Anthropic leadership having a narrowly cultivated worldview around AI. An overwhelming sentiment I had today was one of obligation and confusion. I shared how I don’t really want to have to go to bat against Anthropic, but they’ve just been unnecessarily antagonistic to China, then not so subtly to open weight models, and now more broadly to open AI research.
I understand that Anthropic has a specific view of AI, but such a powerful technology will never have its final equilibrium be one of singular control by a private company. Anthropic showcased this earlier this year in the spat between the Department of Defense and themselves — which points to a long-term equilibrium where the government will either want AI to be controlled by them or to be open. This made me believe that an open ecosystem is a far safer outcome.
Many of these events make me feel that Anthropic’s leadership has a culture by which they can’t help but speedrun through these issues — going head to head with existing power structures. This adds substantial uncertainty into an AI ecosystem at a time when it is very much not needed.
Collectively, the last week could be seen as a major rallying point for a new open-source ecosystem in the U.S. Nvidia released their first flagship model last week — Nemotron 3 Ultra — and these actions from Anthropic have galvanized a unanimous motivation and concern among my peers building open models. We need intelligence that we can trust, that we can modify, and that we can control.
The American open-source ecosystem has its feet underneath it and keeps being given more reasons to fight for its leadership, right from the hands of the companies it directly undercuts. That’s the moral of this fable.
Fable is at $10 per million input and $50 per million output tokens.
based on the original Mythos roll-out, which is an imperfect metric.
Fable confirmed for me that these are different mechanisms.
関連記事
Claude Fable の制限がユーザーに通知されない件について(3 分読了)
Anthropic は、競合他社が Claude を使用してモデルを開発する際などに効果を抑える新たな介入措置を講じました。この対策はユーザーには表示されず、Fable 5 が別のモデルに切り替わることもなく、プロンプトの修正やパラメータ効率化微調整を通じて効果を制限します。
Claude Fable 5 の初回インプレッション
Simon Willison は Anthropic が発表した最新モデル「Claude Fable 5」を約 5.5 時間テストし、処理能力が非常に高い一方で速度が遅く高価であると評価した。
ジェレミー・ハワード氏への引用:AI の自己改善を抑制する提案
ジェレミー・ハワード氏は、最先端モデルを開発するラボがその技術を自らの研究に使用しないよう合意し、他社にはアクセスを認めることで、危険な権力格差を防ぎつつ AI 進化を抑制する解決策を提案した。
今日のまとめ
AI日報で今日の重要ニュースをまとめ読み