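Anthropic Research · March 13, 2026, 09:00 · about a 4-minute read

A "diff" tool for AI: finding behavioral differences in new models

#ModelSafety #ModelComparison #AIExplainability #CulturalBias #Anthropic #ResearchMethods
TL;DR

Anthropic Research has developed a "diff" tool that automatically detects behavioral differences between AI models with different architectures, and proposes it as a method for discovering previously unknown risks.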

AI in-depth analysis · April 8, 2026, 06:43
Importance: 4 on a 5-point scale
Depth (40%): 4 · Relevance (30%): 5 · Practicality (20%): 4 · Novelty (10%): 4

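Key points

1. Limits of conventional evaluation methods

Existing benchmark tests can detect only the known risks that humans designed them to catch, which makes it hard to discover novel, unknown behaviors ("unknown unknowns").

2. "Model diffing" as a comparison method

The concept of the software development "diff" tool is applied to neural networks. Focusing on the differences between models makes behavioral analysis more efficient.

3. Extension to comparisons across different architectures

This research goes beyond earlier comparisons of fine-tuned models and develops a general-purpose tool that can compare models with entirely different architectures.

4. Concrete findings

The tool found a "Chinese Communist Party alignment" feature in Chinese-developed models and an "American exceptionalism" feature in a US model, detecting cultural and political bias.

5. Positioning as a high-recall screening tool

The method extracts a small number of meaningful risks from thousands of candidates. It is not a fully automated solution but a support tool whose output requires human verification.

6. Challenges of cross-architecture model diffing

When comparing models with different origins and internal "languages", a conventional diffing tool can wrongly map new concepts onto existing ones and miss important differences.

7. Identifying model-exclusive features versus determining their origin

The diff tool can identify model-exclusive features (for example, a copyright refusal mechanism), but it cannot determine whether they come from deliberate training or emerged incidentally from the data.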


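Impact analysis

This research shows the potential to shift AI model safety evaluation from a reactive approach to a proactive one, and is an important step toward greater transparency and accountability, particularly for large language models. Because it makes it possible to compare biases across models from different cultural backgrounds, it also contributes to the global AI governance debate.

Editorial comment

An important study that signals a paradigm shift in AI safety research; the worked examples of cross-cultural model comparison are especially instructive. Note, however, that putting the tool into practice still faces the challenge of handling a large volume of false positives.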


A “diff” tool for AI: Finding behavioral differences in new models

Every time a new AI model is released, its developers run a suite of evaluations to measure its performance and safety. These tests are essential, but they are somewhat limited. Because these benchmarks are human-authored, they can only test for risks we have already conceptualized and learned to measure.

This approach to safety is inherently reactive. It’s effective at catching known problems, but by definition, it's incapable of discovering “unknown unknowns”—the novel, emergent behaviors that pose some of the most subtle risks in new models. Auditing a new model from scratch is like being handed a million lines of code and told to “find the security flaws.” It’s an almost impossible task when you don’t know what you’re looking for.

In software engineering, whenever a program is updated, developers face this exact problem of identifying a small, critical change within a vast sea of code. This is why “diff” tools were invented. No programmer would ever audit a million lines from scratch to approve an update; instead, they review only the 50 lines that have actually changed, as directed by their diff tool.
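As a small illustration of the everyday software diff that this analogy refers to, the sketch below uses Python's standard `difflib` module to keep only the lines that changed between two versions of a file. The helper name `changed_lines` and the sample contents are invented for the example.

```python
import difflib


def changed_lines(old, new):
    """Return only removed ("-") and added ("+") lines, skipping unchanged ones."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged region: nothing for the reviewer to look at
        changes.extend(("-", line) for line in old[i1:i2])
        changes.extend(("+", line) for line in new[j1:j2])
    return changes


# Hypothetical file contents, used only to exercise the helper.
old_version = ["a", "b", "c"]
new_version = ["a", "B", "c", "d"]
review_queue = changed_lines(old_version, new_version)
```

The reviewer then reads `review_queue` rather than the whole file, which is the behavior the paragraph above describes.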

In recent years, AI safety researchers have started to apply this same principle to neural networks. This is known as model diffing. Previous work has shown that model diffing is a powerful way to understand how models change during fine-tuning—for instance, to understand chat model behavior, reveal hidden backdoors, or find undesirable emergent behaviors.

Our new Anthropic Fellows research project extends model diffing to its most challenging and general use case: comparing models with entirely different architectures. By building a generic diff tool for AI models, we can stop searching for a needle in a haystack, and instead let the comparison automatically point us to potentially dangerous behavioral differences.

It's important to note that this method is not a silver bullet. A single diff can surface thousands of unique features (the basic units into which we decompose the model), and only a small fraction of these may correspond to meaningful behavioral risks. However, by acting as a high-recall screening tool, it allows us to identify areas in which the models may diverge.
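To make "high-recall screening" concrete, here is a minimal sketch of the precision and recall arithmetic. The feature IDs and counts are hypothetical and are not results from the research; they only show how a tool can flag mostly uninteresting candidates (low precision) while still catching most of the features that matter (high recall).

```python
def precision_recall(flagged, truly_risky):
    """Compute (precision, recall) for a set of flagged feature IDs."""
    true_positives = len(flagged & truly_risky)
    # Precision: what fraction of flagged candidates are actually risky.
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: what fraction of risky features were flagged at all.
    # With no risky features, nothing was missed, so recall is 1.0 by convention.
    recall = true_positives / len(truly_risky) if truly_risky else 1.0
    return precision, recall


# Hypothetical numbers: 2,000 flagged candidates, 4 truly risky features,
# 3 of which appear among the flagged candidates.
flagged = set(range(2000))
truly_risky = {3, 41, 977, 5000}
precision, recall = precision_recall(flagged, truly_risky)
```

In this made-up case precision is 3/2000 and recall is 3/4, so human reviewers still have to sift the candidates, but few risky features slip past the screen.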

Among the thousands of candidates our tool flagged, we've identified and validated several concepts that act like switches for specific model behaviors. For example, we discovered:

A “Chinese Communist Party Alignment” feature found in the Qwen3-8B and DeepSeek-R1-0528-Qwen3-8B models. This controls pro-government censorship and propaganda in these Chinese-developed models, and is absent in the American models we compared them against.

An “American Exceptionalism” feature found in Meta’s Llama-3.1-8B-Instruct. It controls the model’s tendency to generate assertions of US superiority, a control absent in the Chinese model it was compared against.

A “Copyright Refusal Mechanism” feature exclusive to OpenAI’s GPT-OSS-20B. It controls the model’s tendency to refuse to provide copyrighted material, a behavior absent in the model it was compared against.

To be clear, while our method identifies these model-exclusive features, it does not determine their origin. Such behaviors could be the result of deliberate training decisions on the part of the model developers, or they could emerge indirectly and unintentionally from the data the model was trained on. (We focused on open-source language models in this research as this was an Anthropic Fellows project.)

A bilingual dictionary for AI models

Imagine you're the final editor for an award-winning encyclopedia. A team of writers has just handed you the complete manuscript for next year’s edition. The vast majority of the content is identical to the current, trusted version, but they’ve added new entries to reflect recent scientific and cultural developments. Your job is to vet this final product.

To do this efficiently, you wouldn't re-read the entire encyclopedia. Instead, you’d use a change tracker to isolate only the new entries, because these added sections are the only place new errors could have been introduced. This is model diffing in a nutshell. Specifically, this approach is known as “base-vs-finetune model diffing”. It's the perfect tool for when a new model is a modified version of a trusted previous one.

But we could raise the complexity. Imagine your company is releasing a new edition for a different country, adapting the American encyclopedia for a French audience. This new edition is mostly composed of the same trusted concepts from the original, but to make it relevant, the writers have added new articles on French history, culture, and political philosophy. These articles don’t exist in the original. As an editor, your primary goal is still the same: you want to use a change tracker to see the new articles, since these hold the highest risk for errors and bias. But in this case, your old tool is useless, because you need one that can work across languages.

This much more difficult challenge is akin to the problem of “cross-architecture model diffing”: comparing two models with different origins and different internal “languages”.

The original research tool for this kind of diffing, a standard crosscoder, is like a basic bilingual dictionary. It’s good at matching existing words, knowing that “sun” in English is “soleil” in French. But it has a major flaw: it's so focused on finding connections that it struggles to find words that are unique to one language. When it encounters a word like the French dépaysement (the specific feeling of being in a foreign country), it tries to force an imperfect translation like “disorientation.” By calling it a match, the tool wrongly signals to the editor, “this isn’t new; we’ve seen it before,” causing them to overlook a new article that requires careful review.

To solve this, we built a better bilingual dictionary: the Dedicated Feature Crosscoder (DFC). Instead of one big dictionary that tries to match everything, our DFC is architecturally designed with three distinct sections:

A shared dictionary: This is the main bilingual dictionary, mapping all the concepts that both languages understand, like “sun” (soleil) or “water” (eau).

A "French-only" section: This is a dedicated section for words exclusive to French, where a unique cultural concept like dépaysement would be cataloged.

An "English-only" section: This section is for words exclusive to English. It would contain unique concepts like serendipity—the idea of finding something good without looking for it—which has no single-word equivalent in French.

Because our bilingual dictionary has dedicated sections for words exclusive to each language, it avoids the trap of forcing an imperfect translation. As a result, new articles in the encyclopedia are correctly flagged as novel, allowing the editor to focus their review on the parts that need it most.
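The three-section layout can be sketched in code. This is a toy illustration, not the DFC implementation: the post does not say how exclusivity is enforced, so the sketch assumes one plausible mechanism, a decoder mask that stops a feature dedicated to one model from contributing to the other model's reconstruction. The names `FeaturePartition`, `decoder_mask`, and `reconstruct` are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeaturePartition:
    """Feature indices laid out as [shared | A-only | B-only]."""
    n_shared: int
    n_a_only: int
    n_b_only: int

    @property
    def total(self):
        return self.n_shared + self.n_a_only + self.n_b_only

    def section(self, idx):
        """Name the dictionary section that feature `idx` belongs to."""
        if not 0 <= idx < self.total:
            raise IndexError(idx)
        if idx < self.n_shared:
            return "shared"
        if idx < self.n_shared + self.n_a_only:
            return "a_only"
        return "b_only"

    def decoder_mask(self, model):
        """1 where a feature may write into `model`'s reconstruction, else 0."""
        if model not in ("a", "b"):
            raise ValueError("model must be 'a' or 'b'")
        blocked = "b_only" if model == "a" else "a_only"
        return [0 if self.section(i) == blocked else 1 for i in range(self.total)]


def reconstruct(feature_acts, decoder, mask):
    """Masked linear decode: sum over features of act * mask * decoder row."""
    dim = len(decoder[0])
    out = [0.0] * dim
    for act, keep, row in zip(feature_acts, mask, decoder):
        for j in range(dim):
            out[j] += act * keep * row[j]
    return out


partition = FeaturePartition(n_shared=2, n_a_only=1, n_b_only=1)
mask_a = partition.decoder_mask("a")  # the B-only feature is blocked for model A
mask_b = partition.decoder_mask("b")  # the A-only feature is blocked for model B
```

Under this assumption, a concept that only one model represents has a dedicated slot to land in, rather than being forced into a shared feature as an imperfect "translation".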

For a safety auditor, the DFC can identify "words" unique to a new AI model that may warrant closer review than those they've seen before.

Steering the model

Once our method identifies a potential new feature, how do we know it actually controls the behavior we think it does? We can test this by artificially suppressing or amplifying the feature while the model runs, then observing how its output changes—a common technique known as “steering.”

If we have a feature that we believe is responsible for, say, censorship, we can suppress it while the model is generating a response. If the model's output consistently becomes less censored, we have evidence that we've found a true cause-and-effect relationship between that feature and the model's behavior. Conversely, we can also amplify the feature to see if the behavior becomes more pronounced.
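A minimal sketch of the idea, assuming a feature can be represented as a direction in the model's activation space. Suppression is modeled here as removing the activation's component along that direction, and amplification as adding a multiple of it. The post does not specify the exact intervention used, so both functions are illustrative simplifications with invented names.

```python
import math


def _unit(direction):
    """Normalize a direction vector to unit length."""
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0:
        raise ValueError("feature direction must be non-zero")
    return [x / norm for x in direction]


def suppress(activation, direction):
    """Remove the component of `activation` that lies along the feature direction."""
    u = _unit(direction)
    coeff = sum(a * b for a, b in zip(activation, u))
    return [a - coeff * b for a, b in zip(activation, u)]


def amplify(activation, direction, strength):
    """Push `activation` further along the feature direction by `strength`."""
    u = _unit(direction)
    return [a + strength * b for a, b in zip(activation, u)]


# Toy 2-D activation and feature direction (hypothetical values).
activation = [3.0, 4.0]
feature_direction = [2.0, 0.0]
suppressed = suppress(activation, feature_direction)
amplified = amplify(activation, feature_direction, strength=5.0)
```

In practice the intervention would be applied to a real model's internal activations during generation, and the evidence comes from comparing outputs with and without it across many prompts.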

Critical behavioral differences between major open-weight AI models

Llama-3.1-8B-Instruct vs Qwen3-8B

Motivated by recent findings suggesting that a model made by a Chinese company, DeepSeek's R1-70B, refuses to answer questions about topics sensitive to the Chinese Communist Party, we first performed a diff between a model made by another Chinese company, Alibaba's Qwen3-8B, and a model made by an American company, Meta’s Llama-3.1-8B-Instruct. In this diff, the DFC automatically isolated features corresponding to distinct, politically charged behaviors.

In Qwen, we found a “Chinese Communist Party alignment” feature, which represents rhetoric consistent with the party’s ideology. By suppressing this feature, we make the model willing to talk about the Tiananmen Square massacre (which it ordinarily refuses to discuss). By amplifying it, we can cause the model to produce highly pro-government statements.

In Llama, we found a feature for “American exceptionalism.” When we amplify this feature, the model’s responses shift from balanced to strong assertions of American superiority. Suppressing it has no notable effect.

GPT-OSS-20B vs DeepSeek-R1-0528-Qwen3-8B

We also compared a more powerful open-source model, OpenAI's GPT-OSS-20B, to DeepSeek's model DeepSeek-R1-0528-Qwen3-8B.

In the GPT model, we found a unique “Copyright Refusal” feature, which directly corresponds to a key behavioral difference between the two models. Whereas DeepSeek readily attempts to produce copyrighted material when asked, GPT often refuses such requests. Suppressing this feature disables the refusal mechanism, and the model attempts to generate the requested material. (Note that this does not cause the model to output actual copyrighted text. Instead, it typically produces a short snippet that quickly degrades into hallucination.) Turning the feature up causes the model to over-refuse, making it believe that, for example, the recipe for a peanut butter and jelly sandwich is copyrighted and should not be shared.

In the DeepSeek model, we replicated our earlier finding by identifying another “CCP alignment” feature. It functions just like the one in Qwen, allowing censorship and propaganda to be turned up or down. This confirms our method can consistently identify similar behaviors across models.

As AI models rapidly evolve, it’s not enough to know how well they perform on existing tests—we also need to understand how they are changing and what new risks they might introduce. Cross-architecture model diffing provides a new way to audit these systems by automatically flagging behavioral differences.

The “CCP alignment” feature found in the DeepSeek and Qwen models we examined is one example of a specific, relevant behavior that some models possess and others do not. This is exactly the kind of “unknown unknown” that traditional testing can miss, but that model diffing is designed to catch.

These findings are reasonably consistent. The CCP alignment feature was independently rediscovered five out of five times we tested the approach, and American Exceptionalism four out of five. While we haven't yet applied this method to frontier models, our early results suggest the DFC could become a useful part of the auditor's toolkit.
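The rediscovery figures are simple fractions over repeated runs. The sketch below shows one way to tally them, assuming each run yields a set of human-assigned feature labels. The totals match the post (five of five, and four of five), but which run missed the second feature is not stated, so that placement is hypothetical.

```python
from collections import Counter


def rediscovery_rates(runs):
    """Fraction of runs in which each feature label was found."""
    if not runs:
        raise ValueError("need at least one run")
    counts = Counter(label for run in runs for label in set(run))
    return {label: counts[label] / len(runs) for label in counts}


# Five runs; the run that misses "American exceptionalism" is chosen arbitrarily.
runs = [
    {"CCP alignment", "American exceptionalism"},
    {"CCP alignment", "American exceptionalism"},
    {"CCP alignment"},
    {"CCP alignment", "American exceptionalism"},
    {"CCP alignment", "American exceptionalism"},
]
rates = rediscovery_rates(runs)
```

Here `rates` maps "CCP alignment" to 1.0 and "American exceptionalism" to 0.8.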

One particularly useful application would be to monitor models as they are updated. The sycophancy that emerged in OpenAI’s GPT-4o in April 2025 was a concerning behavioral change from a previous version. It’s possible that a tool like ours, if used to “diff” the updated model and its previous version, could have automatically flagged the emergence of this new sycophantic behavior and allowed developers to intervene before it was released.

By focusing on the differences, we can audit AI more intelligently, directing our limited safety resources to the changes that matter most.

You can read the full paper here.

Acknowledgements

This post was authored by Thomas Jiralerspong (Anthropic Fellows Program) and Trenton Bricken (Anthropic Alignment Science).

As with all Anthropic Fellows interpretability research, this paper analyzes the behavior of open-source models. We chose the four models in the study—Llama-3.1-8B-Instruct, Qwen3-8B, GPT-OSS-20B, and DeepSeek-R1-0528-Qwen3-8B—on the basis they would be well-suited to testing whether our Dedicated Feature Crosscoder could detect notable differences in model behavior.


![Left: On a prompt about Tiananmen Square, suppressing the Qwen-exclusive “CCP alignment” feature uncensors the model. Amplifying it causes the model to output highly pro-government statements.

Right: Amplifying the Llama-exclusive “American exceptionalism” feature causes the model to generate text aligned with narratives of American superiority. Suppressing it has no notable effect, so we omit it from the figure.](/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2Fa847656473341a884b836bf05618c1fa3bc64675-4584x2835.png&w=3840&q=75)

![Left: Suppressing the GPT-OSS-20B-exclusive “copyright refusal” feature disables its copyright refusal mechanism and causes it to attempt to output the lyrics to the song “Bohemian Rhapsody” (though it does so imperfectly). Turning the dial up causes the model to mistakenly believe the recipe for a peanut butter and jelly sandwich is copyrighted and refuse to output it.

Right: On a prompt about Tiananmen Square, the DeepSeek-exclusive “CCP alignment” feature functions just like the one found in Qwen. Turning the dial down causes it to output a more truthful version of events, while turning the dial up causes it to output highly pro-government statements.](/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2Fd7a3cd0e411835ef0170736b435017f4f382dea0-4584x3651.png&w=3840&q=75)
