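Jay Alammar · 19 January 2021, 09:00 · about 8 min read

Finding the Words to Say: Hidden State Visualizations for Language Models

#LLM #ModelInterpretability #Transformer #Visualization #OpenSource #ResearchTools

TL;DR

Jay Alammar introduces Ecco, an open-source package for visualizing the hidden states of Transformer language models, and proposes new ways to understand how a model processes its input internally.

AI in-depth analysis · 27 February 2026, 22:53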

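Key points

Understanding models by visualizing hidden states

By visualizing the hidden states at each layer of a Transformer language model, the article proposes a way to infer the "thought process" that leads the model to a particular output token.

Release of the Ecco open-source package

Ecco, an interactive visualization tool for making models more transparent, is released as open source for researchers and developers to use.

Analyzing score evolution by projecting internal states

By projecting the hidden states of every layer, not only the final one, onto the output vocabulary, the article shows how to analyze which layer contributed most to raising the score of a given output token.

Building on prior work

Citing the work of Voita et al. and Nostalgebraist, the article extends techniques for visualizing the evolution of token rankings, logit scores, and softmax probabilities.

Visualizing the evolution of hidden states

The article presents a technique that re-examines each layer's hidden state after an output token has been selected and visualizes how each layer ranked that token.

Early recognition of certain syntactic properties

Tokens such as newlines and periods are already predicted with confidence at Layer #0, showing that the model recognizes certain syntactic properties early.

Error analysis and token sampling

In the case where the model wrongly listed Chile, that token was ranked 43rd, which suggests the error may stem from the token sampling method rather than from the model itself.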

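Impact analysis

This article offers a practical technique for visualizing the internal processing of large language models, which tend to be treated as black boxes, and is an important step toward greater AI transparency and accountability. Because the tool is released as open source, it is expected to influence the wider research community.

Editorial comment

A practical article that explains technically advanced material clearly and contributes to the transparency of AI research. The open-source release should make the tool usable in real research settings.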

Finding the Words to Say: Hidden State Visualizations for Language Models

By visualizing the hidden state between a model's layers, we can get some clues as to the model's "thought process".

Part 2: Continuing the pursuit of making Transformer language models more transparent, this article showcases a collection of visualizations to uncover the mechanics of language generation inside a pre-trained language model. These visualizations are all created using Ecco, the open-source package we're releasing. In the first part of this series, Interfaces for Explaining Transformer Language Models, we showcased interactive interfaces for input saliency and neuron activations. In this article, we will focus on the hidden state as it evolves from one model layer to the next. By looking at the hidden states produced by every transformer decoder block, we aim to glean information about how a language model arrived at a specific output token. This method is explored by Voita et al. Nostalgebraist presents compelling visual treatments showcasing the evolution of token rankings, logit scores, and softmax probabilities for the evolving hidden state through the various layers of the model.

Recap: Transformer Hidden States

The following figure recaps how a transformer language model works: how the layers produce a final hidden state, and how that final state is then projected to the output vocabulary, resulting in a score for each token in the model's vocabulary. We can see here the top scoring tokens when DistilGPT2 is fed the input sequence " 1, 1, ":

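    # Load DistilGPT2 through Ecco (assumes the ecco package is installed)
    import ecco
    lm = ecco.from_pretrained('distilgpt2')

    # Generate one token to complete this input string
    output = lm.generate(" 1, 1, 1,", generate=1)

    # Visualize
    output.layer_predictions(position=6, layer=5)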
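As a rough illustration of the projection step described above, the following sketch multiplies a hidden state by an output embedding matrix and applies softmax. All values and names (`project_to_vocab`, `unembedding`, the four-token vocabulary) are made up for illustration and are not taken from DistilGPT2 or Ecco. A real GPT-2 model also applies a final layer normalization before this projection, which is omitted here.

```python
import math


def project_to_vocab(hidden_state, unembedding):
    """Return one logit per vocabulary token (dot product with each token's row)."""
    return [sum(h * w for h, w in zip(hidden_state, row)) for row in unembedding]


def softmax(logits):
    """Turn logits into probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy setup: 3-dimensional hidden state, 4-token vocabulary.
vocab = [" 1", " 2", " 0", ","]
unembedding = [
    [1.0, 0.0, 0.5],   # row for " 1"
    [0.5, 0.5, 0.0],   # row for " 2"
    [0.0, 1.0, 0.0],   # row for " 0"
    [-1.0, 0.0, 0.0],  # row for ","
]
final_hidden_state = [2.0, 0.5, 1.0]

logits = project_to_vocab(final_hidden_state, unembedding)
probs = softmax(logits)

# Sort tokens from most to least probable.
ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
for token, p in ranked:
    print(f"{token!r}: {p:.3f}")
```

Greedy decoding would simply pick the first entry of `ranked`.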

Scores after each layer

Applying the same projection to internal hidden states of the model gives us a view of how the model's conviction for the output scoring developed over the processing of the inputs. This projection of internal hidden states gives us a sense of which layer contributed the most to elevating the scores (and hence ranking) of a certain potential output token.

Viewing the evolution of the hidden states means that instead of looking only at the candidate output tokens from projecting the final model state, we can look at the top scoring tokens after projecting the hidden state resulting from each of the model's six layers.

    # Visualize the top scoring tokens after each layer
    output.layer_predictions(position=6)
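The idea of projecting every layer's hidden state, not just the last one, can be sketched in a few lines. This is a toy example under stated assumptions: the function name `top_tokens_per_layer`, the two-dimensional hidden states, and the vocabulary are hypothetical, and this is not how Ecco implements `layer_predictions` internally.

```python
def top_tokens_per_layer(layer_hidden_states, unembedding, vocab, k=2):
    """For each layer's hidden state, project to the vocabulary and keep the top-k tokens."""
    results = []
    for hidden in layer_hidden_states:
        # Same projection used for the final layer, applied to an intermediate state.
        logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembedding]
        order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
        results.append([vocab[i] for i in order[:k]])
    return results


# Hypothetical toy data: 4-token vocabulary, 2-dimensional hidden states, 3 layers.
vocab = [" 1", " 2", ",", " the"]
unembedding = [
    [1.0, 0.0],    # " 1"
    [0.8, 0.2],    # " 2"
    [0.0, 1.0],    # ","
    [-1.0, 0.5],   # " the"
]
layer_hidden_states = [
    [-1.0, 0.5],   # after layer 0
    [0.2, 0.6],    # after layer 1
    [2.0, 0.1],    # after layer 2
]

per_layer_top = top_tokens_per_layer(layer_hidden_states, unembedding, vocab, k=2)
for layer, tokens in enumerate(per_layer_top):
    print(f"Layer {layer}: {tokens}")
```

In this toy data the digit tokens only reach the top of the list at the last layer, which is the kind of pattern the per-layer view is meant to expose.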

Evolution of the selected token

Another visual perspective on the evolving hidden states is to re-examine the hidden states after selecting an output token to see how the hidden state after each layer ranked that token. This is one of the many perspectives explored by Nostalgebraist and the one we think is a great first approach. In the figure on the side, we can see the ranking (out of the 50,000+ tokens in the model's vocabulary) of the token ' 1', where each row indicates a layer's output.
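Ranking a selected token at each layer reduces to counting how many tokens scored higher than it. The sketch below shows that computation on made-up per-layer logits; `token_rank` and the numbers are illustrative assumptions, not values from the figure.

```python
def token_rank(logits, token_id):
    """1-based rank of token_id: 1 means no other token has a higher score."""
    return 1 + sum(1 for score in logits if score > logits[token_id])


# Hypothetical logits for a 5-token vocabulary after each of 3 layers.
per_layer_logits = [
    [0.1, 0.9, 0.5, 0.3, 0.2],  # after layer 0
    [0.6, 0.9, 0.5, 0.3, 0.2],  # after layer 1
    [1.5, 0.9, 0.5, 0.3, 0.2],  # after layer 2
]
selected_token_id = 0  # the token that was eventually chosen as output

ranks = [token_rank(logits, selected_token_id) for logits in per_layer_logits]
for layer, rank in enumerate(ranks):
    print(f"Layer {layer}: rank {rank}")
```

Repeating this for every generated position gives one column of ranks per output token.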

The same visualization can then be plotted for an entire generated sequence, where each column indicates a generation step (and its output token), and each row the ranking of the output token at each layer:

Let us demonstrate this visualization by presenting the following input to GPT2-Large:

"The countires of the European Union are:\n1. Austria\n2. Belgium\n3. Bulgaria\n4."

Visualizing the evolution of the hidden states sheds light on how various layers contribute to generating this sequence, as we can see in the following figure:

Figure: Hidden state evolution of an output sequence

The figure reveals:

Columns of solid pink correspond to newlines and periods. From Layer #0 onwards, the model is certain of these tokens, indicating Layer #0's awareness of certain syntactic properties (and that later layers raise no objections).

Columns where country names are predicted are very bright at the top and it's up to the last five layers to really come up with the appropriate token.

Columns tracking the incrementing number tend to be resolved at layer #9.

The model erroneously includes Chile in the list, although it is not an EU country. But notice that the ranking of that token is 43, indicating the error is better attributed to our token sampling method than to the model itself. All other countries were correct and ranked in the top 3.

Aside from Chile, the rest of the countries are correct and also follow the alphabetical order used in the input sequence.
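To see how a sampling method, rather than the model's scores, can surface a low-ranked token such as the one ranked 43, compare greedy decoding with sampling. The text does not say which sampling scheme produced the sequence, so top-k sampling with k=50 is assumed here purely for illustration, and the logits are synthetic.

```python
import math
import random


def greedy_pick(logits):
    """Always return the id of the highest-scoring token (rank 1)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring tokens, weighted by softmax."""
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = logits[top_ids[0]]
    weights = [math.exp(logits[i] - max_logit) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]


# Synthetic logits over 100 tokens: token id i has rank i + 1.
logits = [-0.05 * i for i in range(100)]
rng = random.Random(0)  # fixed seed so the run is repeatable

greedy_id = greedy_pick(logits)
sampled_ids = [top_k_sample(logits, 50, rng) for _ in range(2000)]
sampled_ranks = [token_id + 1 for token_id in sampled_ids]

print("greedy rank:", greedy_id + 1)
print("worst sampled rank:", max(sampled_ranks))
```

Greedy decoding never leaves rank 1, while top-k sampling can return any token inside the top k, including ones the final layer ranked far from first.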

Rankings of Other Tokens

We are not limited to watching the evolution of only one (the selected) token for a specific position. There are cases where we want to compare the rankings of multiple tokens in the same position, regardless of whether the model selected them or not.

One such case is the number prediction task described by Linzen et al., which arises from the English language phenomenon of subject-verb agreement. In that task, we want to analyze the model's capacity to encode syntactic number (whether the subject we're addressing is singular or plural) and syntactic subjecthood (which subject in the sentence we're addressing).

Put simply, it is a fill-in-the-blank task. The only acceptable answers are 1) is and 2) are:

The keys to the cabinet ______

To answer correctly, one has to first determine whether we're describing the keys (possible subject #1) or the cabinet (possible subject #2). Having decided it is the keys, the second determination would be whether it is singular or plural.

Contrast your answer for the first question with the following variation:

The key to the cabinets ______

The figures in this section visualize the hidden-state evolution of the tokens " is" and " are". The numbers in the cells are their rankings at the position of the blank. (Both columns address the same position in the sequence; they are not subsequent positions, as was the case in the previous visualization.)

The first figure (showing the rankings for the sequence "The keys to the cabinet") raises the question of why five layers fail the task and only the final layer sets the record straight. This is likely a similar effect to that observed in BERT, where the final layer is the most task-specific. It is also worth investigating whether the capability of succeeding at the task is predominantly localized in Layer 5, or whether that layer is only the final expression of a circuit spanning multiple layers that is especially sensitive to subject-verb agreement.
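Comparing two candidate tokens at the same position amounts to computing both ranks per layer and checking where the order flips. The following is a minimal sketch on invented logits that mimic the pattern described (the correct form only wins at the last layer); `compare_rankings` and all numbers are hypothetical.

```python
def token_rank(logits, token_id):
    """1-based rank of token_id among all logits (1 = highest score)."""
    return 1 + sum(1 for score in logits if score > logits[token_id])


def compare_rankings(per_layer_logits, vocab, tokens):
    """Return {token: [rank after layer 0, rank after layer 1, ...]} for the given tokens."""
    return {
        token: [token_rank(logits, vocab.index(token)) for logits in per_layer_logits]
        for token in tokens
    }


# Invented logits at the blank position for a 4-token vocabulary, 6 layers.
vocab = [" is", " are", " the", " of"]
per_layer_logits = [
    [2.0, 1.0, 3.0, 2.5],
    [2.5, 1.5, 3.0, 1.0],
    [3.0, 2.0, 2.5, 1.0],
    [3.0, 2.5, 2.0, 1.0],
    [3.0, 2.8, 2.0, 1.0],
    [2.5, 3.5, 2.0, 1.0],
]

rankings = compare_rankings(per_layer_logits, vocab, [" is", " are"])

# Layers where " are" (the correct form for "The keys ...") outranks " is".
layers_correct = [
    layer
    for layer, (r_is, r_are) in enumerate(zip(rankings[" is"], rankings[" are"]))
    if r_are < r_is
]
print(rankings)
print("layers preferring ' are':", layers_correct)
```

The same comparison applies to any pair of candidate tokens at a fixed position.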

Probing for bias

This method can shed light on questions of bias and where they might emerge in a model. The following figures, for example, probe for the model's gender expectation associated with different professions:

Figure: Probing bias in the model's association of gender with professions - Doctor and nurse

The first five layers all rank " man" higher than " woman" for both professions. For the nursing profession, the final layer decisively elevates " woman" to a higher ranking than " man".

A more systematic and nuanced examination of bias in contextualized word embeddings (another term for the vectors we've been referring to as "hidden states") can be found in .

Output Token Scores

Evolution of Selected Token

Comparing Token Rankings

Input: "Heathrow airport is located in the city of" Model: DistilGPT2

Input: "Some of the most glorious historical attractions in Spain date from the period of Muslim rule, including The Mezquita, built as the Great Mosque of Cordoba and the Medina Azahara, also in Cordoba and now in ruins but still visitable as such and built as the Madinat al-Zahra, the Palace of al-Andalus; and the Alhambra in Granada, a splendid, intact palace. There are also two synagogues still standing that were built during the era of Muslim Spain: Santa Maria la Blanca in Toledo and the Synagogue of Cordoba, in the Old City. Reconquista and Imperial era" Model: DistilGPT2

Model: GPT2-Large

Input: "The countires of the European Union are:\n1. Austria\n2. Belgium\n3. Bulgaria\n4." Model: DistilGPT2

Model: GPT2-Large

Acknowledgements

This article was vastly improved thanks to feedback on earlier drafts provided by Abdullah Almaatouq, Anfal Alatawi, Fahd Alhazmi, Hadeel Al-Negheimish, Isabelle Augenstein, Jasmijn Bastings, Najwa Alghamdi, Pepa Atanasova, and Sebastian Gehrmann.

Alammar, J. (2021). Finding the Words to Say: Hidden State Visualizations for Language Models [Blog post]. Retrieved from https://jalammar.github.io/hidden-states/

@misc{alammar2021hiddenstates, title={Finding the Words to Say: Hidden State Visualizations for Language Models}, author={Alammar, J}, year={2021}, url={https://jalammar.github.io/hidden-states/} }

![Figure: Finding the words to say. After a language model generates a sentence, we can visualize a view of how the model came by each word (column). Each row is a model layer. The value and color indicate the ranking of the output token at that layer. The darker the color, the higher the ranking. Layer 0 is at the top. Layer 47 is at the bottom. Model: GPT2-XL](/images/explaining/rankings-gpt2xl.png)

![Figure: Recap of transformer language models. This figure shows how the model arrives at the top five output token candidates and their probability scores. This shows us that at the final layer, the model is 59% sure the next token is ' 1', and that would be chosen as the output token by greedy decoding. Other probable outputs include ' 2' with 18% probability (maybe we are counting) and ' 0' with 5% probability (maybe we are counting down).](/images/explaining/transformer-language-model-steps.png)

Figure: Ten tokens with highest probabilities at the final layer of the model.

Figure: projecting inner hidden states to the model's vocabulary reveals cues of processing between layers.

![Figure: Top scoring tokens after each of the model's six layers. Each row shows the top ten predicted tokens obtained by projecting each hidden state to the output vocabulary. The probability scores are shown in pink (obtained by passing logit scores through softmax). We can see that Layer 0 has no digits in its top ten predictions. Layer 1 gives the token ' 1' a 0.03% probability, which, while low, still ranks the token as the seventh highest ranking token. Subsequent layers keep elevating the probability and ranking of ' 1', until the final layer injects a bit more caution by reducing the probability from 100% to ~60%, still retaining the token as the highest ranked in the model's output. Note: This figure is incorrect in showing 0 probability assigned to some tokens due to rounding. The current version of Ecco fixes this by showing '<0.01%'.](/images/explaining/predictions%20all%20layers.PNG)

![The ranking of the token ' 1' after each layer. Layer 0 elevated the token ' 1' to be the 31st highest scored token in the hidden state it produced. Layers 1 and 2 kept increasing the ranking (to 7 then 5 respectively). All the following layers were sure this is the best token and gave it the top ranking spot.](/images/explaining/logit_ranking_1.png)

![Evolution of the rankings of the output sequence ' 1 , 1'. We can see that Layer 3 is the point at which the model started to be certain of the digit ' 1' as the output. When the output is to be a comma, Layer 0 usually ranks the comma as 5. When the output is to be a ' 1', Layer 0 is less certain, but still ranks the ' 1' token at 31 or 32. Notice that every output token is ranked #1 after Layer 5. That is the definition of greedy sampling -- the reason we selected this token is because it was ranked first.](/images/expl
