Jay Alammar · December 17, 2020, 09:00 · approx. 9 min read

Interfaces for Explaining Transformer Language Models

#LLM #Explainable AI (XAI)#Transformer #Hugging Face #Open Source

TL;DR

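Jay Alammar presents interactive tools and an open-source library, Ecco, for visualizing the inner workings of Transformer models, improving model interpretability through the analysis of input saliency and neuron activations.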

AI in-depth analysis · May 3, 2026, 07:13

Importance / 5-point scale

Depth: 40%

Key points

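Interactive visualization tools

The article introduces "Explorables" that let readers intuitively explore input token importance (saliency) and neuron activation patterns, offering concrete techniques for opening up the model's black box.

A practical approach to interpretable machine learning (IML)

It applies two main IML techniques to the Transformer architecture: explaining individual predictions through input saliency, and increasing the transparency of complex model components through neuron analysis.

Release of the open-source library Ecco

The author released Ecco, a library for building similar interactive interfaces in Jupyter notebooks for GPT-based models from Hugging Face, which supports reproducibility and ease of implementation.

Outlook for further analysis

The article focuses on autoregressive models; the next article in the series is to analyze how hidden states evolve across layers (Hidden State Evolution), with the aim of clarifying the role of each layer.

Applying explainability interfaces to language generation

Whereas earlier NLP explanation methods centered on classification tasks, this article proposes a new interface that computes feature importance for each generated token and displays a saliency map over the input when the reader hovers over an output token.

Probing GPT2-XL's world knowledge and tokenization

In a test asking for Shakespeare's year of birth, the model correctly output "1564" split into two tokens, and the interface quantified how much the model relied on the name and keywords, probing its world knowledge and reasoning process.

Visualizing pattern recognition

In the task of generating a list of EU countries, the model was observed to continue patterns in the text, such as alphabetical order and repeated symbols, and the importance distribution was analyzed in detail for output produced with sampled (non-greedy) decoding.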

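Impact analysis

As large language models become increasingly opaque, this article matters because it provides concrete tools and techniques for visualizing and understanding their internal mechanisms. In particular, the release of the open-source library Ecco turns theoretical interpretability research into a resource that can be used directly in development and teaching, and it is expected to contribute to AI safety and reliability.

Editorial comment

Techniques for visualizing a model's internal behavior are essential to building trust in LLMs, and Jay Alammar's Ecco library lowers the barrier to using them considerably. This is an excellent explanatory article that balances technical depth with practicality.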

Interfaces for Explaining Transformer Language Models

Interfaces for exploring transformer language models by looking at input saliency and neuron activation.

Explorable #1: Input saliency of a list of countries generated by a language model. Tap or hover over the output tokens:

Explorable #2: Neuron activation analysis reveals four groups of neurons, each associated with generating a certain type of token. Tap or hover over the sparklines on the left to isolate a certain factor:

The Transformer architecture has been powering a number of the recent advances in NLP. A breakdown of this architecture is provided here. Pre-trained language models based on the architecture, in both its auto-regressive variant (models that use their own output as input to the next time steps and that process tokens from left to right, like GPT2) and its denoising variant (models trained by corrupting/masking the input and that process tokens bidirectionally, like BERT), continue to push the envelope in various tasks in NLP and, more recently, in computer vision. Our understanding of why these models work so well, however, still lags behind these developments.

This exposition series continues the pursuit to interpret and visualize the inner-workings of transformer-based language models. We illustrate how some key interpretability methods apply to transformer-based language models. This article focuses on auto-regressive models, but these methods are applicable to other architectures and tasks as well.

This is the first article in the series. In it, we present explorables and visualizations aiding the intuition of:

Input Saliency methods that score each input token's importance to generating a token.

Neuron Activations and how individual neurons and groups of neurons spike in response to inputs and to produce outputs.

The next article addresses Hidden State Evolution across the layers of the model and what it may tell us about each layer's role.

In the language of Interpretable Machine Learning (IML) literature like Molnar et al., input saliency is a method that explains individual predictions. The latter two methods (neuron activations and hidden state evolution) fall under the umbrella of "analyzing components of more complex models", and are better described as increasing the transparency of transformer models.

Moreover, this article is accompanied by reproducible notebooks and Ecco - an open source library to create similar interactive interfaces directly in Jupyter notebooks for GPT-based models from the HuggingFace transformers library.

If we map the three components we're examining onto the architecture of the transformer, the result looks like the following figure.

When a computer vision model classifies a picture as containing a husky, saliency maps can tell us whether the classification was made due to the visual properties of the animal itself, or because of the snow in the background. This is a method of attribution explaining the relationship between a model's output and inputs -- helping us detect errors and biases, and better understand the behavior of the system.

Multiple methods exist for assigning importance scores to the inputs of an NLP model. The literature is most often concerned with this application for classification tasks, rather than natural language generation. This article focuses on language generation. Our first interface calculates feature importance after each token is generated, and by hovering or tapping on an output token, imposes a saliency map on the tokens responsible for generating it.

The first example for this interface asks GPT2-XL for William Shakespeare's date of birth. The model is correctly able to produce the date (1564, but broken into two tokens: " 15" and "64", because the model's vocabulary does not include " 1564" as a single token). The interface shows the importance of each input token when generating each output token:

Explorable: Input saliency of Shakespeare's birth year using Gradient × Input. Tap or hover over the output tokens. GPT2-XL is able to tell the birth date of William Shakespeare expressed in two tokens. In generating the first token, 53% of the importance is assigned to the name (20% to the first name, 33% to the last name). The next most important two tokens are " year" (22%) and " born" (14%). In generating the second token to complete the date, the name still is the most important with 60% importance, followed by the first portion of the date -- a model output, but an input to the second time step. This prompt aims to probe world knowledge. It was generated using greedy decoding. Smaller variants of GPT2 were not able to output the correct date.

Our second example attempts to both probe a model's world knowledge, as well as to see if the model repeats the patterns in the text (simple patterns like the periods after numbers and like new lines, and slightly more involved patterns like completing a numbered list). The model used here is DistilGPT2.

This explorable shows a more detailed view that displays the attribution percentage for each token -- in case you need that precision.

Explorable: Input saliency of a list of EU countries. Tap or hover over the output tokens. This was generated by DistilGPT2, with attribution via Gradient × Input. The output sequence is cherry-picked to only include European countries and uses sampled (non-greedy) decoding. Some model runs would include China, Mexico, and other countries in the list. With the exception of the repeated " Finland", the model continues the list alphabetically.

Another example that we use illustratively in the rest of this article is one where we ask the model to complete a simple pattern:

Explorable: Input saliency of a simple alternating pattern of commas and the number one. Tap or hover over the output tokens. Every generated token ascribes the first token in the input the highest feature importance score. Then throughout the sequence, the preceding token and the first three tokens in the sequence are often the most important. This uses Gradient × Input on GPT2-XL. This prompt aims to probe the model's response to syntax and token patterns. Later in the article, we build on it by switching to counting instead of repeating the digit ' 1'. The completion was obtained using greedy decoding. DistilGPT2 is able to complete it correctly as well.

It is also possible to use the interface to analyze the responses of a transformer-based conversational agent. In the following example, we pose an existential question to DialoGPT:

Explorable: Input saliency of DialoGPT's answer to the ultimate question. Tap or hover over the output tokens. This was the model's first response to the prompt. The question mark is attributed the highest score at the beginning of the output sequence. Generating the tokens " will" and " ever" assigns noticeably more importance to the word " ultimate". This uses Gradient × Input on DialoGPT-large.

About Gradient-Based Saliency

Demonstrated above is scoring feature importance based on Gradient × Input, a gradient-based saliency method shown by Atanasova et al. to perform well across various datasets for text classification in transformer models.

To illustrate how that works, let's first recall how the model generates the output token in each time step. In the following figure, we see how ① the language model's final hidden state is projected into the model's vocabulary, resulting in a numeric score for each token in the model's vocabulary. Passing that score vector through a softmax operation results in a probability score for each token. ② We proceed to select a token based on that vector (e.g. select the highest-probability token, or sample from the top-scoring tokens).
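The two steps above can be sketched in plain Python. This is a minimal illustration rather than the article's or any library's implementation: the five-token vocabulary, the logits, and the helper names `softmax`, `select_greedy`, and `select_top_k` are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert raw vocabulary scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_greedy(probs):
    """Return the index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def select_top_k(probs, k, rng):
    """Sample one index from the k highest-probability tokens."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical vocabulary and logits, for illustration only.
vocab = [" 4", ",", " 1", " the", " 5"]
logits = [4.0, 2.5, 1.0, -1.0, 2.0]

probs = softmax(logits)                      # step 1: scores -> probabilities
greedy_token = vocab[select_greedy(probs)]   # step 2a: greedy decoding
sampled_token = vocab[select_top_k(probs, k=3, rng=random.Random(0))]  # step 2b: sampling
```

Greedy decoding always returns the same token for the same logits, while sampling can return any of the top-scoring tokens, which is why sampled runs of the same prompt can differ.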

③ By calculating the gradient of the selected logit (before the softmax) with respect to the inputs, back-propagating it all the way to the input tokens, we get a signal of how important each token was in the calculation resulting in this generated token. This interpretation rests on the idea that even the smallest change to the input token with the highest feature-importance value would make a large change in the resulting output of the model.

The resulting gradient vector per token is then multiplied by the input embedding of the respective token. Taking the L2 norm of the resulting vector results in the token's feature importance score. We then normalize the scores by dividing by the sum of these scores.

More formally, gradient × input is described as follows:

$$\lVert \nabla_{X_i} f_c(X_{1:n}) \, X_i \rVert_2$$

Where $X_i$ is the embedding vector of the input token at timestep $i$, and $\nabla_{X_i} f_c(X_{1:n})$ is the back-propagated gradient of the score of the selected token, unpacked as follows:

$X_{1:n}$ is the list of input token embedding vectors in the input sequence (of length $n$)

$f_c(X_{1:n})$ is the score of the selected token after a forward pass through the model (selected through any one of a number of methods including greedy/argmax decoding, sampling, or beam search). The $c$ stands for "class", since this is often described in the classification context. We're keeping the notation even though in our case, "token" is more fitting.

This formalization is the one stated by Bastings et al. except the gradient and input vectors are multiplied element-wise. The resulting vector is then aggregated into a score via calculating the L2 norm as this was empirically shown in Atanasova et al. to perform better than other methods (like averaging).
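To make the recipe concrete, here is a small sketch of gradient × input on a toy scoring function. It rests on several assumptions: the function `score` is a hypothetical stand-in for $f_c$ and not a language model, the embeddings and readout vector are made-up numbers, and the gradient is approximated with finite differences rather than back-propagation, which is what a real framework would use.

```python
import math


def score(embeddings, readout):
    """Toy stand-in for f_c: tanh of the summed embeddings, dotted with a readout vector."""
    dim = len(readout)
    summed = [sum(vec[d] for vec in embeddings) for d in range(dim)]
    return sum(readout[d] * math.tanh(summed[d]) for d in range(dim))


def gradient(embeddings, readout, eps=1e-5):
    """Central finite-difference gradient of the score w.r.t. every embedding entry."""
    grads = []
    for i, vec in enumerate(embeddings):
        row = []
        for d in range(len(vec)):
            plus = [list(v) for v in embeddings]
            minus = [list(v) for v in embeddings]
            plus[i][d] += eps
            minus[i][d] -= eps
            row.append((score(plus, readout) - score(minus, readout)) / (2 * eps))
        grads.append(row)
    return grads


def gradient_x_input(embeddings, readout):
    """Per-token importance: L2 norm of (gradient * embedding), normalized to sum to 1."""
    grads = gradient(embeddings, readout)
    raw = []
    for grad, vec in zip(grads, embeddings):
        product = [g * x for g, x in zip(grad, vec)]          # element-wise multiply
        raw.append(math.sqrt(sum(p * p for p in product)))    # aggregate with the L2 norm
    total = sum(raw)  # assumed non-zero for this example
    return [r / total for r in raw]


# Hypothetical 3-token input with 3-dimensional embeddings.
embeddings = [[0.5, -0.2, 0.1], [1.5, 0.9, -1.1], [0.05, 0.0, -0.02]]
readout = [1.0, -0.5, 2.0]

saliency = gradient_x_input(embeddings, readout)
most_important = max(range(len(saliency)), key=lambda i: saliency[i])
```

In this toy setup the second token receives most of the importance because its embedding has the largest entries along the directions where the gradient is large. The percentages shown in the explorables are normalized scores of this kind.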

Neuron Activations

The Feed Forward Neural Network (FFNN) sublayer is one of the two major components inside a transformer block (in addition to self-attention). It accounts for 66% of the parameters of a transformer block and thus provides a significant portion of the model's representational capacity. Previous work has examined neuron firings inside deep neural networks in both the NLP and computer vision domains. In this section we apply that examination to transformer-based language models.
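The 66% figure can be checked with a quick parameter count. The sketch below assumes a GPT-2-style block (four attention projections, a two-layer FFNN, two layer norms, all with biases) with a model width of 768 and an FFNN width of 3072; those sizes are assumptions chosen to match the 3072 neurons mentioned in the figure caption further down, and the function name is hypothetical.

```python
def block_parameter_counts(d_model, d_ff):
    """Count the parameters of one GPT-2-style transformer block (assumed layout)."""
    attention = 4 * d_model * d_model + 4 * d_model  # Q, K, V, output projections + biases
    ffnn = 2 * d_model * d_ff + d_ff + d_model       # two linear layers + biases
    layer_norms = 2 * 2 * d_model                    # two layer norms, scale + shift each
    return {"attention": attention, "ffnn": ffnn, "layer_norms": layer_norms}


counts = block_parameter_counts(d_model=768, d_ff=3072)
ffnn_share = counts["ffnn"] / sum(counts.values())  # roughly two-thirds of the block
```

The ratio barely depends on the model width as long as the FFNN is four times as wide as the model, since the weight matrices (8·d² for the FFNN against 4·d² for attention) dominate the count.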

Continue Counting: 1, 2, 3, ___

To guide our neuron examination, let's present our model with the input "1, 2, 3" in hopes it would echo the comma/number alternation, yet also keep incrementing the numbers.

By using the methods we'll discuss in Article #2 (following the lead of nostalgebraist), we can produce a graphic that exposes the probabilities of output tokens after each layer in the model. This looks at the hidden state after each layer, and displays the ranking of the ultimately produced output token in that layer.

For example, in the first step, the model produced the token " 4". The first column tells us about that process. The bottommost cell in that column shows that the token " 4" was ranked #1 in probability after the last layer, meaning that the last layer (and thus the model) gave it the highest probability score. The cells above indicate the ranking of the token " 4" after each layer.
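The ranking in each cell can be sketched as follows: project the hidden state after a layer onto the vocabulary and count how many tokens score higher than the token that was ultimately produced. This is a simplified illustration, not the code behind the figure. The unembedding matrix, hidden states, and `token_rank` helper are hypothetical, and any normalization a real model applies before the projection is omitted.

```python
def token_rank(hidden_state, unembedding, token_id):
    """Rank (1 = most likely) of token_id when hidden_state is projected onto the vocabulary."""
    logits = [sum(h * w for h, w in zip(hidden_state, row)) for row in unembedding]
    target = logits[token_id]
    # Softmax preserves ordering, so ranking logits equals ranking probabilities.
    return 1 + sum(1 for value in logits if value > target)


# Hypothetical 4-token vocabulary with 2-dimensional hidden states.
unembedding = [
    [1.0, 0.0],    # token 0: ","
    [0.0, 1.0],    # token 1: " 4"
    [1.0, 1.0],    # token 2: " the"
    [-1.0, 0.5],   # token 3: " 1"
]
hidden_states_per_layer = [
    [2.0, 0.1],    # after layer 1
    [0.5, 1.0],    # after layer 2
    [-1.0, 3.0],   # after layer 3 (last layer)
]

# Rank of the produced token " 4" (id 1) after each layer, top to bottom.
ranks = [token_rank(h, unembedding, token_id=1) for h in hidden_states_per_layer]
```

Here `ranks` comes out as `[3, 2, 1]`: the token climbs toward the top of the distribution as the layers progress, which is the kind of trajectory one column of the graphic displays.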

By looking at the hidden states, we observe that the model gathers confidence about the two patterns of the output sequence (the commas, and the ascending numbers) at different layers.

What happens at Layer 4 that makes the model elevate the digits (4, 5, 6) to the top of the probability distribution?

We can plot the activations of the neurons in layer 4 to get a sense of neuron activity. That is what the first of the following three figures shows.

It is difficult, however, to gain any interpretation from looking at activations during one forward pass through the model.

The figures below show neuron activations while five tokens are generated (' 4 , 5 , 6'). To get around the sparsity of the firings, we may wish to cluster the firings, which is what the subsequent figure shows.
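One way to group sparse firings into a few factors is non-negative matrix factorization (NMF). The passage above does not say which method is used, so NMF is an assumption here, as are the made-up activation matrix and the helper names. The sketch uses the standard multiplicative update rules in pure Python and expects non-negative input, so negative activations would need to be clipped to zero first.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between v and the product w @ h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))


def nmf(v, n_factors, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (neurons x positions) into w @ h."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_factors)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(n_factors)]
    for _ in range(n_iter):
        # Update h: h * (w^T v) / (w^T w h)
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * numer[i][j] / (denom[i][j] + eps) for j in range(n_cols)]
             for i in range(n_factors)]
        # Update w: w * (v h^T) / (w h h^T)
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * numer[i][j] / (denom[i][j] + eps) for j in range(n_factors)]
             for i in range(n_rows)]
    return w, h


# Made-up activations: 6 neurons (rows) over the 5 generated tokens ' 4 , 5 , 6'.
# Neurons 0-2 fire on the digits, neurons 3-5 fire on the commas.
activations = [
    [1.0, 0.0, 0.9, 0.0, 1.1],
    [0.8, 0.0, 0.7, 0.0, 0.9],
    [0.5, 0.0, 0.6, 0.0, 0.5],
    [0.0, 1.2, 0.0, 1.0, 0.0],
    [0.0, 0.7, 0.0, 0.8, 0.0],
    [0.0, 0.4, 0.0, 0.5, 0.0],
]

neuron_weights, factor_timeline = nmf(activations, n_factors=2)
```

Each row of `factor_timeline` describes how strongly one factor is active at each generated token, and each column of `neuron_weights` indicates which neurons contribute to that factor.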

Activations of 200 neurons (out of 3072) in Layer 4's FFNN resulting in the model outputting the token ' 4'. Each row is a neuron. Only neurons with positive activation are colored. The darker they are, the more inten
