Lilian Weng · January 27, 2023 09:00 · about 2 min read

The Transformer Family Version 2.0

#Transformer #LLM #Deep Learning Architecture #NLP

TL;DR

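Written by Lilian Weng, this article systematically restructures and updates her survey of Transformer architecture advances from the past three years, and serves as a comprehensive technical reference for practitioners.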

AI deep analysis · May 3, 2026 07:05


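Key points

Systematic restructuring and expansion

A comprehensive update that builds on the 2020 first edition, reorganizes the section hierarchy, and incorporates more recent papers, growing to about twice the original length.

Standardized mathematical notation

Lays out the main symbols and matrix definitions that recur in Transformer research, such as the model size ($d$), the number of heads ($h$), and the segment length of the input sequence ($L$).

Broad coverage of recent techniques

Covers architectural components and improvements such as multi-head attention and MoE (Mixture of Experts), as a superset of the old version.

Contribution to the research community

By organizing a complex line of developments, the article serves as an important reference for researchers and engineers who want an overall picture of the Transformer family.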


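Impact analysis

The article organizes the overall picture of rapidly evolving Transformer research and gives researchers and engineers a guide for efficiently catching up on recent architectural improvements. In particular, by structuring complex technical developments systematically, it helps lower learning costs and deepen understanding in both practical work and academic research.

Editor's comment

A high-quality reference article that comprehensively organizes the technical details. It is a must-read for developers and researchers who want a systematic understanding of recent architectural improvements.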

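Many new Transformer architecture improvements have been proposed since my last post on “The Transformer Family” about three years ago. Here I did a big refactoring and enrichment of that 2020 post: I restructured the hierarchy of sections and improved many sections with more recent papers. Version 2.0 is a superset of the old version, about twice the length.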

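Notations

| Symbol | Meaning |
| --- | --- |
| $d$ | The model size / hidden state dimension / positional encoding size. |
| $h$ | The number of heads in a multi-head attention layer. |
| $L$ | The segment length of the input sequence. |
| $N$ | The total number of attention layers in the model; not considering MoE. |
| $\mathbf{X} \in \mathbb{R}^{L \times d}$ | The input sequence, where each element has been mapped into an embedding vector of shape $d$, the same as the model size. |
| $\mathbf{W}^k \in \mathbb{R}^{d \times d_k}$ | The key weight matrix. |
| $\mathbf{W}^q \in \mathbb{R}^{d \times d_k}$ | The query weight matrix. |
| $\mathbf{W}^v \in \mathbb{R}^{d \times d_v}$ | The value weight matrix. Often we have $d_k = d_v = d$. |
| $\mathbf{W}^k_i, \mathbf{W}^q_i \in \mathbb{R}^{d \times d_k/h}; \mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h}$ | The weight matrices per head. |
| $\mathbf{W}^o \in \mathbb{R}^{d_v \times d}$ | The output weight matrix. |
| $\mathbf{Q} = \mathbf{X}\mathbf{W}^q \in \mathbb{R}^{L \times d_k}$ | The query embedding inputs. |
| $\mathbf{K} = \mathbf{X}\mathbf{W}^k \in \mathbb{R}^{L \times d_k}$ | The key embedding inputs. |
| $\mathbf{V} = \mathbf{X}\mathbf{W}^v \in \mathbb{R}^{L \times d_v}$ | The value embedding inputs. |
| $\mathbf{q}_i, \mathbf{k}_i \in \mathbb{R}^{d_k}, \mathbf{v}_i \in \mathbb{R}^{d_v}$ | Row vectors in the query, key, and value matrices $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$. |
| $S_i$ | A collection of key positions for the $i$-th query $\mathbf{q}_i$ to attend to. |
| $\mathbf{A} \in \mathbb{R}^{L \times L}$ | The self-attention matrix between an input sequence of length $L$ and itself. $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$. |
| $a_{ij} \in \mathbf{A}$ | The scalar attention score between query $\mathbf{q}_i$ and key $\mathbf{k}_j$. |
| $\mathbf{P} \in \mathbb{R}^{L \times d}$ | The position encoding matrix, where the $i$-th row $\mathbf{p}_i$ is the positional encoding for input $\mathbf{x}_i$. |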
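As a minimal sketch of how these symbols fit together, the pure-Python example below builds $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ from $\mathbf{X}$ and computes $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$ for a single head. The helper names (`matmul`, `softmax_rows`, `self_attention`), the toy sizes, and the random weights are assumptions made for illustration and do not come from the article. The final product $\mathbf{A}\mathbf{V}$ is the standard attention output and is not part of the notation list itself.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*a)]


def softmax_rows(scores):
    """Apply a numerically stable softmax to each row independently."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    """Hypothetical helper: a matrix of standard-normal samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention; returns (A, A V)."""
    Q = matmul(X, W_q)  # L x d_k
    K = matmul(X, W_k)  # L x d_k
    V = matmul(X, W_v)  # L x d_v
    scale = math.sqrt(len(Q[0]))  # sqrt(d_k)
    scores = [[s / scale for s in row] for row in matmul(Q, transpose(K))]
    A = softmax_rows(scores)  # L x L, each row sums to 1
    return A, matmul(A, V)    # output is L x d_v


# Toy sizes chosen only for illustration (assumption).
rng = random.Random(0)
L, d, d_k, d_v = 4, 8, 8, 8
X = random_matrix(L, d, rng)
W_q = random_matrix(d, d_k, rng)
W_k = random_matrix(d, d_k, rng)
W_v = random_matrix(d, d_v, rng)

A, out = self_attention(X, W_q, W_k, W_v)
print(len(A), len(A[0]))      # 4 4  (L x L)
print(len(out), len(out[0]))  # 4 8  (L x d_v)
```

Entry `A[i][j]` corresponds to the score $a_{ij}$ between query $\mathbf{q}_i$ and key $\mathbf{k}_j$. In the multi-head case, each head would use its own $\mathbf{W}^q_i, \mathbf{W}^k_i \in \mathbb{R}^{d \times d_k/h}$ and $\mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h}$ in place of the full matrices.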

Transformer Basics

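The Transformer (which will be referred to as “vanilla Transformer” to distinguish it from other enhanced versions; Vaswani, et al., 2017) model has an encoder-decoder architecture, as commonly used in many NMT models. Later, simplified Transformers were shown to achieve great performance in language modeling tasks, as in the encoder-only BERT or the decoder-only GPT.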

