Jay Alammar · January 3, 2022, 09:00 · approx. 7 min read

The Illustrated Retrieval Transformer

#RETRO #RetrievalAugmentedLanguageModels #Efficiency #DeepMind #GPT-3 #ParameterEfficiency

TL;DR

DeepMind's RETRO Transformer integrates retrieval to match GPT-3's performance at about 4% of its size, showing that scaling up is not the only route to better language models.

AI in-depth analysis · February 27, 2026, 22:50


Depth: 40%

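Key points

Performance gains without scaling up

By integrating retrieval, the RETRO Transformer matches GPT-3's performance at about 4% of its size, demonstrating that scaling models up is not the only way to improve performance.

Separating language information from world knowledge

Language modeling becomes more efficient when language information is separated from factual information; supplying facts through retrieval is what allows the model to be smaller.

Implementing integrated retrieval

RETRO uses a neural database and builds in a mechanism to look up and retrieve the factual information it needs during text generation.

A shift in the industry trend

Together with OpenAI's WebGPT, it shows that small, high-performing models are possible with integrated retrieval, challenging the industry's exclusive focus on ever-larger models.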


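Impact analysis

This article points to a paradigm shift in language model development and may encourage a move from the scaling race toward efficiency and practicality. Given the potential to cut compute costs and environmental impact, it is an important development that could affect the whole industry.

Editorial comment

An important technical explainer that challenges the industry's scale-only trend while balancing practicality and innovation.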


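Discussion: comments, corrections, and feedback go in the discussion thread. Translations: Korean, Russian

Summary: The latest batch of language models can be much smaller yet achieve GPT-3-like performance by being able to query a database or search the web for information. This is a key indication that building ever-larger models is not the only way to improve performance.

The last few years saw the rise of Large Language Models (LLMs), machine learning models that rapidly improved how machines process and generate language. Some of the highlights since 2017 include: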

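The original Transformer broke previous performance records for machine translation.

BERT popularized the pre-train-then-fine-tune process, as well as Transformer-based contextualized word embeddings. It soon began to power Google Search and Bing Search.

GPT-2 demonstrated that a machine can write text comparable to human writing.

First T5, then T0 pushed the boundaries of transfer learning (training a model on one task and then having it do well on other, adjacent tasks) and posed many different tasks as text-to-text tasks.

GPT-3 showed that massive scaling of generative models can lead to striking emergent applications (the industry continues to train larger models such as Gopher and MT-NLG).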

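For a while, it seemed that scaling to larger and larger models was the main way to improve performance. Recent developments in the field, like DeepMind's RETRO Transformer and OpenAI's WebGPT, reverse this trend by showing that smaller generative language models can perform on par with massive models if we augment them with a way to search or query for information.

This article breaks down DeepMind's RETRO (Retrieval-Enhanced TRansfOrmer) and how it works. The model performs on par with GPT-3 despite being about 4% of its size (7.5 billion parameters vs. 175 billion for GPT-3 Da Vinci).

RETRO was presented in the paper Improving Language Models by Retrieving from Trillions of Tokens. It continues and builds on a wide variety of retrieval work in the research community. This article explains how the model works rather than what is especially novel about it.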

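Why This Is Important: Separating Language Information from World Knowledge Information

Language modeling trains models to predict the next word; in essence, to fill in the blank at the end of a sentence.

Filling the blank sometimes requires knowledge of factual information (e.g. names or dates).

Other times, familiarity with the language is enough to guess what goes in the blank.

This distinction is important because LLMs encode everything they know in their model parameters. While this makes sense for language information, it is inefficient for factual and world-knowledge information.

By including a retrieval method in the language model, the model can be much smaller. A neural database helps it retrieve the factual information it needs during text generation.

Training becomes faster with small language models, as less training data has to be memorized. Smaller models can also be deployed on smaller, more affordable GPUs and tweaked as needed.

Mechanically, RETRO is an encoder-decoder model just like the original Transformer. However, it augments the input sequence with the help of a retrieval database. The model finds the most similar sequences in the database and adds them to the input. RETRO then uses the augmented input to generate the output prediction.

Before we explore the model architecture, let's dig deeper into the retrieval database.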

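Inspecting RETRO's Retrieval Database

The database is a key-value store.

The key is a standard BERT sentence embedding.

The value is text in two parts:

Neighbor, which is used to compute the key

Completion, the continuation of the text in the original document

RETRO's database contains 2 trillion multilingual tokens based on the MassiveText dataset. Both the neighbor and completion chunks are at most 64 tokens long.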
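The key-value layout described above can be sketched in a few lines of Python. This is only an illustration, not DeepMind's implementation: `toy_embed` is a hypothetical stand-in for the BERT sentence embedding, and `build_database`, the dictionary field names, and the 8-dimensional key are all assumptions made for the example. Only the 64-token chunk length comes from the text.

```python
from typing import Callable, Dict, List

CHUNK_SIZE = 64  # chunk length stated in the article


def toy_embed(tokens: List[str], dim: int = 8) -> List[float]:
    """Hypothetical stand-in for a BERT sentence embedding (NOT a real model)."""
    vec = [0.0] * dim
    for tok in tokens:
        # Deterministic bucket per token, so results are reproducible.
        vec[sum(map(ord, tok)) % dim] += 1.0
    count = max(len(tokens), 1)
    return [v / count for v in vec]


def build_database(
    tokens: List[str],
    embed: Callable[[List[str]], List[float]] = toy_embed,
    chunk_size: int = CHUNK_SIZE,
) -> List[Dict]:
    """Split a token stream into chunks and store key/neighbor/completion entries."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    database = []
    for i, neighbor in enumerate(chunks):
        # The completion is the chunk that follows the neighbor in the document.
        completion = chunks[i + 1] if i + 1 < len(chunks) else []
        database.append({
            "key": embed(neighbor),    # computed from the neighbor text only
            "neighbor": neighbor,
            "completion": completion,
        })
    return database
```

Each entry pairs an embedding key with the text it was computed from and the text that followed it, which is what lets a lookup return both a matching passage and its continuation.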

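RETRO breaks the input prompt into multiple chunks. For simplicity, we'll focus on how one chunk is augmented with retrieved text. The model, however, does this for each chunk (except the first) in the input prompt.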
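As a rough sketch of that chunking step, the snippet below splits a tokenized prompt into fixed-size chunks and flags which ones would be augmented. The function names are hypothetical, and the 64-token chunk size is assumed to match the database chunk length mentioned earlier.

```python
from typing import List, Tuple


def split_into_chunks(tokens: List[str], chunk_size: int = 64) -> List[List[str]]:
    """Cut the prompt into consecutive chunks; the last one may be shorter."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def plan_retrieval(tokens: List[str], chunk_size: int = 64) -> List[Tuple[int, List[str], bool]]:
    """Return (index, chunk, augmented) triples; every chunk but the first is augmented."""
    chunks = split_into_chunks(tokens, chunk_size)
    return [(i, chunk, i > 0) for i, chunk in enumerate(chunks)]


if __name__ == "__main__":
    prompt = [str(i) for i in range(150)]
    for index, chunk, augmented in plan_retrieval(prompt):
        print(index, len(chunk), "augmented" if augmented else "no retrieval")
```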

The Database Lookup

Before reaching RETRO, the input prompt goes into BERT. The output contextualized vectors are averaged to construct a sentence embedding vector, which is then used to query the database.
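The averaging step is plain mean pooling. A minimal sketch, assuming the per-token vectors have already been produced by some encoder (here they are just hand-written lists, and `mean_pool` is a name chosen for the example):

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average per-token vectors into a single sentence embedding."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


if __name__ == "__main__":
    # Two tokens, two dimensions each (made-up numbers).
    print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # → [2.0, 3.0]
```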

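That sentence embedding is then used in an approximate nearest neighbor search (https://github.com/google-research/google-research/tree/master/scann).

The two nearest neighbors are retrieved, and their text becomes part of the input to RETRO.

This is now the input to RETRO: the input prompt and its two nearest neighbors from the database (and their continuations).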
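To make the lookup concrete, here is a small sketch that swaps the approximate search for an exact brute-force scan, which is only practical for toy data. The entry layout (`key`, `neighbor`, `completion`), the Euclidean distance, and the function names are assumptions for illustration; the article only states that the two nearest neighbors and their continuations are retrieved.

```python
import math
from typing import Dict, List


def l2_distance(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query: List[float], database: List[Dict], k: int = 2) -> List[Dict]:
    """Exact k-nearest-neighbor scan (a stand-in for approximate search)."""
    return sorted(database, key=lambda entry: l2_distance(query, entry["key"]))[:k]


def build_retro_input(prompt: str, query: List[float], database: List[Dict], k: int = 2) -> Dict:
    """Bundle the prompt with the retrieved neighbor and completion texts."""
    hits = nearest_neighbors(query, database, k)
    return {
        "prompt": prompt,
        "retrieved": [(hit["neighbor"], hit["completion"]) for hit in hits],
    }


if __name__ == "__main__":
    toy_db = [
        {"key": [0.0, 0.0], "neighbor": "text A", "completion": "more A"},
        {"key": [1.0, 0.0], "neighbor": "text B", "completion": "more B"},
        {"key": [5.0, 5.0], "neighbor": "text C", "completion": "more C"},
    ]
    print(build_retro_input("some prompt", [0.9, 0.1], toy_db))
```

A real system at the scale of trillions of tokens cannot scan every key, which is why an approximate index is used instead of the sorted scan shown here.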

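From here, the Transformer and RETRO blocks incorporate the information into their processing.

RETRO Architecture at a High Level

RETRO's architecture is an encoder stack and a decoder stack.

The encoder is made up of standard Transformer encoder blocks (self-attention + FFNN). To the best of my understanding, RETRO uses an encoder made up of two Transformer encoder blocks.

The decoder stack interleaves two kinds of decoder blocks:

Standard Transformer decoder block (ATTN + FFNN)

RETRO decoder block (ATTN + chunked cross-attention (CCA) + FFNN)

Let's start by looking at the encoder stack, which processes the retrieved neighbors, resulting in KEYS and VALUES matrices that will later be used for attention (see The Illustrated Transformer for a refresher).

Decoder blocks process the input text just as GPT would: each applies self-attention to the prompt tokens (causally, so attending only to previous tokens), then passes the result through an FFNN layer.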
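The causal restriction can be illustrated with a stripped-down, single-head attention in plain Python. This sketch makes several simplifying assumptions: it has no learned projection weights (the token vectors serve as queries, keys, and values at once), no FFNN, and each position attends to itself and earlier positions but never to later ones.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x: List[List[float]]) -> List[List[float]]:
    """Single-head attention where position i only sees positions 0..i."""
    dim = len(x[0])
    outputs = []
    for i, query in enumerate(x):
        visible = x[: i + 1]  # later tokens are masked out by slicing
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, visible))
            for c in range(dim)
        ])
    return outputs
```

Because of the mask, changing a later token leaves the outputs for all earlier positions unchanged, which is what lets the model generate text one token at a time.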

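Only when a RETRO decoder block is reached do we start to incorporate the retrieved information. Every third block starting from 9 is a RETRO block (which allows its input to attend to the neighbors). So layers 9, 12, 15…32 are RETRO blocks. (The two smaller RETRO models and the Retrofit models have these layers starting from the 6th instead of the 9th layer.)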
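The interleaving rule can be written down as a short helper. This is a sketch under assumptions: layers are taken to be 1-indexed, and the decoder depth of 32 is an assumption read off the "…32" in the text rather than a confirmed configuration. The function names are made up for the example.

```python
from typing import List


def retro_block_layers(num_layers: int = 32, start: int = 9, step: int = 3) -> List[int]:
    """1-indexed layers that are RETRO blocks: every `step`-th layer from `start`."""
    return list(range(start, num_layers + 1, step))


def decoder_layout(num_layers: int = 32, start: int = 9, step: int = 3) -> List[str]:
    """Label each decoder layer as a standard block or a RETRO block."""
    retro = set(retro_block_layers(num_layers, start, step))
    return ["RETRO" if layer in retro else "standard" for layer in range(1, num_layers + 1)]


if __name__ == "__main__":
    print(retro_block_layers())         # larger-model rule: start at layer 9
    print(retro_block_layers(start=6))  # smaller-model rule: start at layer 6
```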

So effectively, this is the step where the model can glance at the retrieved information for the dates it needs to complete the prompt.
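The core of that step is cross-attention: queries come from the prompt tokens, while keys and values come from the encoded neighbors. The sketch below is a simplified illustration, not RETRO's chunked cross-attention. It omits learned projections, multiple heads, and the chunk alignment details, and the function names are assumptions.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[List[float]],  # one vector per prompt token
    keys: List[List[float]],     # one vector per encoded neighbor token
    values: List[List[float]],   # same length as keys
) -> List[List[float]]:
    """Each prompt token takes a weighted average of the neighbor values."""
    dim = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, values))
            for c in range(out_dim)
        ])
    return outputs
```

A prompt token whose query matches a neighbor key strongly ends up with an output dominated by that neighbor's value, which is how a retrieved fact such as a date can flow into the prediction.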

Aiding language models with retrieval techniques has been an active area of research. Some of the previous work in the space includes:

Improving Neural Language Models with a Continuous Cache

Generalization through Memorization: Nearest Neighbor Language Models

Read the Retrieval Augmented Generation blog post from Meta AI and go through Jackie Chi Kit Cheung's lecture on Leveraging External Knowledge in Natural Language Understanding Systems.

SPALM: Adaptive Semiparametric Language Models

DPR: Dense Passage Retrieval for Open-Domain Question Answering

REALM: Retrieval-Augmented Language Model Pre-Training

FiD: Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

EMDR: End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering

BlenderBot 2.0: Internet-Augmented Dialogue Generation

Please post in this thread or reach out to me on Twitter with any corrections or feedback.


