TLDR AI · April 29, 2026, 09:00 · approx. 3 min read

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

#Transformer #Recurrent Neural Networks #LLM Architecture #Efficient Inference

TL;DR

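The Recurrent Transformer is a new architecture that introduces layerwise recurrence and persistent key–value pairs to overcome the Transformer's "temporal shallowness," enabling complex iterative reasoning without a large layer stack.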

AI In-Depth Analysis · April 29, 2026, 23:07


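Key Points

Countering temporal shallowness

The architecture addresses the "temporal shallowness" of the standard Transformer with a layerwise recurrent approach.

An internal memory mechanism

It combines temporary and persistent key–value pairs so that each layer maintains an internal memory that is updated continuously across the whole sequence.

Efficient inference

Because complex, iterative reasoning over the sequence no longer requires a very deep layer stack, the design is expected to improve compute efficiency and decoding speed.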
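One way to see why fewer, wider layers can shrink the KV cache at a fixed parameter budget is a back-of-the-envelope calculation. The sketch below rests on assumptions that do not come from the paper: roughly 12 · d_model² parameters per layer, full multi-head attention caching 2 · d_model values per token per layer, and a hypothetical budget and sequence length. The function names are illustrative only.

```python
import math

def width_for_budget(params, layers):
    # Assumption: each layer holds about 12 * d_model^2 parameters
    # (attention projections plus a 4x-wide MLP); embeddings are ignored.
    return math.sqrt(params / (12 * layers))

def kv_cache_entries(layers, d_model, seq_len):
    # Assumption: one key and one value vector of size d_model
    # per token per layer (no grouped-query or multi-query sharing).
    return 2 * layers * seq_len * d_model

budget = 12 * 24 * 1024 ** 2   # hypothetical budget, about 302M parameters
seq_len = 2048                 # hypothetical context length

deep = kv_cache_entries(24, width_for_budget(budget, 24), seq_len)
shallow = kv_cache_entries(12, width_for_budget(budget, 12), seq_len)
ratio = shallow / deep         # cache of the 12-layer model relative to the 24-layer one
print(f"KV cache ratio (12 vs 24 layers): {ratio:.3f}")
```

Under these assumptions, halving the layer count widens the model by a factor of √2, so the cache scales by (1/2) · √2 = 1/√2. The actual savings depend on the attention variant and on how the parameter budget is split.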


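Impact Analysis

This architecture could ease the trade-off between reasoning depth and efficiency in Transformer models, and may reduce compute costs, particularly for complex reasoning tasks and long-context processing. Across the industry, it is likely to have a significant influence on the trend toward lighter next-generation models that still offer strong reasoning ability.

Editor's Comment

As a new attempt to break through the limits of the standard Transformer, this research aiming to reconcile computational efficiency with reasoning depth is a noteworthy development.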

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

Costin-Andrei Oncescu∗ Depen Morwani Samy Jelassi Alexandru Meterez Mujin Kwun Sham Kakade

Harvard University

Abstract

Transformers process tokens in parallel but are temporally shallow: at position t, each layer attends to key–value pairs computed based on the previous layer, yielding a depth capped by the number of layers. Recurrent models offer unbounded temporal depth but suffer from optimization instability and historically underutilize modern accelerators. We introduce the Recurrent Transformer, a simple architectural change where each layer attends to key–value pairs computed off its own activations, yielding layerwise recurrent memory while preserving standard autoregressive decoding cost. We show that the architecture can emulate both (i) a conventional Transformer and (ii) token-to-token recurrent updates under mild assumptions, while avoiding optimization instability. Naively, prefill/training appears bandwidth-bound with effective arithmetic intensity near 1 because keys and values are revealed sequentially; we give an exact tiling-based algorithm that preserves the mathematical computation while reducing HBM traffic from Θ(N^2) to Θ(N log N), increasing effective arithmetic intensity to Θ(N/ log N) for sequence length N. On 150M and 300M parameter C4 pretraining, Recurrent Transformers improve cross-entropy over a parameter-matched Transformer baseline and achieve the improvement with fewer layers (fixed parameters), suggesting that recurrence can trade depth for width, thus reducing KV cache memory footprint and inference latency. Code is available at https://github.com/geniucos/recurrent-transformer
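The arithmetic-intensity figures in the abstract follow from dividing attention FLOPs by HBM traffic. A rough sketch, assuming a fixed head dimension and suppressing constants (the Θ(N^2) FLOP count for causal attention is a standard estimate, not a number quoted from the paper; I denotes arithmetic intensity):

```latex
\begin{align*}
\text{FLOPs} &= \Theta(N^2) \\
I_{\text{naive}} &= \frac{\Theta(N^2)}{\Theta(N^2)} = \Theta(1) \\
I_{\text{tiled}} &= \frac{\Theta(N^2)}{\Theta(N \log N)} = \Theta\!\left(\frac{N}{\log N}\right)
\end{align*}
```

For instance, N = 4096 gives N / log_2 N = 4096 / 12 ≈ 341, ignoring constant factors and the base of the logarithm.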

1 Introduction

Transformers [Vaswani et al., 2017] are highly effective sequence models, but their computation across positions is structurally shallow: within each layer, position t attends to key–value pairs computed from the previous layer embeddings, allowing essentially at most one interaction per layer between any pair of positions. A growing body of theory studies the fundamental limitations implied by bounded depth in attention models, including circuit-complexity characterizations of what low-depth Transformers can and cannot represent [Merrill et al., 2022, Liu et al., 2023]. These perspectives motivate architectures that achieve greater effective depth.

We introduce the Recurrent Transformer (RT), a simple modification of how key–value pairs are computed that makes each layer temporally recurrent. In a standard Transformer, at layer ℓ, the key–value pair at position t is computed from the layer-(ℓ − 1) representation at that position and can then be attended to by later positions t′ > t. In the Recurrent Transformer, by contrast, the key–value pair at position t in layer ℓ is computed from that position’s output at layer ℓ, rather than from its layer-(ℓ − 1) representation. Consequently, a later position t′ > t at layer ℓ attends to a representation at t that already reflects layer ℓ attention and MLP computation. Importantly, the Recurrent Transformer performs this recurrence separately within each layer, so each layer maintains its own key–value memory. This differs from the Feedback Transformer [Fan et al., 2020], which uses a shared memory across layers, and this layerwise separation is a key reason why our architecture can be implemented efficiently.
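The difference in where key–value pairs come from can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation (see their repository for that). It assumes a single head, no normalization, a tanh MLP, and shared projections for all key–value pairs. It also assumes that the current position attends to a temporary pair computed from its own input, since its layer output does not exist yet. All function and variable names are hypothetical.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def attend(q, keys, values):
    # Scaled dot-product attention of one query over a list of key-value pairs.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        out = [o + (w / z) * vj for o, vj in zip(out, v)]
    return out

def layer(xs, Wq, Wk, Wv, Wm, recurrent):
    keys, values, outs = [], [], []   # per-layer key-value memory
    for x in xs:                      # positions are processed left to right
        q = matvec(Wq, x)
        # Assumption: a temporary pair from the layer input stands in for
        # the current position.
        k_tmp, v_tmp = matvec(Wk, x), matvec(Wv, x)
        h = add(x, attend(q, keys + [k_tmp], values + [v_tmp]))
        y = add(h, [math.tanh(u) for u in matvec(Wm, h)])  # toy MLP + residual
        # Standard: later positions see KV from the layer input x.
        # Recurrent: later positions see KV from the layer output y.
        src = y if recurrent else x
        keys.append(matvec(Wk, src))
        values.append(matvec(Wv, src))
        outs.append(y)
    return outs

I2 = [[1.0, 0.0], [0.0, 1.0]]         # identity weights, for illustration only
xs = [[1.0, 0.0], [0.0, 1.0]]
std = layer(xs, I2, I2, I2, I2, recurrent=False)
rec = layer(xs, I2, I2, I2, I2, recurrent=True)
```

With `recurrent=False` the loop reduces to ordinary causal attention. With `recurrent=True` the first position is unchanged, while every later position reads keys and values that already include this layer's attention and MLP computation.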

We motivate the Recurrent Transformer’s design through the lenses of representation, optimization, and computational efficiency:

(i) Representational perspective. Recurrent Transformers retain per-token key–value memory just like a Transformer, but increase the space of computations that can be expressed within a single layer by allowing later positions to attend to representations that have already undergone attention and MLP processing. Under mild assumptions, Recurrent Transformers can emulate standard Transformer behavior; conversely, by restricting

∗Correspondence to: concescu@g.harvard.edu

arXiv:2604.21215v1 [cs.LG] 23 Apr 2026
