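Math needs thinking time, everyday knowledge needs memory, and a new Transformer architecture aims to deliver both
A German research team has developed a new architecture in which a Transformer model decides for itself how many times to think about a problem. Combined with additional memory, it outperforms larger models on math problems.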
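Key points
Self-determined number of thinking passes
The new Transformer architecture lets the model itself decide dynamically how many times to "think" about a problem.
Integration with additional memory
This self-directed thinking mechanism is designed to work together with additional memory that stores everyday knowledge.
Advantage on math problems
With this approach, the model was shown to outperform larger models at solving math problems.
Combining two capabilities
The architecture aims to provide both the "thinking time" that math problems need and the "memory" that everyday knowledge needs.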
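Impact analysis
This research presents a fundamental refinement of the Transformer architecture and addresses the core challenge of reconciling computational efficiency with reasoning ability. Because it can deliver strong performance in a specific domain (mathematical reasoning) while reducing reliance on larger models, it is likely to affect the practical use of AI models and their deployment costs.
Editorial comment
The idea of building a metacognitive element into the basic Transformer design is interesting, and it points to a new direction for improving the balance between model efficiency and capability. The results would be more persuasive if they were reported with concrete figures.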

This article, "Math needs thinking time, everyday knowledge needs memory, and a new Transformer architecture aims to deliver both", was first published on The Decoder.
A German research team lets Transformer models decide for themselves how many times they think about a problem. Combined with additional memory, the approach clearly outperforms larger models on math problems.
Language models can think step by step using chain-of-thought prompting, but each intermediate step requires additional tokens. Looped transformers offer an alternative: they run the same computation block multiple times on their internal representations without outputting intermediate steps as text. This saves parameters but costs storage capacity, because the model has fewer unique weights to store knowledge in.
A research team from the Lamarr Institute, Fraunhofer IAIS, and the University of Bonn wanted to find out whether this tradeoff can be resolved. Their architecture combines two mechanisms: adaptive looping, where each transformer layer uses a learned halt mechanism to decide how many times it repeats its computation block, and learned memory banks that provide additional knowledge.
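To make the idea of adaptive looping concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the sigmoid halting unit, the fixed threshold, and the names `adaptive_loop`, `toy_block`, `halt_weights`, and `halt_bias` are all assumptions chosen for illustration. The article only states that each layer uses a learned halt mechanism to decide how often it repeats its block.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_loop(hidden, block, halt_weights, halt_bias,
                  max_loops=3, threshold=0.5):
    """Apply `block` repeatedly until a halting unit fires or max_loops is hit.

    The halting unit is a hypothetical linear layer followed by a sigmoid.
    Returns the final hidden state and the number of loops actually used.
    """
    loops_used = 0
    for _ in range(max_loops):
        hidden = block(hidden)  # same weights are reused on every pass
        loops_used += 1
        halt_logit = sum(w * h for w, h in zip(halt_weights, hidden)) + halt_bias
        if sigmoid(halt_logit) > threshold:
            break  # the layer decides it has "thought" enough
    return hidden, loops_used


def toy_block(hidden):
    """Stand-in for a transformer block: a simple contraction toward 2.0."""
    return [0.5 * h + 1.0 for h in hidden]


state, loops = adaptive_loop([0.0, 0.0], toy_block, [1.0, 1.0], -2.5)
print(loops, state)  # 2 [1.5, 1.5]
```

In this toy setup the layer stops after two of its three allowed passes, because the halting probability crosses the threshold on the second pass. In a trained model the halting parameters would be learned along with everything else.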
The base architecture is a decoder-only transformer with 12 layers and about 200 million parameters, trained on 14 billion tokens from the deduplicated FineWeb Edu dataset. The looped variants allow each layer up to 3, 5, or 7 iterations. The memory banks include 1,024 local slots per layer and 512 global, shared slots, adding roughly 10 million extra parameters, according to the study.
A conventional transformer (left) passes data through each layer once. The loop model (center) can repeat each block multiple times. The combined model (right) also taps into local and global memory. | Image: Frey et al.
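As a rough plausibility check on the "roughly 10 million extra parameters" figure, the slot counts can be multiplied out. The calculation below rests on two assumptions that the article does not state: a hidden size of 768, and one learned vector of that size per memory slot. If each slot instead holds separate key and value vectors, or the hidden size differs, the total changes accordingly.

```python
def memory_bank_params(num_layers, local_slots, global_slots, hidden_size):
    """Count parameters if every slot is a single vector of length hidden_size."""
    total_slots = num_layers * local_slots + global_slots
    return total_slots * hidden_size


# 12 layers, 1,024 local slots per layer, 512 shared global slots (from the article).
# hidden_size=768 is an assumed value, not taken from the study.
params = memory_bank_params(num_layers=12, local_slots=1024,
                            global_slots=512, hidden_size=768)
print(params)  # 9830400
```

Under these assumptions the banks come to about 9.8 million parameters, which is in line with the reported order of magnitude.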
Looping boosts math, memory fills knowledge gaps
The results show that letting a model repeat its calculations up to three times makes it significantly better at math. The looped model scores 22 percent higher than the base model without loops. Demanding subcategories like Precalculus (31 percent improvement) and Intermediate Algebra (26 percent) benefit the most. For tasks that require everyday knowledge—like questions about social situations or physical intuition—the loops barely help. With additional iterations, performance actually drops slightly.
To put this in perspective, the researchers compared their 12-layer model with triple loops against a conventional 36-layer model that uses the same computational effort but no loops. Despite having only a third of the layers, the looped model performs 6.4 percent better on math benchmarks. Loops are more efficient than additional layers when it comes to mathematical reasoning, the researchers write.
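The compute-matched comparison comes down to simple arithmetic, sketched below. The function name `block_applications` is made up for this example, and the count treats every layer as using all three of its allowed passes, which is the upper bound for the looped model.

```python
def block_applications(num_layers, loops_per_layer):
    """Number of transformer block evaluations in one forward pass."""
    return num_layers * loops_per_layer


looped = block_applications(num_layers=12, loops_per_layer=3)
deep = block_applications(num_layers=36, loops_per_layer=1)

print(looped, deep)  # 36 36
print(12 / 36)       # share of unique blocks in the looped model vs. the deep one
```

Both models run 36 block evaluations per forward pass, but the looped model has only 12 distinct sets of block weights, a third of the deep model's 36.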
The memory banks solve a different problem. Everyday knowledge can't be generated through repeated thinking: it has to be stored. The memory banks provide exactly this extra capacity, closing part of the knowledge gap that loops alone can't bridge. Together, the model improves by another 4.2 percent on math and two percent on everyday knowledge tasks compared to the variant without memory, according to the study.
Early layers stay frugal, late layers work harder
Even though the model doesn't get an explicit penalty for the number of loop passes, a specialization shows up on its own: early layers learn to repeat their computation blocks only minimally and barely touch memory. Late layers, on the other hand, loop more intensively and tap into the memory banks more often.
Late layers (dark blue) repeat their calculations significantly more often during training than early layers (yellow). The bar chart in the middle shows the distribution at the end of training. | Image: Frey et al.
This fits with earlier research showing that early transformer layers encode local syntactic patterns, while later layers handle more complex semantic and logical operations. Simple computations don't benefit from extra iterations, but the more sophisticated operations in deeper layers do.
There's also a clear turning point during training: early on, the models barely use their loops even though they could. Only once the model gets good enough at understanding and predicting language does it start actually repeating its calculations. According to the researchers, this threshold kicks in at nearly the same point across all loop configurations. The model has to build up basic language skills first before it can benefit from repeated thinking.
The models only start actively using their loops once they reach a certain level of fluency. This threshold is nearly identical across all configurations. Color indicates training progress from dark blue (start) to yellow (end). | Image: Frey et al.
More computation demands more knowledge
The researchers see their results as evidence of a fundamental division of labor within transformers. Feed-forward layers act as a kind of memory for factual associations, while attention layers route and manipulate information. Looping improves routing but can't make up for insufficient storage capacity.
The fact that layers looping more often also pull more from memory supports this reading: loops and memory complement each other. More computation requires more facts.
Layers that loop more frequently also access more local memory (center). Global memory access, on the other hand, is evenly distributed across all layers (right). Both gate values increase during training, with local gates showing more variation (left). | Image: Frey et al.
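One common way to build a gated memory read is sketched below in plain Python. The article does not describe the exact read mechanism, so this is an assumption: the dot-product attention over slots and the scalar `gate` are illustrative choices, and `read_memory` is a hypothetical name.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(hidden, slots, gate):
    """Attend over memory slots and add the gated result to the hidden state.

    hidden: list of floats (one token's representation)
    slots:  list of slot vectors, each the same length as hidden
    gate:   scalar in [0, 1]; 0 ignores memory, 1 adds the full read
    """
    scores = [sum(h * s for h, s in zip(hidden, slot)) for slot in slots]
    weights = softmax(scores)
    read = [sum(w * slot[i] for w, slot in zip(weights, slots))
            for i in range(len(hidden))]
    return [h + gate * r for h, r in zip(hidden, read)]


slots = [[1.0, 0.0], [0.0, 1.0]]
print(read_memory([0.0, 0.0], slots, gate=1.0))  # [0.5, 0.5]
print(read_memory([0.0, 0.0], slots, gate=0.0))  # [0.0, 0.0]
```

A gate near zero leaves the hidden state untouched, while a larger gate mixes in more of the stored content. That matches the reported pattern in which late, heavily looping layers show higher local gate values than early ones.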
The authors point out some limitations: the experiments ran at a relatively small scale, about 200 million parameters and 14 billion training tokens. Whether the results hold for models with several billion parameters—which already have considerable built-in capacity—remains to be seen.