MIT ML News · April 9, 2026, 22:00 · about a 9-minute read

New technique makes AI models leaner and faster during training

#model-compression #efficient-training #state-space-models #control-theory #compute-efficiency #research-methods

TL;DR

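A research team from MIT and partner institutions has developed CompreSSM, a method that compresses AI models during training rather than afterward, sidestepping the usual trade-off between training a large model and trimming it, or training a small model and accepting weaker performance.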

AI deep analysis · April 9, 2026, 23:42

Depth: 40%

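Key points

Compression during training

Compression has traditionally been applied after training. CompreSSM instead removes unneeded components early in the training process, which makes model building more efficient.

Applying mathematical tools

Using Hankel singular values, a quantity borrowed from control theory, the method scores and ranks the importance of each internal state early in training (after about 10 percent of the process) and identifies the parts that can be dropped.

Notable performance gains

On image classification benchmarks, compressed models kept comparable accuracy while training up to 1.5 times faster, with the state dimension reduced to roughly one quarter of the original.

Target architecture

The method targets state-space models, an architecture family with applications ranging from language processing to audio generation and robotics.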

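Impact analysis

This approach could change how AI models are developed and may substantially reduce compute cost and environmental footprint. For large-model development in particular, it could open advanced AI work to organizations and researchers with limited resources.

Editorial comment

Moving compression into the training process itself is an interesting shift in thinking. If it proves practical, it could contribute substantially to the economics and sustainability of AI development. The research balances academic novelty with practical usefulness.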

Training a large artificial intelligence model is expensive, not just in dollars, but in time, energy, and computational resources. Traditionally, obtaining a smaller, faster model either requires training a massive one first and then trimming it down, or training a small one from scratch and accepting weaker performance.

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), Max Planck Institute for Intelligent Systems, European Laboratory for Learning and Intelligent Systems, ETH Zurich, and Liquid AI have now developed a new method that sidesteps this trade-off entirely by compressing models during training rather than after.

The technique, called CompreSSM, targets a family of AI architectures known as state-space models, which power applications ranging from language processing to audio generation and robotics. By borrowing mathematical tools from control theory, the researchers can identify which parts of a model are pulling their weight and which are dead weight, before surgically removing the unnecessary components early in the training process.
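
For readers new to the term, a state-space model in its simplest form (discrete-time, linear, time-invariant) can be written as below. This is the standard textbook form, not notation taken from the paper: the matrices A, B, C, D, the state x_k, input u_k, output y_k, and the state dimension n are symbols introduced here for illustration.

```latex
\begin{aligned}
x_{k+1} &= A\,x_k + B\,u_k, \qquad x_k \in \mathbb{R}^{n},\\
y_k     &= C\,x_k + D\,u_k .
\end{aligned}
```

In this picture, the state dimension n is the quantity the technique shrinks during training: fewer entries in x_k mean smaller A, B, and C and cheaper updates at every step.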

"It's essentially a technique to make models grow smaller and faster as they are training," says Makram Chahine, a PhD student in electrical engineering and computer science, CSAIL affiliate, and lead author of the paper. "During learning, they're also getting rid of parts that are not useful to their development."

The key insight is that the relative importance of different components within these models stabilizes surprisingly early during training. Using a mathematical quantity called Hankel singular values, which measure how much each internal state contributes to the model's overall behavior, the team showed they can reliably rank which dimensions matter and which don't after only about 10 percent of the training process. Once those rankings are established, the less-important components can be safely discarded, and the remaining 90 percent of training proceeds at the speed of a much smaller model.
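
To make the ranking idea concrete, here is a minimal sketch of how Hankel singular values can be computed for a small linear time-invariant system. It follows the textbook definition (square roots of the eigenvalues of the product of the controllability and observability Gramians), not the paper's implementation. The function names and the toy matrices are assumptions made up for this example, and the Kronecker-product solve only suits small state dimensions.

```python
import numpy as np


def gramians(A, B, C):
    """Controllability and observability Gramians of a stable discrete-time LTI system.

    Solves P = A P A^T + B B^T and Q = A^T Q A + C^T C by vectorizing the
    equations (fine for small n; large systems need a dedicated Lyapunov solver).
    """
    n = A.shape[0]
    eye = np.eye(n * n)
    P = np.linalg.solve(eye - np.kron(A, A), (B @ B.T).reshape(-1)).reshape(n, n)
    Q = np.linalg.solve(eye - np.kron(A.T, A.T), (C.T @ C).reshape(-1)).reshape(n, n)
    return P, Q


def hankel_singular_values(A, B, C):
    """Hankel singular values: square roots of the eigenvalues of P @ Q, largest first."""
    P, Q = gramians(A, B, C)
    eigs = np.linalg.eigvals(P @ Q).real
    return np.sort(np.sqrt(np.clip(eigs, 0.0, None)))[::-1]


# Toy 3-state system (made-up numbers): the third state is only weakly
# driven by the input and weakly read out by the output.
A = np.diag([0.9, 0.5, 0.2])
B = np.array([[1.0], [1.0], [0.01]])
C = np.array([[1.0, 1.0, 0.01]])

hsv = hankel_singular_values(A, B, C)
print(hsv)  # the last value is orders of magnitude smaller than the first
```

A state direction with a tiny Hankel singular value is both hard to excite from the input and hard to see at the output, which is why it is a natural candidate for removal.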

"What's exciting about this work is that it turns compression from an afterthought into part of the learning process itself,” says senior author Daniela Rus, MIT professor and director of CSAIL. “Instead of training a large model and then figuring out how to make it smaller, CompreSSM lets the model discover its own efficient structure as it learns. That's a fundamentally different way to think about building AI systems.”

The results are striking. On image classification benchmarks, compressed models maintained nearly the same accuracy as their full-sized counterparts while training up to 1.5 times faster. A compressed model reduced to roughly a quarter of its original state dimension achieved 85.7 percent accuracy on the CIFAR-10 benchmark, compared to just 81.8 percent for a model trained at that smaller size from scratch. On Mamba, one of the most widely used state-space architectures, the method achieved approximately 4x training speedups, compressing a 128-dimensional model down to around 12 dimensions while maintaining competitive performance.
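
Shrinking a model to a fraction of its state dimension, as in the results above, needs a rule for which states to keep. One classical option from control theory is balanced truncation, sketched below for a small linear time-invariant system. The article does not spell out the exact reduction step CompreSSM uses, so treat this as an illustration of the textbook technique rather than the paper's algorithm. All function names and numbers are assumptions.

```python
import numpy as np


def gramians(A, B, C):
    """Gramians of a stable discrete-time LTI system (small-n Kronecker solve)."""
    n = A.shape[0]
    eye = np.eye(n * n)
    P = np.linalg.solve(eye - np.kron(A, A), (B @ B.T).reshape(-1)).reshape(n, n)
    Q = np.linalg.solve(eye - np.kron(A.T, A.T), (C.T @ C).reshape(-1)).reshape(n, n)
    return P, Q


def balanced_truncation(A, B, C, r):
    """Keep the r states with the largest Hankel singular values.

    Square-root method; assumes the system is stable, controllable and
    observable so that both Gramians are positive definite.
    """
    P, Q = gramians(A, B, C)
    Lp = np.linalg.cholesky((P + P.T) / 2)          # P = Lp @ Lp.T
    Lq = np.linalg.cholesky((Q + Q.T) / 2)          # Q = Lq @ Lq.T
    U, s, Vt = np.linalg.svd(Lq.T @ Lp)             # s holds the Hankel singular values
    scale = 1.0 / np.sqrt(s[:r])
    T = (Lp @ Vt[:r].T) * scale                     # n x r, maps reduced state to full state
    T_inv = (scale[:, None] * U[:, :r].T) @ Lq.T    # r x n left inverse of T
    return T_inv @ A @ T, T_inv @ B, C @ T, s


# Toy 3-state system (made-up numbers) reduced to 2 states.
A = np.diag([0.9, 0.5, 0.2])
B = np.array([[1.0], [1.0], [0.1]])
C = np.array([[1.0, 1.0, 0.1]])

Ar, Br, Cr, hsv = balanced_truncation(A, B, C, r=2)
print(Ar.shape, hsv)
```

The change of coordinates puts the system in a form where each state has its own Hankel singular value, so dropping the trailing states removes exactly the directions ranked as least important.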

"You get the performance of the larger model, because you capture most of the complex dynamics during the warm-up phase, then only keep the most-useful states," Chahine says. "The model is still able to perform at a higher level than training a small model from the start."

What makes CompreSSM distinct from existing approaches is its theoretical grounding. Conventional pruning methods train a full model and then strip away parameters after the fact, meaning you still pay the full computational cost of training the big model. Knowledge distillation, another popular technique, requires training a large "teacher" model to completion and then training a second, smaller "student" model on top of it, essentially doubling the training effort. CompreSSM avoids both of these costs by making informed compression decisions mid-stream.
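
The saving from deciding mid-stream can be put into a rough cost model. The sketch below assumes training splits into a full-size warm-up followed by a compressed phase, and that each compressed step costs a fixed fraction of a full-size step. It ignores the cost of the compression step itself. The function name and the cost ratios are illustrative assumptions, not figures from the paper.

```python
def ideal_speedup(warmup_fraction, cost_ratio):
    """Speedup over full-size training under a two-phase cost model.

    warmup_fraction: share of training steps run at full size (before compression).
    cost_ratio: per-step cost of the compressed model relative to the full model.
    """
    relative_time = warmup_fraction + (1.0 - warmup_fraction) * cost_ratio
    return 1.0 / relative_time


# With a 10 percent warm-up, the speedup can never exceed 10x,
# even if the compressed model were free to train.
for ratio in (1.0, 0.5, 0.25, 0.0):
    print(ratio, round(ideal_speedup(0.10, ratio), 2))
```

Under this model, the reported speedups of up to 1.5 times on image classification and roughly 4 times on Mamba both sit below the 10x ceiling implied by a 10 percent warm-up.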

The team benchmarked CompreSSM head-to-head against both alternatives. Compared to Hankel nuclear norm regularization, a recently proposed spectral technique for encouraging compact state-space models, CompreSSM was more than 40 times faster, while also achieving higher accuracy. The regularization approach slowed training by roughly 16 times because it required expensive eigenvalue computations at every single gradient step, and even then, the resulting models underperformed. Against knowledge distillation on CIFAR-10, CompreSSM held a clear advantage for heavily compressed models: At smaller state dimensions, distilled models saw significant accuracy drops, while CompreSSM-compressed models maintained near-full performance. And because distillation requires a forward pass through both the teacher and student at every training step, even its smaller student models trained slower than the full-sized baseline.

The researchers proved mathematically that the importance of individual model states changes smoothly during training, thanks to an application of Weyl's theorem, and showed empirically that the relative rankings of those states remain stable. Together, these findings give practitioners confidence that dimensions identified as negligible early on won't suddenly become critical later.
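
The flavor of the smoothness argument can be checked numerically. In its standard singular-value form, Weyl's inequality says that perturbing a matrix by E moves each singular value by at most the spectral norm of E, so a small update cannot make a singular value jump. The article does not say which form of the theorem the paper applies, so the snippet below is only an illustration with random stand-in matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))          # stand-in for a matrix before an update
E = 1e-2 * rng.standard_normal((6, 6))   # stand-in for a small update

sv_before = np.linalg.svd(M, compute_uv=False)
sv_after = np.linalg.svd(M + E, compute_uv=False)

max_shift = np.max(np.abs(sv_after - sv_before))
bound = np.linalg.norm(E, 2)             # spectral norm of the perturbation

print(max_shift, bound)  # the first number should not exceed the second
```

Smooth change alone does not guarantee that rankings never swap, which is why the article pairs the proof with the empirical observation that the relative rankings stay stable.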

The method also comes with a pragmatic safety net. If a compression step causes an unexpected performance drop, practitioners can revert to a previously saved checkpoint. "It gives people control over how much they're willing to pay in terms of performance, rather than having to define a less-intuitive energy threshold," Chahine explains.
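
The safety net described above can be sketched as a small guard around the compression step. The interfaces here are hypothetical: `compress` and `evaluate` are caller-supplied callables, `max_drop` is a tolerance chosen by the practitioner, and the toy model is a plain dictionary with made-up numbers. A real training loop would save and restore framework checkpoints instead.

```python
import copy


def compress_with_rollback(model, compress, evaluate, max_drop):
    """Apply `compress` to a copy of `model`; keep it only if the metric holds up.

    compress, evaluate: caller-supplied callables (hypothetical interfaces).
    max_drop: largest acceptable decrease in the evaluation metric.
    Returns (model_to_continue_with, accepted_flag).
    """
    checkpoint = copy.deepcopy(model)          # saved state to fall back on
    baseline = evaluate(checkpoint)
    candidate = compress(copy.deepcopy(model))
    if baseline - evaluate(candidate) > max_drop:
        return checkpoint, False               # revert: the drop was too large
    return candidate, True


# Toy stand-ins (made-up numbers) to show the control flow.
model = {"state_dim": 8, "accuracy": 0.90}


def halve(m):
    m["state_dim"] //= 2
    m["accuracy"] -= 0.01
    return m


def accuracy(m):
    return m["accuracy"]


new_model, accepted = compress_with_rollback(model, halve, accuracy, max_drop=0.05)
print(accepted, new_model["state_dim"])  # True 4
```

Expressing the tolerance as an acceptable drop in the evaluation metric mirrors the point in the quote: the practitioner sets a performance budget directly instead of tuning an energy threshold.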

There are some practical boundaries to the technique. CompreSSM works best on models that exhibit a strong correlation between the internal state dimension and overall performance, a property that varies across tasks and architectures. The method is particularly effective on multi-input, multi-output (MIMO) models, where the relationship between state size and expressivity is strongest. For per-channel, single-input, single-output architectures, the gains are more modest, since those models are less sensitive to state dimension changes in the first place.

The theory applies most cleanly to linear time-invariant systems, although the team has developed extensions for the increasingly popular input-dependent, time-varying architectures. And because the family of state-space models extends to architectures like linear attention, a growing area of interest as an alternative to traditional transformers, the potential scope of application is broad.

Chahine and his collaborators see the work as a stepping stone. The team has already demonstrated an extension to linear time-varying systems like Mamba, and future directions include pushing CompreSSM further into matrix-valued dynamical systems used in linear attention mechanisms, which would bring the technique closer to the transformer architectures that underpin most of today's largest AI systems.

"This had to be the first step, because this is where the theory is neat and the approach can stay principled," Chahine says. "It's the stepping stone to then extend to other architectures that people are using in industry today."

"The work of Chahine and his colleagues provides an intriguing, theoretically grounded perspective on compression for modern state-space models (SSMs)," says Antonio Orvieto, ELLIS Institute Tübingen principal investigator and MPI for Intelligent Systems independent group leader, who wasn't involved in the research. "The method provides evidence that the state dimension of these models can be effectively reduced during training and that a control-theoretic perspective can successfully guide this procedure. The work opens new avenues for future research, and the proposed algorithm has the potential to become a standard approach when pre-training large SSM-based models."

The work, which was accepted as a conference paper at the International Conference on Learning Representations 2026, will be presented later this month. It was supported, in part, by the Max Planck ETH Center for Learning Systems, the Hector Foundation, Boeing, and the U.S. Office of Naval Research.
