Synced Review · April 11, 2025, 23:43 · About a 7-minute read

DeepSeek Signals Next-Gen R2 Model, Unveils Novel Approach to Scaling Inference with SPCT

#LLM #Reasoning #Reinforcement Learning #DeepSeek #Inference-Time Scaling

TL;DR

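DeepSeek AI has published a research paper detailing SPCT, a technique for scaling generalist reward models at inference time, and has hinted that its next-generation model, R2, is on the way. The article also covers the synergy between reinforcement learning (RL) and LLMs.

AI in-depth analysis · May 3, 2026, 02:16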

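Key points

Inference-time scaling with SPCT

The paper "Inference-Time Scaling for Generalist Reward Modeling" introduces a method in which a generalist reward model (GRM) dynamically generates principles and critiques. The model is trained with rejection fine-tuning and rule-based online reinforcement learning.

The shift to an inference-focused paradigm after o1

The focus of scaling is moving from pre-training to post-training, and especially to the inference phase. Models such as OpenAI's o1 and DeepSeek's R1 series spend extra compute on longer thinking time to solve complex problems more effectively, and this approach is becoming mainstream.

Synergy between LLMs and reinforcement learning (a multiplicative relationship)

According to Wu Yi of Tsinghua University, pre-training builds understanding and memory while reinforcement learning optimizes decision-making. He calls this a "multiplicative relationship" and sees the combination of the two as key to a complete intelligent agent.

Hints of a next-generation R2 model

Alongside this work, the company has hinted that its next-generation model, R2, will arrive soon, raising expectations in the AI community.

Improved inference-time scalability with SPCT

Researchers from DeepSeek and Tsinghua University propose Self-Principled Critique Tuning (SPCT), which improves the scalability and generality of generalist reward models at inference time.

Meta reward model guides the voting process

The GRM generates principles and critiques across multiple samples. A meta reward model (Meta RM) that judges whether they are correct guides the voting process, which strengthens inference-time scaling.

Addressing challenges in RL scaling laws

Conventional scaling laws rely on more data and compute. Scaling in reinforcement learning is also shaped by reward sparsity and other more complex factors, and SPCT offers a new approach to that problem.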

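Impact analysis

This news shows that LLM progress is shifting from simply increasing parameter counts toward optimizing the inference process and integrating more deeply with reinforcement learning. DeepSeek's new technique could help models plan more systematically and over longer horizons on complex tasks, which may substantially broaden real-world applications.

Editorial comment

DeepSeek is giving concrete technical form to the use of compute at inference time, an important trend that follows the success of o1. Its apparent intent to build this into a next-generation model is likely to accelerate a paradigm shift across the industry.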

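DeepSeek AI, a prominent player in the large language model field, recently published a research paper detailing a new technique for improving the scalability of generalist reward models (GRMs) at inference time. At the same time, the company has hinted that its next-generation model, R2, will arrive soon, building anticipation in the AI community.

The paper, titled "Inference-Time Scaling for Generalist Reward Modeling," introduces a method that lets GRMs optimize reward generation by dynamically producing principles and critiques. This is achieved through rejection fine-tuning and rule-based online reinforcement learning [1-1].

This development comes as the paradigm for scaling large language models (LLMs) shifts from pre-training to post-training, and particularly to the inference phase, following the emergence of models such as OpenAI's o1. The approach uses more reinforcement learning (train-time compute) and longer "thinking time" (test-time compute) to keep improving model performance. Notably, o1 generates a lengthy internal chain of thought before responding to users, refining its reasoning, exploring different strategies, and identifying its own errors.

DeepSeek's own R1 series has further validated the potential of pure reinforcement learning training, without relying on supervised fine-tuning, to deliver significant leaps in LLM reasoning capability.

The fundamental "next token prediction" mechanism of LLMs provides vast knowledge, but it often lacks deep planning and the ability to predict long-term outcomes, which makes models prone to short-sighted decisions. Reinforcement learning serves as a crucial complement, giving LLMs an "internal world model." With it, a model can simulate the potential outcomes of different reasoning paths, evaluate the quality of those paths, and select better solutions, which ultimately leads to more systematic long-term planning. The synergy between LLMs and reinforcement learning is increasingly recognized as key to solving complex problems.

Wu Yi, an assistant professor at Tsinghua's Institute for Interdisciplinary Information Sciences (IIIS), described the relationship between LLMs and reinforcement learning as a "multiplicative relationship" in a recent podcast. Reinforcement learning excels at decision-making but inherently lacks understanding. Understanding is built by pre-trained models, and reinforcement learning then further optimizes decision-making on top of that foundation. This "multiplicative relationship" suggests that reinforcement learning can fully unlock its potential to create a complete intelligent agent only once pre-training has built a strong foundation of understanding, memory, and logical reasoning [1-2].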

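A comprehensive survey paper titled "Reinforcement Learning Enhanced LLMs: A Survey" outlines the typical three-step process of using reinforcement learning (RL) to train LLMs:

Reward Model Training: Before fine-tuning, a reward model (or reward function) is trained to approximate human preferences and evaluate different LLM outputs.

Preference-Based Fine-Tuning: In each fine-tuning iteration, the LLM generates multiple responses to a given instruction, and each response is scored by the trained reward model.

Policy Optimization: RL optimization techniques update the model's weights based on the preference scores, with the aim of improving response generation.

Integrating reinforcement learning lets LLMs adjust dynamically to varying preference scores, moving beyond the limitation of a single, pre-determined answer.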
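The three steps above can be sketched as a toy loop. This is a minimal illustration and not the survey's or DeepSeek's implementation: the candidate responses, the hand-written `reward_model` lookup, and the REINFORCE update with a mean baseline are all assumptions chosen to keep the example self-contained. A real system would use a learned reward model and a neural policy.

```python
import math
import random

# Hypothetical candidate responses to a single instruction (illustrative only).
RESPONSES = ["short vague answer", "detailed correct answer", "off-topic answer"]


def reward_model(response: str) -> float:
    """Step 1 stand-in: pretend this is a trained reward model.

    Assumption: a fixed lookup table replaces a learned model of human preference.
    """
    scores = {
        "short vague answer": 0.2,
        "detailed correct answer": 1.0,
        "off-topic answer": -0.5,
    }
    return scores[response]


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(steps=1000, samples_per_step=4, lr=0.1, seed=0):
    """Run the sample, score, update loop and return final response probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(RESPONSES)  # the "policy": one logit per response
    for _ in range(steps):
        probs = softmax(logits)
        # Step 2: generate several responses and score each with the reward model.
        idxs = rng.choices(range(len(RESPONSES)), weights=probs, k=samples_per_step)
        rewards = [reward_model(RESPONSES[i]) for i in idxs]
        baseline = sum(rewards) / len(rewards)
        # Step 3: policy optimization via a REINFORCE-style gradient step.
        for i, r in zip(idxs, rewards):
            advantage = r - baseline
            for j in range(len(logits)):
                grad_log_prob = (1.0 if j == i else 0.0) - probs[j]
                logits[j] += lr * advantage * grad_log_prob
    return softmax(logits)
```

Calling `train_policy()` returns a probability for each entry in `RESPONSES`. Because the policy is updated from scores rather than from a single reference answer, changing the reward table changes which response the policy comes to prefer.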

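DeepSeek's SPCT: Addressing the Scaling Challenges of RL for LLMs

Reinforcement learning in post-training has been a breakthrough for LLM performance. Even so, the algorithms themselves still have significant room for improvement, and the "scaling laws" of reinforcement learning are still in their nascent stages.

Traditional scaling laws focus on increasing data and compute to improve model performance. Scaling laws for reinforcement learning are influenced by more complex factors, including sample throughput, model parameter size, and the complexity of the training environment.

A major hurdle in scaling reinforcement learning is reward sparsity. The reward model is a critical component, and generating accurate reward signals is paramount. Achieving both generalization and continuity in reward models is therefore a key focus.

Researchers from DeepSeek and Tsinghua University addressed this challenge in their recent work by exploring the scalability and generalization of reward models at inference time. Their proposed Self-Principled Critique Tuning (SPCT) method aims to improve the scalability of generalist reward modeling during inference.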

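The SPCT approach involves two key stages:

Rejection Fine-Tuning: This serves as a cold start, enabling the GRM to adapt to generating principles and critiques in the correct format and type.

Rule-Based Online RL: This stage further optimizes the generation of principles and critiques.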
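The article does not spell out the filtering or reward rules, so the following is only a sketch of how the two stages could look. It assumes that each GRM sample assigns a pointwise score to every candidate response, and that the ground-truth best response is known. The `Trajectory` type, the "drop queries where every sample is already correct" heuristic, and the +1/-1 reward values are assumptions for illustration, not details confirmed by the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """One sampled GRM output (hypothetical structure)."""
    principles: str
    critique: str
    scores: List[int]  # one pointwise score per candidate response


def predicts_correctly(traj: Trajectory, best_index: int) -> bool:
    """True if the ground-truth best response receives the strictly highest score."""
    top = traj.scores[best_index]
    return all(top > s for i, s in enumerate(traj.scores) if i != best_index)


def rejection_filter(samples: List[Trajectory], best_index: int) -> List[Trajectory]:
    """Stage 1 sketch: keep only trajectories whose scores agree with ground truth.

    Assumed heuristic: if every sample is already correct, the query is treated
    as too easy and contributes no fine-tuning data.
    """
    kept = [t for t in samples if predicts_correctly(t, best_index)]
    if len(kept) == len(samples):
        return []
    return kept


def rule_based_reward(traj: Trajectory, best_index: int) -> float:
    """Stage 2 sketch: a rule-based outcome reward for online RL (assumed +1 / -1)."""
    return 1.0 if predicts_correctly(traj, best_index) else -1.0
```

In this sketch the reward for online RL comes from a fixed rule that checks the outcome, so no separate learned reward model is needed to train the GRM itself.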

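To achieve effective inference-time scaling, the researchers used parallel sampling to maximize compute utilization. By sampling multiple times, DeepSeek-GRM generates different sets of principles and critiques and selects the final reward through voting. In addition, a meta reward model (Meta RM) is trained to guide the voting process and further improve scaling performance. The Meta RM is a pointwise scalar reward model designed to identify whether the principles and critiques generated by DeepSeek-GRM are correct.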
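The voting step described above can be sketched as follows. The article says only that the final reward is selected "through voting" and that the Meta RM "guides" it. Summing per-response scores across samples, and keeping only the `k_meta` samples that the Meta RM rates highest, are assumptions made for this illustration, as are the function and parameter names.

```python
from typing import List


def vote(samples: List[List[int]]) -> List[int]:
    """Aggregate pointwise scores across samples by summing per response.

    Each inner list holds one sample's scores, one entry per candidate response.
    """
    return [sum(column) for column in zip(*samples)]


def meta_guided_vote(
    samples: List[List[int]], meta_scores: List[float], k_meta: int
) -> List[int]:
    """Keep the k_meta samples with the highest meta RM score, then vote.

    meta_scores[i] is a hypothetical scalar from the Meta RM estimating how
    likely sample i's principles and critique are to be correct.
    """
    ranked = sorted(range(len(samples)), key=lambda i: meta_scores[i], reverse=True)
    kept = [samples[i] for i in ranked[:k_meta]]
    return vote(kept)


# Four parallel samples scoring two candidate responses (made-up numbers).
samples = [[7, 5], [8, 4], [3, 9], [6, 5]]
meta_scores = [0.9, 0.8, 0.1, 0.7]

plain = vote(samples)                               # [24, 23]
guided = meta_guided_vote(samples, meta_scores, 3)  # [21, 14]
```

In this made-up example a single low-quality sample (`[3, 9]`) nearly ties the plain vote, while filtering by the meta score widens the margin in favor of the first response. Drawing more samples spends more inference-time compute, which is how this scheme scales.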

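Experimental results showed that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models on multiple comprehensive RM benchmarks without significant domain bias.

Looking Ahead: DeepSeek R2 on the Horizon

The research paper focuses on advances in reward modeling and inference-time scaling. Still, the mention of DeepSeek's R1 series and the implied progression suggest that the company is actively developing its next-generation model, R2. Given DeepSeek's emphasis on pure reinforcement learning for improving reasoning, R2 is widely expected to incorporate and build on the insights from this latest research on scalable reward models.

The AI community will be watching closely for further announcements about DeepSeek R2, eager to see how the company uses its approaches to reinforcement learning and inference optimization to push the limits of LLM capability. The focus on scalable reward models hints at a possible emphasis on more sophisticated self-evaluation and improvement mechanisms in its next flagship model.

The paper "Inference-Time Scaling for Generalist Reward Modeling" is on arXiv.

The post "DeepSeek Signals Next-Gen R2 Model, Unveils Novel Approach to Scaling Inference with SPCT" first appeared on Synced.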
