Amazon Science · May 27, 2026, 00:17 · about 7 min read

Diverse reasoning paths teach LLMs to make better decisions

#LLM #Reasoning #Supervised Fine-Tuning #Reinforcement Learning #Amazon Science

TL;DR

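Amazon Science has announced set-supervised fine-tuning (SSFT), which trains LLMs on a set of diverse reasoning paths rather than a single correct trace, together with a new reinforcement learning method, improving benchmark accuracy by 5% to 7%.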

AI in-depth analysis · May 27, 2026, 10:02


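Key points

Learning diverse reasoning paths

Instead of conventional supervised fine-tuning (SFT) on a single correct reasoning trace, the authors propose set-supervised fine-tuning (SSFT), which trains on a set of distinct reasoning paths for the same problem.

Introducing global forking tokens

Dedicated tokens, <think1> through <think6>, prompt the model to adopt different reasoning modes and strategies.

Preventing mode collapse and optimizing mode selection

Naive fine-tuning is prone to mode collapse, in which different tokens produce the same output. SSFT counters this, and a reinforcement learning method, global forking policy optimization (GFPO), then teaches the model to select a reasoning strategy suited to the context.

Demonstrated performance gains

On standard benchmarks, single-shot accuracy improved by 5% to 7%, indicating that the ability to select a reasoning mode translates directly into end-to-end performance.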

Key quotes

Can we expand the limits of LLMs' reasoning capacities by training them on diverse reasoning traces for each question?

Naïve post-training strategies such as SFT can lead to mode collapse, where different reasoning tokens produce nearly identical behaviors.

SSFT models it as a set of complete solution paths, which arrive at the same answer through different strategies.


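Impact analysis

This work suggests a shift in how LLM reasoning is trained: rather than pursuing a single correct trace, incorporating diverse thought processes into training may improve generality and accuracy at the same time. In particular, its concrete remedy for mode collapse, together with the complementary reinforcement learning approach, is likely to be adopted as a standard technique for fine-tuning and reasoning optimization in future large language models.

Editor's comment

The work reframes reasoning improvement as teaching not just the correct answer but diverse thought processes, which is likely to be an important guide for future model development.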

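Large language models (LLMs) are pretrained on huge volumes of unlabeled data, but afterward they are typically post-trained on specific tasks such as following instructions, avoiding harmful outputs, reasoning, or justifying the outputs they generate. Parallel reasoning, in which multiple diverse reasoning paths are generated and compared for the same problem, is emerging as a key tool for understanding the limits of LLMs' reasoning capability. It also underpins LLM testing techniques such as self-consistency, where multiple reasoning paths are aggregated to improve accuracy. LLMs are generally optimized for reasoning through supervised fine-tuning (SFT), in which each training example is labeled with a single, human-verified reasoning trace. Given the usefulness of parallel reasoning for evaluation, a question naturally arises: can we expand the limits of LLMs' reasoning capacities by training them on diverse reasoning traces for each question? In a paper we presented at this year's International Conference on Learning Representations (ICLR), we propose a method for doing just that, one that avoids some previously identified pitfalls of parallel reasoning.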
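
As a rough illustration of the self-consistency idea mentioned above, the sketch below aggregates the final answers of several sampled reasoning paths by majority vote. The function name `self_consistency` and the sample answers are hypothetical, and majority voting is assumed here as the aggregation rule; the article does not specify one.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most common final answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common keeps first-seen order among ties, so a tie goes to
    # the answer that was sampled earliest.
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers extracted from five sampled reasoning paths.
sampled = ["42", "41", "42", "42", "17"]
best, agreement = self_consistency(sampled)
print(best, agreement)
```

The agreement ratio is only a convenience for inspection; it is not a calibrated confidence.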

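To prompt a single LLM to adopt different reasoning strategies, we introduce a set of global forking tokens (such as <think1> through <think6> in the figure) in the post-training phase, each intended to elicit a distinct reasoning mode. These tokens enable the model to generate diverse, high-quality reasoning paths for the same problem. However, naive post-training strategies such as SFT can lead to mode collapse, where different reasoning tokens produce nearly identical behaviors. To address this, we propose set-supervised fine-tuning (SSFT), a simple and principled training approach that enables models to learn multiple distinct reasoning strategies from diverse supervision. Instead of representing reasoning with a single trace, SSFT models it as a set of complete solution paths that arrive at the same answer through different strategies.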

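To further teach the model which reasoning strategy to adopt in which context, we introduce a reinforcement learning paradigm we call global forking policy optimization (GFPO). Between these two techniques, we observe gains of 5% to 7% in single-shot accuracy on standard benchmarks, indicating that improved reasoning-mode selection translates directly into better end-to-end performance.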

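In practice, multiple reasoning traces for the same question can be obtained by prompting multiple teacher models, sampling alternative reasoning paths from a single model, or aggregating solutions from heterogeneous sources. SSFT pairs each such trace with a dedicated forking token (e.g., <think1> through <think6>), where each token indicates a different reasoning mode. During training, a bipartite matching step assigns traces to tokens for each question, encouraging the model to learn distinct behaviors rather than collapsing to a single pattern. The training objective sums the next-token prediction (NTP) losses across all matched pairs, evaluating each reasoning trace conditioned on its assigned control token. As a result, each forking token specializes in a distinct reasoning strategy, and the model produces more diverse solutions, as measured by pass@k (the probability that at least one of k generated answers is correct), while maintaining strong single-shot accuracy (pass@1).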
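
The matching step and summed objective described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the matching cost is the NTP loss of each trace under each fork token, the loss values in `cost` are made up, and the matching is found by brute force, which is only practical for a handful of tokens.

```python
from itertools import permutations


def match_traces_to_tokens(cost):
    """Brute-force minimum-cost bipartite matching.

    cost[i][j] is the (assumed) NTP loss of trace i conditioned on fork
    token j. Each trace gets a different token. Returns the token index
    chosen for each trace and the summed loss over matched pairs.
    Requires at most as many traces as tokens.
    """
    n_traces, n_tokens = len(cost), len(cost[0])
    best_assign, best_total = None, float("inf")
    for perm in permutations(range(n_tokens), n_traces):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_assign, best_total = perm, total
    return list(best_assign), best_total


# Hypothetical losses for 3 traces (rows) under 3 fork tokens (columns).
cost = [
    [0.9, 0.5, 1.5],
    [1.1, 0.6, 1.8],
    [1.2, 0.7, 0.3],
]

# Picking the cheapest token per trace independently reuses token 1.
greedy = [min(range(len(row)), key=row.__getitem__) for row in cost]

# The matching forces distinct tokens; its total is the training loss.
assignment, total = match_traces_to_tokens(cost)
print(greedy, assignment, round(total, 2))
```

In this toy example the independent per-trace choice sends two traces to the same token, while the one-to-one matching keeps every token tied to a different trace. For larger problems a Hungarian-algorithm solver such as SciPy's `linear_sum_assignment` would be the usual replacement for the brute-force loop.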

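While supervised training encourages the model to learn diverse reasoning strategies, it does not explicitly teach the model which strategy to use for a given question. Choosing the right reasoning mode is inherently a decision problem, making it a natural fit for reinforcement learning. We address this with global forking policy optimization (GFPO), a lightweight reinforcement learning approach that learns to select the most effective reasoning mode for each input. For a given question x, the model samples a global forking token from a distribution over control tokens (the <think i>s). The model then produces an answer conditioned on the sampled token, and the output is verified to obtain a reward signal (e.g., correct or incorrect). These rewards are converted into advantages, which are used to update the policy over forking tokens. Importantly, the generated reasoning traces are treated as rollouts: their gradients are detached, and they are used only for computing rewards, not for direct optimization. By focusing optimization on the forking-token distribution, GFPO avoids the complexity of token-level reinforcement learning while still capturing the key decision, which is selecting the right reasoning mode upfront. This makes training both efficient and stable, while directly improving end-to-end performance.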
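
The loop described above (sample a fork token, roll out, verify, convert rewards to advantages, update only the fork-token policy) can be sketched as a toy policy-gradient update. Everything specific here is an assumption made for illustration: the policy is a bare logit vector rather than an LLM conditioned on the question, `toy_reward` is a hypothetical stand-in for generation plus verification, and advantages are rewards minus the batch mean, which the article does not specify.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def gfpo_step(logits, reward_fn, rng, n_samples=8, lr=0.5):
    """One REINFORCE-style update of the fork-token logits only."""
    probs = softmax(logits)
    tokens = rng.choices(range(len(logits)), weights=probs, k=n_samples)
    # Rollout and verification: treated as a black box, no gradient.
    rewards = [reward_fn(t) for t in tokens]
    baseline = sum(rewards) / n_samples  # assumed: mean-reward baseline
    grad = [0.0] * len(logits)
    for t, r in zip(tokens, rewards):
        adv = r - baseline
        # d log softmax(logits)[t] / d logits[j] = 1[j == t] - probs[j]
        for j in range(len(logits)):
            grad[j] += adv * ((1.0 if j == t else 0.0) - probs[j]) / n_samples
    return [z + lr * g for z, g in zip(logits, grad)]


def toy_reward(token):
    """Hypothetical verifier: only fork token index 2 solves the question."""
    return 1.0 if token == 2 else 0.0


rng = random.Random(0)
logits = [0.0] * 6  # one logit per fork token, <think1> through <think6>
for _ in range(200):
    logits = gfpo_step(logits, toy_reward, rng)
probs = softmax(logits)
print([round(p, 3) for p in probs])
```

Because only the six logits receive gradient, each update is cheap compared with token-level reinforcement learning over the whole reasoning trace, which is the efficiency argument the paragraph above makes.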

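Together, SSFT and GFPO enable models to both learn diverse reasoning strategies and select the right one at inference time.

We evaluate SSFT+GFPO on both reasoning and coding benchmarks along two axes: (i) accuracy and (ii) diversity of reasoning. Across all settings, SSFT+GFPO consistently outperforms standard pipelines such as SFT+GRPO.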

AIME 2025 (Pass@1): 58.80% (+6.84 vs. SFT+GRPO)

AIME 2024 (Pass@1): 64.22% (+5.37 vs. SFT+GRPO)

LiveCodeBench-v5 (Pass@1): 52.07% (+4.94 vs. SFT+GRPO)

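Beyond accuracy, a key goal of SSFT is to address mode collapse. SSFT explicitly encourages specialization, allowing different tokens to represent distinct reasoning strategies. This leads to two important effects. First, each global forking token consistently triggers a distinct reasoning pattern. Second, this diversity improves pass@k without compromising pass@1. This contrasts with temperature-based sampling, where increasing diversity typically comes at the cost of accuracy.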
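
For readers who want to compute the two metrics compared above, the sketch below uses the combinatorial pass@k estimator that is common in code-generation evaluation: given n sampled answers of which c are correct, it returns the probability that a random subset of k contains at least one correct answer. The article does not say which estimator the authors used, so treat this choice and the sample counts as assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the samples contains no correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples for one question, 3 of them correct.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
print(round(p1, 4), round(p5, 4))
```

With k = 1 the estimator reduces to the fraction of correct samples, c / n, which is why pass@1 is read as single-shot accuracy.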

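In a qualitative example on a representative problem from the AIME 2025 benchmark, a challenging math reasoning dataset, the same question is solved using multiple qualitatively distinct strategies, such as algebraic manipulation, geometric reasoning, and case-based analysis, depending on the selected global forking token.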

