Hugging Face Blog·2026年1月27日 10:53·約3分

GPT-OSSのエージェンシック強化学習トレーニングの実現：実践的振り返り

#GPT-OSS #Agentic RL #Reinforcement Learning #Open Source LLMs #Verl

TL;DR

Hugging Face は、GPT-OSS モデルを単なる静的応答生成から、環境との対話を通じて意思決定を行う「エージェント型 RL」の基盤として活用可能にするための技術的検証と実践的なアプローチを公開した。

AI深層分析2026年5月2日 05:03

重要/ 5段階

深度40%

キーポイント

Agentic RL の定義と従来手法との違い

従来の単一ターンやオフライン学習とは異なり、エージェント型 RL は環境との直接的な対話を通じて多段階の意思決定プロセスを最適化するアプローチである。

GPT-OSS の実用性検証と課題

GPT-OSS は OpenAI の最新モデルと比較される性能を持つが、ツール呼び出しや多ステップ作業を含むエージェント型 RL への適用は未検証であり、本記事でその道筋を示す。

LinkedIn における実装の背景

LinkedIn は不完全な情報下での推論や構造化サービスとの対話が必要となるプロフェッショナル向けエージェントを構築しており、この文脈で Agentic RL の重要性が強調されている。

Verl フレームワークの活用

実験では verl をトレーニングフレームワークとして採用し、GRPO や PPO などのアルゴリズムを用いてエージェントの行動を反復的に最適化するループを実装している。

GPT-OSS RL 訓練の初期課題

Harmony チャットテンプレートの導入により、トレーニングフレームワークとの整合性が保たれておらず、KL 発散やエントロピーの爆発が発生した。

ReTool を用いたアジェンティックコーディング検証

モデルがコードコンパイラツールと対話しながら数学問題を解決する ReTool タスクを用い、推論ロジックの強化と実行結果によるフィードバックループを検証した。

MoE モデルにおける PPO の重要性サンプリング比率の不整合

オンポリシー学習において重要度サンプリング比が厳密に 1 であるべきという前提に対し、MoE モデルの実装で非ゼロのクリップ値が発生する不具合を発見し修正した。

影響分析・編集コメントを表示

影響分析

この記事は、オープンソースモデル（GPT-OSS）が単なるチャットボットを超えて、複雑なタスクを自律的に実行する「エージェント」として機能するための重要な技術的ブレイクスルーを示唆しています。特に、Verl フレームワークを活用した実践的なアプローチの提示は、開発者が既存のオープンソースモデルを即座に高度な自律システムへ転用する際の指針となり、業界全体の AI エージェント化のスピードを加速させる可能性があります。

編集コメント

GPT-OSS のようなオープンソースモデルが、単なる言語理解の枠を超えて自律的な意思決定を行うエージェントとして実用化される道筋を明確にした重要な記事です。Verl などのフレームワークとの組み合わせにより、研究段階の技術が実際のビジネス現場で即座に活用可能になる可能性を示しています。

図8. 左: シーケンス並列化なしの推論。右: シーケンス並列化ありの推論。アテンション層の前後に追加の全対全通信が実行され、シーケンスが並列ワーカー間で分割されることで、アテンション計算のピークメモリ使用量がシーケンス並列化度に比例して削減される。

シーケンス並列化は、シーケンス次元に沿ってスケールし、GPUあたりのアクティベーション使用量を削減する。すべてのシーケンスからの入力トークンは、パディングトークンを除去して単一の連続したリストにパックされ、異なるシーケンスに属するトークンを区別するために位置IDが使用される。この設計は、FlashAttentionの可変長サポートを自然に活用する。シーケンス並列化において、アテンション層以外の層には位置間の依存関係がないため、各GPUが完全なシーケンスの断片を保持する必要はなく、これらの層に対して追加の通信は不要である。

しかし、アテンション層は、アテンション重みを正しく計算するために、同じシーケンスに属するすべてのトークンが同じGPU上に存在することを必要とする。この制約を満たすために、アテンションヘッドレベルで分割を行い、シーケンス要素を収集するための全対全通信が実行される。この設計により、アテンション計算自体の中での通信（これは非常に高コストになりうる）を回避する。アテンション層の後、単一の全対全通信が出力を元のシーケンス並列レイアウトに再分配し、その後、残りの非アテンション層はさらなる同期なしで進行できる。

GPT-OSSバックボーンモデルでエージェンシックRLトレーニングを可能にした我々の取り組みは、オープンソースLLMで高度な能力を実現するには、緻密で深掘りしたエンジニアリングが必要であることを示す実践的回顧録となった。

我々は、特に以下の点で、GPT-OSSをエージェンシックアプリケーションに対して実用的なモデルへと変革する貢献を行った:

PPOの安定化: MoEアーキテクチャの非決定性によって引き起こされるログ確率の不一致を上書きし、方策上の完全性を回復する修正を貢献した（図2）。

アテンションシンクサポートの実現: アテンションシンクの逆伝播をFlashAttention v3に実装・統合することに成功し、以前は不安定性と収束の遅さを引き起こしていたトレーニング-推論の不一致を修正した（図5、6、7）。

メモリ効率のスケーリング: MoEマテリアライゼーションプロセスのパッチ適用や、新しいアテンションシンクサポートとのシーケンス並列化の統合など、重要なメモリ最適化を導入し、マルチステップエージェントに不可欠な長いコンテキストウィンドウでのトレーニングを可能にした（図8）。

これらのエンジニアリング努力は、GPT-OSSが、次世代の知的でマルチステップな意思決定エージェントを構築するためのスケーラブルで高性能なバックボーンであることを実証している。

謝辞

Deepak Agarwal、Bee-Chung Chen、Animesh Singh、Gungor Polatkan、Balaji Krishnapuram、Jitendra Agarwalのリーダーシップサポートに感謝する。

Feng, Jiazhan, et al. Retool: Reinforcement Learning for Strategic Tool Use in LLMs. arXiv preprint arXiv:2504.11536 (2025).

Xiao, Guangxuan, et al. Efficient Streaming Language Models with Attention Sinks. arXiv preprint arXiv:2309.17453 (2023).

When Speed Kills Stability: Demystifying RL Collapse from the Training–Inference Mismatch. https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda

原文を表示

Back to Articles Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective

Upvote 59

Agentic reinforcement learning (RL) extends traditional LLM training by optimizing not just a single-turn response, but an entire decision-making process learned through direct interaction with an environment during training. Unlike traditional single-turn reinforcement learning or offline preference-based methods that rely on static datasets, agentic RL trains policies by actively collecting on-policy data as the agent plans actions, invokes tools, observes outcomes, and adapts its behavior over multi-step trajectories in either simulated or real environments. This interaction-driven optimization assigns credit across long-horizon decisions, where intermediate choices such as query reformulation, tool selection, and execution order directly influence downstream success. Training follows an iterative closed loop in which the agent interacts with the environment to collect rollout trajectories, computes rewards over these trajectories, updates the policy based on observed outcomes, and then uses the updated policy to drive the next round of interaction and data collection such as GRPO or PPO algorithms..

LinkedIn is an AI-first company that's built agents to help professionals be more successful. In this setting, models must reason over incomplete information, interact with structured services, and adapt to evolving user intent across multiple steps rather than produce a single static response. These capabilities are especially critical for agents that support the goals of recruiters, job and knowledge seekers, and learners end users, such as retrieving information, refining queries, coordinating tools, and executing multi-step workflows. By learning robust decision policies through interaction, agentic RL provides a principled foundation for building scalable, reliable, and adaptable AI systems through end-to-end optimization.

The GPT-OSS model has shown comparable performance to OpenAI o3-mini and o4-mini [ref], but its suitability for agentic reinforcement learning training has not yet been validated. Most recent work focuses on fine-tuning without tool calling, such as: Fine-tuning with gpt-oss and Hugging Face Transformers and unsloth tutorial: how to fine-tune gpt-oss. This blog explores the journey to unlock agentic RL training for GPT-OSS as a potential backbone model for agentic applications.

In our experiments, we use verl as our training framework since it is one of the most popular adopted frameworks in the open source community. We use gsm8k, Retool task, verifiable instruction following task, which are commonly used in RL training. We focus on presenting experimental results for the GPT-OSS-20B model, and our attention-sink fix also works for GPT-OSS-120B. The Qwen-2.5-32B model is additionally used to benchmark standard metric trends during RL training.

Challenges of GPT-OSS RL Training

verl has been an OSS framework used by the team, and the team has previously collaborated and contributed to it to help democratize agentic reinforcement learning training. With the introduction of the new Harmony chat template in GPT-OSS, the first step is to ensure that the training framework fully supports the updated message format and conversation semantics required by Harmony. This step helps rollout generation, trajectory construction, and tool parsing remain consistent and correct under the new template.

The team uses ReTool as a representative example to verify code correctness. ReTool is an agentic coding task in which the model is asked to solve a math problem with the assistance of a code compiler tool. This setup allows the model to focus on core reasoning and algorithmic logic, while delegating the actual arithmetic and execution to the tool. During an episode, the model interacts with the code tool multiple times, using execution results as feedback to refine its solution. At the end of the trajectory, the model produces a final answer, on which the reward is computed.

During the initial training runs, we observed exploding KL divergence and entropy, along with non-increasing rewards, indicating underlying issues in the GPT-OSS training setup, as shown in Figure 1.

Average Gradient Norm

Figure 1. Left: Qwen32b has significantly higher rewards compared to GPT-OSS 20B; Right: The gradient norm exploded as training progressed.

A Practical Debugging Journey in verl: Restoring PPO On-Policy Integrity

Restoring PPO On-Policy Integrity: A Fix for MoE Log-Probability Mismatch

Figure 2. Non-zero importance sampling clip value even for on-policy training.

We focus on on-policy methods because they provide greater stability and more reliable convergence. The foundation of pure on-policy Proximal Policy Optimization (PPO) mandates that the importance sampling ratio must be exactly 1. The mathematical definition of the importance ratio is:

ratio=π(a∣s)πold(a∣s) \text{ratio} = \frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)} ratio=πold(a∣s)π(a∣s)

This requirement ensures that the policy update is executed only on the data generated by the current policy π(a | s) = πold(a | s), preventing unintended clipping.

We have observed the non-zero clipping value in our ReTool training, as shown in Figure 2, stemming from a mismatch between the two log-probabilities:

Current log-probability log_prob

Old log-probability old_log_prob

Root Cause: The Dual Forward Pass and MoE Architecture

Prior to verl 0.3.0, the implementation relied on two separate forward passes (one to compute the current log_prob

In a Mixture of Experts (MoE) architecture like GPT-OSS, the gating network routes the input to different experts. Due to implementation factors (e.g., subtle floating-point differences or explicit stochasticity), the expert routing can differ slightly between the two passes. Readers who are interested can further read Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers.

This difference in routing leads to:

log⁡(π(a∣s))≠log⁡(πold(a∣s)) \log(\pi(a \mid s)) \neq \log(\pi_{\text{old}}(a \mid s)) log(π(a∣s))=log(πold(a∣s))

The resulting ratio deviates from 1, falsely triggering the PPO clip and violating the core on-policy assumption.

Solution: Enforcing Ratio = 1 via Log-Probability Substitution

The fix resolves the issue by logically overriding the flawed computation when the environment is known to be on-policy (i.e., when the minibatch size equals the global batch size):

if on_policy: old_log_prob = log_prob.detach() else: old_log_prob = model_inputs["old_log_probs"]

By setting old_log_prob

Correcting Training–Inference Mismatch

Although fixing the log-probability mismatch reduced the importance-sampling clip ratio to zero, gradient norms continued to explode and rewards failed to improve. To isolate the issue, we simplified training to GSM8K, a single-step task without agentic tool use. The same instability persisted, as shown in the green curves in Figure 3, indicating a fundamental issue in basic RL training with GPT-OSS under verl.

We hypothesize that training–inference mismatch could be a potential cause: discrepancies between inference-time execution—where engines such as vLLM and SGLang aggressively optimize for throughput—and training-time execution under FSDP, which prioritizes numerical precision and stability, can effectively turn otherwise on-policy RL into off-policy optimization.

This blog details why such mismatches lead to unstable gradients and non-improving rewards. Figure 3 compares training runs with and without rollout correction (see this verl blog for details). After applying rollout correction, training dynamics improve significantly, with gradient norms remaining stable rather than exploding.

However, as shown in the left plot of Figure 4, the reward increases only modestly, and convergence on the simple GSM8K task remains substantially slower compared to smaller dense model variants.

Average Entropy

Average Gradient Norm

Average KL Loss

Figure 3. Gradient norm behavior under different training configurations. Green: Training without rollout correction, exhibiting unstable gradients. Red: Training with the attention layer frozen to isolate the issue to the attention mechanism, resulting in partial stabilization. Blue: Training with rollout correction enabled (sequence-level importance sampling), yielding stable gradient norms.

Max Log-Perplexity Difference

Figure 4. Left: Reward improvement on GSM8K remains slow even after applying rollout correction, with performance comparable to runs where the attention layer is frozen during training. Right: A substantial log-ppl mismatch is observed between the inference engine (SGLang with Triton kernels supporting attention-sink forward passes) and the training stack (FSDP with FlashAttention-v2), indicating a large training–inference inconsistency.

To further isolate the root cause, we freeze the attention layers during training and observe reward dynamics similar to those of runs without freezing (blue curve vs yellow curve in Figure 4). This indicates that learning is primarily driven by the MoE layers, while the attention mechanism contributes less effectively than expected. In addition, we observe a substantial token-level probability mismatch between the inference engine and the distributed training stack which are using different attention kernels. Together, these observations motivate a deeper investigation into the attention mechanism.

Attention Sink Support in FlashAttentionV3

Attention sinks used in GPT-OSS are learnable scalar parameters (one per attention head) that act as "virtual tokens" in the softmax computation. They allow the model to allocate attention mass to a learned sink rather than forcing all attention to content tokens, which has been shown to improve attention stability in streaming inference and training with sliding-window attention.

After a deeper investigation, we identified several major issues:

verl hard-codes FlashAttention v2 in fsdp_worker

The attention sink backward pass is not supported in FlashAttention v2 and v3, so it does not work as expected even when FlashAttention v3 is enabled.

Since the forward pass has not yet been merged into the original FlashAttention v3 repository, we leveraged the forward pass from the vLLM FlashAttention fork (PR #75) and implemented the backward pass to compute the sink gradient.

Standard Attention

scores = QK^T / sqrt(d) # [B, H, N_q, N_k] probs = softmax(scores, dim=-1) # Σ_j P_ij = 1 output = probs @ V # [B, H, N_q, d_v]

Attention with Sinks (GPT-OSS)

scores = QK^T / sqrt(d) # [B, H, N_q, N_k] combined = concat([scores, sink_param], dim=-1) # [B, H, N_q, N_k+1] probs = softmax(combined, dim=-1) # Σ_j P_ij + P_sink = 1 probs_content = probs[..., :-1] # Drop sink component output = probs_content @ V # [B, H, N_q, d_v]

Key difference: The sink participates in softmax normalization but doesn't contribute to the output.

Mathematical Formulation

The attention weight for content token j in row i is defined as:

Pij=exp⁡(Sij)∑j′=1Nkexp⁡(Sij′)+exp⁡(Sh) P_{ij} = \frac{\exp(S_{ij})} {\sum_{j'=1}^{N_k} \exp(S_{ij'}) + \exp(S_h)} Pij=∑j′=1Nkexp(Sij′)+exp(Sh)exp(Sij)

Sij = Qi Kj⊤ / √d are the attention scores

Pij are the attention weights for the content tokens

Sh is the learnable sink parameter for head h

Sink Probability:

The sink probability is computed but not used in the output:

Pi,h=exp⁡(Sh)∑j′=1Nkexp⁡(Sij′)+exp⁡(Sh) P_{i,h} = \frac{\exp(S_h)} {\sum_{j'=1}^{N_k} \exp(S_{ij'}) + \exp(S_h)} Pi,h=∑j′=1Nkexp(Sij′)+exp(Sh)exp(Sh)

The gradient of the loss L with respect to the sink parameter Sh is:

∂L∂Sh=−∑iPi,h(∂L∂Si,h−∑j∈{1,…,Nk}Pij∂L∂Sij) \frac{\partial L}{\partial S_h} = - \sum_i P_{i,h} \left( \frac{\partial L}{\partial S_{i,h}} - \sum_{j \in \{1,\ldots,N_k\}} P_{ij} \frac{\partial L}{\partial S_{ij}} \right) ∂Sh∂L=−i∑Pi,h∂Si,h∂L−j∈{1,…,Nk}∑Pij∂Sij∂L

Pi,h is the sink attention probability for row i

∂L/∂Sij is the gradient with respect to the attention scores, including the sink

Simplified Gradient:

Since the sink is computed but not used in the output, its gradient ∂L/∂Si,h = 0.

Therefore, the backward equation simplifies to:

∂L∂Sh=−∑iPi,h(∑j∈{1,…,Nk}Pij∂L∂Sij) \frac{\partial L}{\partial S_h} = - \sum_i P_{i,h} \left( \sum_{j \in \{1,\ldots,N_k\}} P_{ij} \frac{\partial L}{\partial S_{ij}} \right) ∂Sh∂L=−i∑Pi,hj∈{1,…,Nk}∑Pij∂Sij∂L

The forward pass was adapted from vLLM's FlashAttention fork, and we implemented the backward pass to compute gradients for the sink parameters. The implementation will be released following the internal review process.

After applying the fix in FlashAttention v3, we observe substantially faster convergence for GPT-OSS-20B across a range of reinforcement learning tasks. These include single-turn RL on math reasoning (GSM8K — red curve in Figure 5), instruction following (VerifyIf, evaluated on an out-of-domain multi-if benchmark — Figure 6), and multi-turn agentic RL with tool use (ReTool — Figure 7).

Across all settings, training becomes stable and exhibits steady reward improvement.

Figure 5.. Single Turn GSM8K, the red curve converges much faster than the rest without the fix

Average Entropy

Average Gradient Norm

Figure 6. On verifiable instruction following the task, the run without the fix collapsed (blue), and the run with fix showed steady reward improvement.

Average Gradient Norm

Validation Accuracy

Figure 7. On the Retool task, the run with fix showed steady reward improvement and no gradient exploding (fa2 is the flash attention 2 without the fix while fa3 is the flash attention 3 with the fix). After the fix, the validation accuracy score goes up now.

Memory-Efficient Training

Mitigating FSDP Memory Blow-Ups Caused by Repeated MoE Expert Materialization

One issue we consistently encountered was excessive memory allocation during the FSDP forward pass, which led to repeated out-of-memory (OOM) failures when training GPT-OSS-20B bf16 models on 16 H200 nodes (max response length: 16k, prompt length: 8k). This behavior is highly unexpected for a 20B-parameter MoE model.

2025-11-27T11:15:27.927Z [36m(TaskRunner pid=32081)[0m File "/home/jobuser/.local/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 123, in forward 2025-11-27T11:15:27.927Z [36m(TaskRunner pid=32081)[0m hidden_states = hidden_states.repeat(num_experts, 1) 2025-11-27T11:15:27.927Z [36m(TaskRunner pid=32081)[0m torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 180.00 GiB. GPU 0 has a total capacity of 139.72 GiB of which 110.94 GiB is free. Process 685851 has 24.88 GiB memory in use. Process 692458 has 3.87 GiB memory in use. Of the allocated memory 23.28 GiB is allocated by PyTorch, and 84.43 MiB is reserved by PyTorch but unallocated.

We identified the issue as originating from two different implementations of the MoE forward path in Hugging Face Transformers. This issue has also been reported by other users: https://github.com/huggingface/transformers/issues/40073; When verl computes log-probabilities under FSDP, the inference forward path is triggered. In the current Hugging Face implementation, this path duplicates hidden states for all experts and performs batched matrix multiplication, materializing extremely large tensors in GPU memory. By contrast, the training forward path uses a for-loop to process each expert sequentially and then combines the results. While slower, this approach is significantly more memory efficient.

@GPUMemoryLogger(role="dp actor", logger=logger) def compute_log_prob(self, data: DataProto, calculate_entropy=False) -> torch.Tensor: """ .... """ # set to eval, this essentially prioritizes parallelism at the cost of memory efficiency self.actor_module.eval() ...

We patched the Hugging Face implementation to use a more memory-efficient execution path, avoiding repeated materialization of experts.

Sequence Parallel with Flash Attention V3

Agentic RL requires the agent to interact with the environment over multiple steps while maintaining an ever-expanding context. Observations and environment feedback from each step are appended to the context and used as input for subsequent decision-making, which introduces significant challenges for memory efficiency and scalability during training.

Under fully sharded data parallelism (FSDP), model parameters, optimizer states, and gradients are sharded across the entire world size (i.e., all GPUs in the training cluster). Each GPU stores and updates only its assigned parameter shards, while rollout data are replicated across all GPUs—meaning every GPU processes the full agent interaction history for each rollout.

During the forward pass, when computation reaches a layer whose parameters are not locally available, an all_gather

FSDP provides model-level scaling by sharding model parameters, gradients, and optimizer states across GPUs. Sequence parallelism (or context parallelism) further reduces per-GPU memory consumption by partitioning the input sequence across devices, thereby lowering the peak activation memory on each GPU.

As the number of sequence-parallel dimensions increases, the maximum activation memory per GPU correspondingly decreases. We have implemented sequence parallelism to be attention-sink-aware and compatible with FlashAttention v3 (Figure 8, right).

Figure 8. Left: Inference without sequence parallelism. Right: Inference with sequence parallelism, where additional all-to-all communication is performed before and after the attention layer. This partitions the sequence across parallel workers and reduces the peak memory footprint of attention computation by a factor proportional to the sequence-parallelism degree.

Sequence parallelism scales along the sequence dimension to reduce the per-GPU activation footprint. Input tokens from all sequences are packed into a single contiguous list by removing padding tokens, while position IDs are used to distinguish tokens belonging to different sequences. This design naturally benefits from FlashAttention’s variable-length support. For sequence parallelism, layers other than the attention layer do not have inter-position dependencies; therefore, they do not require each GPU to hold a complete sequence shard, and no additional communication is needed for these layers.

The attention layer, however, requires all tokens belonging to the same sequence to be present on the same GPU in order to compute attention weights correctly. To satisfy this constraint, an all-to-all communication is performed to gather sequence elements, with the split performed at the attention-head level. This design avoids communication within the attention computation itself, which would otherwise be prohibitively expensive. After the attention layer, a single all-to-all communication redistributes the outputs back to their original sequence-parallel layout, after which the remaining non-attention layers can proceed without further synchronization.

Our journey to enable agentic RL training for the GPT-OSS backbone model was a practical retrospective, highlighting that unlocking advanced capabilities in open-source LLMs requires meticulous, deep-dive engineering.

We made contributions that transformed the viability of GPT-OSS for agentic applications, specifically by:

Stabilizing PPO: We contributed a fix to restore on-policy integrity, overriding the log-probability mismatch caused by the MoE architecture’s non-determinism (Figure 2).

Enabling Attention Sink Support: We successfully implemented and integrated the attention sink backward pass into FlashAttention v3, correcting the catastrophic training–inference mismatch that had previously caused instability and slow convergence (Figures 5, 6, and 7).

Scaling Memory Efficiency: We introduced crucial memory optimizations, including patching the MoE materialization process and integrating sequence parallelism with the new attention sink support, enabling training with the long context windows essential for multi-step agents (Figure 8).

These engineering efforts validate GPT-OSS as a scalable and high-performance backbone for building the next generation of intelligent, multi-step decision-making agents.

Acknowledgments

Thanks to Deepak Agarwal, Bee-Chung Chen, Animesh Singh, Gungor Polatkan, Balaji Krishnapuram, and Jitendra Agarwal for their leadership support.

Feng, Jiazhan, et al. Retool: Reinforcement Learning for Strategic Tool Use in LLMs. arXiv preprint arXiv:2504.11536 (2025).

Xiao, Guangxuan, et al. Efficient Streaming Language Models with Attention Sinks. arXiv preprint arXiv:2309.17453 (2023).

この記事をシェア

MarkTechPost重要度42026年7月4日 15:32

NVIDIA AI が自己改善型ロボットフレームワーク「ASPIRE」を発表、LIBERO-Pro の長期タスクでゼロショット成功率 31% を達成

AWS Machine Learning Blog重要度42026年7月3日 02:50

Amazon SageMaker AIにおける多ターン強化学習のベストプラクティス

Apple Machine Learning重要度42026年7月2日 09:00

強化学習微調整済み視覚言語モデルの頑健性と思考連鎖の一貫性に関する研究

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Hugging Face Blog·2026年1月27日 10:53·約3分

GPT-OSSのエージェンシック強化学習トレーニングの実現：実践的振り返り

#GPT-OSS #Agentic RL #Reinforcement Learning #Open Source LLMs #Verl

TL;DR

AI深層分析2026年5月2日 05:03

重要/ 5段階

深度40%

キーポイント

Agentic RL の定義と従来手法との違い

GPT-OSS の実用性検証と課題

LinkedIn における実装の背景

Verl フレームワークの活用

GPT-OSS RL 訓練の初期課題

Harmony チャットテンプレートの導入により、トレーニングフレームワークとの整合性が保たれておらず、KL 発散やエントロピーの爆発が発生した。

ReTool を用いたアジェンティックコーディング検証

MoE モデルにおける PPO の重要性サンプリング比率の不整合

影響分析・編集コメントを表示

影響分析

編集コメント

我々は、特に以下の点で、GPT-OSSをエージェンシックアプリケーションに対して実用的なモデルへと変革する貢献を行った:

PPOの安定化: MoEアーキテクチャの非決定性によって引き起こされるログ確率の不一致を上書きし、方策上の完全性を回復する修正を貢献した（図2）。

謝辞

Deepak Agarwal、Bee-Chung Chen、Animesh Singh、Gungor Polatkan、Balaji Krishnapuram、Jitendra Agarwalのリーダーシップサポートに感謝する。

Feng, Jiazhan, et al. Retool: Reinforcement Learning for Strategic Tool Use in LLMs. arXiv preprint arXiv:2504.11536 (2025).

Xiao, Guangxuan, et al. Efficient Streaming Language Models with Attention Sinks. arXiv preprint arXiv:2309.17453 (2023).

原文を表示

Back to Articles Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective

Upvote 59

Challenges of GPT-OSS RL Training

During the initial training runs, we observed exploding KL divergence and entropy, along with non-increasing rewards, indicating underlying issues in the GPT-OSS training setup, as shown in Figure 1.

Average Gradient Norm

Figure 1. Left: Qwen32b has significantly higher rewards compared to GPT-OSS 20B; Right: The gradient norm exploded as training progressed.

A Practical Debugging Journey in verl: Restoring PPO On-Policy Integrity

Restoring PPO On-Policy Integrity: A Fix for MoE Log-Probability Mismatch

Figure 2. Non-zero importance sampling clip value even for on-policy training.

ratio=π(a∣s)πold(a∣s) \text{ratio} = \frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)} ratio=πold(a∣s)π(a∣s)

This requirement ensures that the policy update is executed only on the data generated by the current policy π(a | s) = πold(a | s), preventing unintended clipping.

We have observed the non-zero clipping value in our ReTool training, as shown in Figure 2, stemming from a mismatch between the two log-probabilities:

Current log-probability log_prob

Old log-probability old_log_prob

Root Cause: The Dual Forward Pass and MoE Architecture

Prior to verl 0.3.0, the implementation relied on two separate forward passes (one to compute the current log_prob

This difference in routing leads to:

log⁡(π(a∣s))≠log⁡(πold(a∣s)) \log(\pi(a \mid s)) \neq \log(\pi_{\text{old}}(a \mid s)) log(π(a∣s))=log(πold(a∣s))

The resulting ratio deviates from 1, falsely triggering the PPO clip and violating the core on-policy assumption.

Solution: Enforcing Ratio = 1 via Log-Probability Substitution

The fix resolves the issue by logically overriding the flawed computation when the environment is known to be on-policy (i.e., when the minibatch size equals the global batch size):

if on_policy: old_log_prob = log_prob.detach() else: old_log_prob = model_inputs["old_log_probs"]

By setting old_log_prob

Correcting Training–Inference Mismatch

However, as shown in the left plot of Figure 4, the reward increases only modestly, and convergence on the simple GSM8K task remains substantially slower compared to smaller dense model variants.

Average Entropy

Average Gradient Norm

Average KL Loss

Max Log-Perplexity Difference

Attention Sink Support in FlashAttentionV3

After a deeper investigation, we identified several major issues:

verl hard-codes FlashAttention v2 in fsdp_worker

The attention sink backward pass is not supported in FlashAttention v2 and v3, so it does not work as expected even when FlashAttention v3 is enabled.

Standard Attention

scores = QK^T / sqrt(d) # [B, H, N_q, N_k] probs = softmax(scores, dim=-1) # Σ_j P_ij = 1 output = probs @ V # [B, H, N_q, d_v]

Attention with Sinks (GPT-OSS)

Key difference: The sink participates in softmax normalization but doesn't contribute to the output.

Mathematical Formulation

The attention weight for content token j in row i is defined as:

Pij=exp⁡(Sij)∑j′=1Nkexp⁡(Sij′)+exp⁡(Sh) P_{ij} = \frac{\exp(S_{ij})} {\sum_{j'=1}^{N_k} \exp(S_{ij'}) + \exp(S_h)} Pij=∑j′=1Nkexp(Sij′)+exp(Sh)exp(Sij)

Sij = Qi Kj⊤ / √d are the attention scores

Pij are the attention weights for the content tokens

Sh is the learnable sink parameter for head h

Sink Probability:

The sink probability is computed but not used in the output:

Pi,h=exp⁡(Sh)∑j′=1Nkexp⁡(Sij′)+exp⁡(Sh) P_{i,h} = \frac{\exp(S_h)} {\sum_{j'=1}^{N_k} \exp(S_{ij'}) + \exp(S_h)} Pi,h=∑j′=1Nkexp(Sij′)+exp(Sh)exp(Sh)

The gradient of the loss L with respect to the sink parameter Sh is:

Pi,h is the sink attention probability for row i

∂L/∂Sij is the gradient with respect to the attention scores, including the sink

Simplified Gradient:

Since the sink is computed but not used in the output, its gradient ∂L/∂Si,h = 0.

Therefore, the backward equation simplifies to:

Across all settings, training becomes stable and exhibits steady reward improvement.

Figure 5.. Single Turn GSM8K, the red curve converges much faster than the rest without the fix

Average Entropy

Average Gradient Norm

Figure 6. On verifiable instruction following the task, the run without the fix collapsed (blue), and the run with fix showed steady reward improvement.

Average Gradient Norm

Validation Accuracy

Memory-Efficient Training

Mitigating FSDP Memory Blow-Ups Caused by Repeated MoE Expert Materialization

We patched the Hugging Face implementation to use a more memory-efficient execution path, avoiding repeated materialization of experts.

Sequence Parallel with Flash Attention V3

During the forward pass, when computation reaches a layer whose parameters are not locally available, an all_gather

We made contributions that transformed the viability of GPT-OSS for agentic applications, specifically by:

Stabilizing PPO: We contributed a fix to restore on-policy integrity, overriding the log-probability mismatch caused by the MoE architecture’s non-determinism (Figure 2).

These engineering efforts validate GPT-OSS as a scalable and high-performance backbone for building the next generation of intelligent, multi-step decision-making agents.

Acknowledgments

Thanks to Deepak Agarwal, Bee-Chung Chen, Animesh Singh, Gungor Polatkan, Balaji Krishnapuram, and Jitendra Agarwal for their leadership support.

Feng, Jiazhan, et al. Retool: Reinforcement Learning for Strategic Tool Use in LLMs. arXiv preprint arXiv:2504.11536 (2025).

Xiao, Guangxuan, et al. Efficient Streaming Language Models with Attention Sinks. arXiv preprint arXiv:2309.17453 (2023).