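Hugging Face Blog · March 9, 2026 09:00 · approx. 17 min read

Ulysses Sequence Parallelism: Training with Million-Token Contexts

#LongContextTraining #DistributedTraining #LargeLanguageModels #HuggingFace #GPUOptimization #Attention

TL;DR

Ulysses Sequence Parallelism, developed by Snowflake AI Research, distributes the attention computation across multiple GPUs, which makes it possible to train large language models on contexts on the order of one million tokens. It has been integrated into Hugging Face Accelerate, the Transformers Trainer, and TRL's SFTTrainer.

AI in-depth analysis · March 10, 2026 04:44

Importance (5-point scale)

Depth: 40%

Key points

The challenge of long-context training

Compute and memory for transformer attention grow with the square of the sequence length, so training on contexts beyond tens of thousands of tokens is difficult on a single GPU.

How Ulysses works

Attention head parallelism spreads the attention computation over multiple GPUs and addresses the long-sequence memory problem that conventional data parallelism cannot solve.

Integration into the Hugging Face ecosystem

Ulysses is integrated into Accelerate, the Transformers Trainer, and TRL's SFTTrainer, so researchers and developers can set up long-context training with little effort.

Applications of long-context training

Long contexts are needed for realistic AI tasks such as understanding books and legal documents, analyzing large codebases, multi-step reasoning, and retrieval-augmented generation.

Impact analysis

This technique relaxes a fundamental constraint on applying LLMs to complex real-world tasks such as understanding an entire book or a large codebase. Because it is integrated into the Hugging Face ecosystem, it is accessible to a broad range of researchers and developers and is likely to accelerate the development and adoption of long-context models.

Editorial comment

The value of this news lies not only in the underlying idea but also in its practical availability on major open-source platforms. It is an important milestone toward making long-context AI practical.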

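Ulysses Sequence Parallelism: Training with Million-Token Contexts

Authors: Kashif Rasul (kashif), Stas Bekman (stas)

Training large language models on long sequences has become essential for building capable AI systems. As models are increasingly used for document analysis, code understanding, complex reasoning, and RAG (Retrieval-Augmented Generation) workloads, the need to process sequences of hundreds of thousands, or even millions, of tokens has grown dramatically. To put this in perspective, an average book is roughly 250k tokens, so training on multi-document contexts or book-length inputs means handling sequences well beyond what fits on a single GPU. Training with such long contexts presents significant memory challenges: the attention computation scales quadratically with sequence length and quickly exceeds GPU memory for contexts beyond tens of thousands of tokens.

Ulysses Sequence Parallelism (part of the Arctic Long Sequence Training (ALST) protocol from Snowflake AI Research) provides an elegant solution by distributing the attention computation across multiple GPUs through attention head parallelism. This post explains how Ulysses works and how it has been integrated across the Hugging Face ecosystem, from Accelerate to the Transformers Trainer and TRL's SFTTrainer.

The Challenge of Long Sequence Training

How Ulysses Works

Integration with Accelerate

Integration with Transformers Trainer

Integration with TRL's SFTTrainer

Comparing Ulysses and Ring Attention

The Challenge of Long Sequence Training

The attention mechanism in transformers scales quadratically with sequence length. For a sequence of length n, standard attention requires O(n^2) FLOPs and O(n^2) memory to compute and store the attention score matrix. Optimized implementations such as FlashAttention reduce the memory to O(n) by tiling the computation and never materializing the full attention matrix, but the O(n^2) compute remains. For very long sequences (32k+ tokens), training still pushes the limits of single-GPU memory even with FlashAttention.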
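To make the quadratic growth concrete, here is a back-of-the-envelope sketch. It assumes the score matrix is materialized naively in bf16 (2 bytes per element) for a single head and a single sequence. The function name and the chosen lengths are illustrative assumptions, not numbers taken from the benchmarks in this post.

```python
def attention_scores_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to materialize one n x n score matrix (one head, one sequence)."""
    return seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    # Assumed example lengths; each 4x increase in n costs 16x the memory.
    for n in (8_192, 32_768, 131_072, 1_048_576):
        gib = attention_scores_bytes(n) / 2**30
        print(f"n={n:>9,}: {gib:,.3f} GiB per head")
```

Multiplying by the number of heads and layers shows why a naive implementation is out of reach long before one million tokens, and why even O(n)-memory kernels eventually need the sequence to be split across devices.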

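Consider these scenarios where long-context training is essential:

Document understanding: processing entire books, legal documents, or research papers

Code analysis: understanding large codebases with multiple interconnected files

Reasoning tasks: models that "think" step by step may generate thousands of tokens during inference

Retrieval-augmented generation: incorporating many retrieved passages into the context

Traditional data parallelism does not help here, because each GPU still has to process the full sequence inside the attention block. We need a way to split the sequence itself across multiple devices.

How Ulysses Works

Ulysses Sequence Parallelism (SP), introduced in the DeepSpeed Ulysses paper, takes a clever approach: in addition to splitting on the sequence dimension, it also partitions the attention heads across GPUs.

Here is how it works:

Sequence sharding: the input sequence is split along the sequence dimension across P GPUs. Each GPU i holds tokens [i⋅n/P, (i+1)⋅n/P).

QKV projection: each GPU computes the query, key, and value projections for its local sequence chunk.

All-to-all communication: an all-to-all collective redistributes the data so that each GPU holds all sequence positions after the projections, but only for a subset of attention heads.

Local attention: each GPU computes attention for its assigned heads using a standard attention implementation (FlashAttention or SDPA).

All-to-all communication: a second all-to-all reverses the redistribution and returns to the sequence-sharded layout.

Output projection: each GPU computes the output projection for its local sequence chunk.

The key insight is that attention heads are independent: each head can be computed separately. By trading sequence locality for head locality, Ulysses enables efficient parallelization with relatively low communication overhead.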
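The steps above can be sketched as a single-process simulation. This is only an illustration under stated assumptions: the "GPUs" are list entries, the two all-to-all exchanges are plain list reshuffles, attention is unmasked, and every function name (`shard_sequence`, `seq_to_head`, `head_to_seq`, `ulysses_attention`) is hypothetical rather than part of DeepSpeed or Accelerate. The QKV and output projections are omitted because they act per token and stay local.

```python
import math
import random


def attention(q, k, v):
    """Unmasked single-head softmax attention; q, k, v are lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / norm
                    for c in range(len(v[0]))])
    return out


def shard_sequence(x, P):
    """x[head][token] -> shards[rank][head] = that rank's contiguous token chunk."""
    chunk = len(x[0]) // P
    return [[head[r * chunk:(r + 1) * chunk] for head in x] for r in range(P)]


def seq_to_head(shards, P):
    """First all-to-all: each rank ends up with all tokens for its own heads."""
    per_rank = len(shards[0]) // P
    return [{h: [tok for src in range(P) for tok in shards[src][h]]
             for h in range(r * per_rank, (r + 1) * per_rank)}
            for r in range(P)]


def head_to_seq(by_head, P, n):
    """Second all-to-all: each rank gets its token chunk back for every head."""
    chunk = n // P
    merged = {h: rows for rank in by_head for h, rows in rank.items()}
    return [[merged[h][r * chunk:(r + 1) * chunk] for h in sorted(merged)]
            for r in range(P)]


def ulysses_attention(q, k, v, P):
    num_heads, n = len(q), len(q[0])
    assert num_heads % P == 0 and n % P == 0
    qh, kh, vh = (seq_to_head(shard_sequence(x, P), P) for x in (q, k, v))
    local = [{h: attention(qh[r][h], kh[r][h], vh[r][h]) for h in qh[r]}
             for r in range(P)]
    return head_to_seq(local, P, n)


def random_heads(rng, num_heads, n, d_head):
    return [[[rng.uniform(-1, 1) for _ in range(d_head)] for _ in range(n)]
            for _ in range(num_heads)]


def demo(num_heads=4, n=8, d_head=2, P=2, seed=0):
    rng = random.Random(seed)
    q, k, v = (random_heads(rng, num_heads, n, d_head) for _ in range(3))
    reference = [attention(q[h], k[h], v[h]) for h in range(num_heads)]
    sharded = ulysses_attention(q, k, v, P)
    chunk = n // P
    return all(sharded[r][h] == reference[h][r * chunk:(r + 1) * chunk]
               for r in range(P) for h in range(num_heads))


if __name__ == "__main__":
    print("sharded result matches full attention:", demo())
```

Because each simulated rank sees the complete sequence for its own heads, the local attention call is an ordinary full-sequence attention, which is why existing kernels can be reused unchanged in that step.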

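Communication Complexity

Ulysses requires two all-to-all operations per attention layer, with a total communication volume of O(n⋅d/P) per GPU, where:

n is the sequence length

d is the hidden dimension

P is the parallelism degree

Ring Attention communicates O(n⋅d) per GPU, a factor of P more, through P-1 sequential point-to-point transfers around the ring. Ulysses also benefits from lower latency because an all-to-all can exploit full bisection bandwidth in a single collective step, whereas Ring Attention serializes over P-1 hops.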
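As a rough illustration of the two asymptotic terms, the sketch below plugs example sizes into n⋅d/P and n⋅d. It ignores constant factors (number of tensors exchanged, bytes per element, number of layers), and the function names and example sizes are assumptions made for this sketch.

```python
def ulysses_elements_per_gpu(n: int, d: int, p: int) -> int:
    """Order-of-magnitude term n*d/P for the all-to-all exchanges."""
    return n * d // p


def ring_elements_per_gpu(n: int, d: int) -> int:
    """Order-of-magnitude term n*d for the ring's point-to-point transfers."""
    return n * d


if __name__ == "__main__":
    n, d = 131_072, 4_096  # assumed example sizes
    for p in (2, 4, 8):
        u = ulysses_elements_per_gpu(n, d, p)
        r = ring_elements_per_gpu(n, d)
        print(f"P={p}: Ulysses ~{u:,} elements, Ring ~{r:,} elements, ratio {r // u}x")
```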

Integration with Accelerate

Accelerate provides the foundation for Ulysses sequence parallelism through its ParallelismConfig.

    from accelerate import Accelerator
    from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig

    parallelism_config = ParallelismConfig(
        sp_backend="deepspeed",
        sp_size=4,  # Split across 4 GPUs
        dp_shard_size=1,  # Must satisfy: dp_replicate × dp_shard × sp_size = num_processes
        sp_handler=DeepSpeedSequenceParallelConfig(
            sp_seq_length=None,  # None for variable-length sequences
            sp_seq_length_is_variable=True,
            sp_attn_implementation="flash_attention_2",  # or "sdpa"
        ),
    )

    accelerator = Accelerator(parallelism_config=parallelism_config)

sp_size: number of GPUs for sequence parallelism

sp_backend: must be "deepspeed"

sp_seq_length_is_variable: whether the sequence length varies between batches

sp_attn_implementation: "flash_attention_2", "flash_attention_3", or "sdpa"

Using the Accelerator

When you call accelerator.prepare():

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # This registers the model with Ulysses and wraps the dataloader
    # (dataloader is your own torch DataLoader, created beforehand)
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Accelerate then:

Registers the model with DeepSpeed's UlyssesSPAttentionHF

Wraps the dataloader with UlyssesSPDataLoaderAdapter

Automatically injects shift_labels

Loss Aggregation

With Ulysses, each GPU computes the loss on a different part of the sequence. The losses must be aggregated properly, weighted by the number of valid tokens per rank. If you're using the Transformers Trainer, this is handled for you (see the Trainer section); in a custom Accelerate training loop you aggregate the loss yourself:

    import torch
    import torch.distributed.nn.functional

    # `loss` and `batch` come from the forward pass of the current step
    sp_size = parallelism_config.sp_size
    if sp_size > 1:
        from deepspeed.utils import groups

        sp_group = groups._get_sequence_parallel_group()

        # Gather losses and token counts from all SP ranks
        losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
        good_tokens = (batch["shift_labels"] != -100).view(-1).sum()
        good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)

        # Weighted aggregation
        total_loss = sum(
            losses_per_rank[i] * good_tokens_per_rank[i]
            for i in range(sp_size)
            if good_tokens_per_rank[i] > 0
        )
        loss = total_loss / max(sum(good_tokens_per_rank), 1)

    accelerator.backward(loss)

The weighted loss aggregation ensures correct gradients when tokens are unevenly distributed across ranks (for example, when some ranks contain only padding or masked-out prompt tokens).
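A small pure-Python sketch of why the weighting matters. The per-rank numbers are made up for illustration and `aggregate_sp_loss` is a hypothetical helper that mirrors the arithmetic of the weighted sum, without any distributed communication.

```python
import math


def aggregate_sp_loss(losses_per_rank, valid_tokens_per_rank):
    """Token-weighted mean of per-rank mean losses; ranks with no valid tokens are skipped."""
    total = sum(loss * tokens
                for loss, tokens in zip(losses_per_rank, valid_tokens_per_rank)
                if tokens > 0)
    return total / max(sum(valid_tokens_per_rank), 1)


if __name__ == "__main__":
    # Assumed example: rank 2 holds only masked tokens, so its mean loss is NaN.
    losses = [2.0, 4.0, float("nan"), 3.0]
    tokens = [100, 300, 0, 600]

    weighted = aggregate_sp_loss(losses, tokens)
    naive = sum(l for l in losses if not math.isnan(l)) / 3
    print(f"token-weighted loss: {weighted:.3f}")
    print(f"naive mean of rank losses: {naive:.3f}")
```

The token-weighted value equals the mean loss over all valid tokens of the full sequence, which is what a single GPU would have computed. The naive per-rank mean would over-weight ranks that hold few valid tokens.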

Both Ulysses and Ring Attention use position_ids.

disable_in_eval

Integration with Transformers Trainer

The Transformers Trainer takes the configuration through TrainingArguments.parallelism_config. Just pass the same parallelism_config to TrainingArguments:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        parallelism_config=parallelism_config,  # same ParallelismConfig as above
        per_device_train_batch_size=1,
    )

What the Trainer Handles Automatically

Dataloader wrapping: after model preparation, the Trainer wraps the dataloader with UlyssesSPDataLoaderAdapter.

Loss computation: compute_loss goes through _deepspeed_sp_compute_loss, which covers:

Gathering losses across SP ranks

Computing valid token counts per rank

Weighted loss aggregation

Batch size calculation: the effective data-parallel world size accounts for SP:

    dp_world_size = world_size // sp_size

Dataloader length adjustment: training step calculations are adjusted for SP's effect on the iteration count.

Use an accelerate config file or command-line arguments:

    accelerate launch \
        --config_file deepspeed_ulysses.yaml \
        train.py \
        --per_device_train_batch_size 1

Integration with TRL SFTTrainer

TRL's SFTTrainer accepts the same parallelism_config through SFTConfig:

    from trl import SFTConfig, SFTTrainer
    from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig

    parallelism_config = ParallelismConfig(
        sp_backend="deepspeed",
        sp_size=2,
        dp_shard_size=2,  # 2D parallelism: SP × DP = 4 GPUs
        sp_handler=DeepSpeedSequenceParallelConfig(
            sp_seq_length_is_variable=True,
            sp_attn_implementation="flash_attention_2",
        ),
    )

    training_args = SFTConfig(
        ...,
        parallelism_config=parallelism_config,
        max_length=32768,
        pad_to_multiple_of=2,  # Must equal sp_size
        per_device_train_batch_size=1,
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
    )
    trainer.train()

Key SFTConfig Parameters for Ulysses

pad_to_multiple_of: must equal sp_size

max_length: global sequence length (before splitting across GPUs)

packing: works well with SP; packing reduces padding waste, especially for variable-length sequences

Accelerate Config File

Create alst_ulysses_4gpu.yaml:

    compute_environment: LOCAL_MACHINE
    distributed_type: DEEPSPEED
    mixed_precision: bf16
    num_processes: 4
    deepspeed_config:
      zero_stage: 3
      seq_parallel_communication_data_type: bf16
    parallelism_config:
      parallelism_config_sp_size: 2
      parallelism_config_sp_backend: deepspeed
      parallelism_config_dp_shard_size: 2
      parallelism_config_sp_seq_length_is_variable: true
      parallelism_config_sp_attn_implementation: flash_attention_2

Complete Training Command

    accelerate launch --config_file alst_ulysses_4gpu.yaml \
        trl/scripts/sft.py \
        --model_name_or_path meta-llama/Llama-3.1-8B \
        --dataset_name trl-lib/Capybara \
        --max_length 32768 \
        --packing \
        --pad_to_multiple_of 2 \
        --per_device_train_batch_size 1

Shift Labels Handling

The SFTTrainer automatically handles pre-shifted labels when Ulysses is enabled:

    # When using SP, labels are pre-shifted by the dataloader adapter
    # The trainer detects this and uses shift_labels directly
    labels = inputs["labels"] if "shift_labels" not in inputs else None

    # Loss computation uses the pre-shifted labels
    if "shift_labels" in inputs:
        shift_logits = outputs.logits.contiguous()
        shift_labels = inputs["shift_labels"]
    else:
        shift_logits = outputs.logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()

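Comparing Ulysses and Ring Attention

Both Ulysses and Ring Attention enable long-context training, but they have different characteristics. The comparison below is Ulysses (DeepSpeed) versus Ring Attention (FSDP2):

Parallelism method: attention head partitioning (Ulysses) | ring-based KV exchange (Ring Attention)

Attention support: FlashAttention 2/3, SDPA (Ulysses)

P2P ring communication (Ring Attention)

Comm volume per GPU: O(total_seq x hidden / sp_size) (Ulysses) | O(total_seq x hidden) (Ring Attention)

Sequence divisibility

Num head constraint: num_heads >= sp_size (Ulysses)

When to Choose Ulysses vs Ring Attention

Since switching between the two only requires changing the accelerate config, we recommend trying both and comparing performance and memory usage on your specific setup. The main constraint is that Ulysses requires num_heads >= sp_size.

Sequence Length Divisibility

Always ensure your sequence length is divisible by sp_size.

    training_args = SFTConfig(
        pad_to_multiple_of=4,  # For sp_size=4
        max_length=32768,  # Must be divisible by 4
    )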
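A minimal sketch of the divisibility arithmetic, assuming padding is added at the end of the sequence. `padded_length` and `tokens_per_rank` are hypothetical helpers for illustration, not TRL or Accelerate functions.

```python
def padded_length(seq_len: int, sp_size: int) -> int:
    """Smallest length >= seq_len that is divisible by sp_size."""
    return -(-seq_len // sp_size) * sp_size


def tokens_per_rank(seq_len: int, sp_size: int) -> int:
    """Tokens each SP rank holds once the sequence is padded and sharded evenly."""
    return padded_length(seq_len, sp_size) // sp_size


if __name__ == "__main__":
    for seq_len in (32_768, 1_001, 7):
        total = padded_length(seq_len, 4)
        print(f"seq_len={seq_len}: padded to {total}, {tokens_per_rank(seq_len, 4)} tokens per rank")
```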

Use Flash Attention

Flash Attention 2 provides cleaner output and better performance than SDPA:

    parallelism_config = ParallelismConfig(
        sp_handler=DeepSpeedSequenceParallelConfig(
            sp_attn_implementation="flash_attention_2",
        ),
    )

Use Flash Attention 3 on Hopper, and look out for the Flash Attention 4 release for Blackwell (FA2 on Blackwell is quite slow).

Combine with DeepSpeed ZeRO

For very large models, combine Ulysses with ZeRO Stage 3:

    deepspeed_config:
      zero_stage: 3
      offload_optimizer:
        device: cpu

If the model is huge, you can offload the parameters as well by adding this to the above:

      offload_param:
        device: cpu

Use a memory fragmentation-friendly PyTorch allocator

This environment variable allows for a longer sequence length:

    export PYTORCH_ALLOC_CONF=expandable_segments:True

2D Parallelism Configuration

Balance SP and DP for your GPU count. The use cases are:

Balanced throughput and sequence length

Maximum sequence length

Higher throughput with moderate sequence length

Longer sequences with moderate throughput

Remember: dp_replicate_size × dp_shard_size × sp_size = num_processes
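The constraint can be checked mechanically. The sketch below enumerates the SP/DP splits that satisfy it for a given GPU count, treating dp_size as the product dp_replicate_size × dp_shard_size and also applying the num_heads >= sp_size condition mentioned earlier. `sp_dp_layouts` is a hypothetical helper and the GPU and head counts are assumed example values.

```python
def sp_dp_layouts(num_processes: int, num_heads: int):
    """(sp_size, dp_size) pairs with sp_size * dp_size == num_processes and num_heads >= sp_size."""
    return [(sp, num_processes // sp)
            for sp in range(1, num_processes + 1)
            if num_processes % sp == 0 and sp <= num_heads]


if __name__ == "__main__":
    for sp, dp in sp_dp_layouts(num_processes=8, num_heads=32):
        print(f"sp_size={sp}, dp_size={dp}")
```

Larger sp_size values favor longer sequences, while larger dp_size values favor throughput, which is the trade-off the use cases above describe.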

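Liger-Kernel

If your desired model architecture is supported by Liger-Kernel, it is fully compatible with Ulysses SP and can be enabled with a single flag:

    training_args = SFTConfig(
        use_liger_kernel=True,
    )

The main memory saving comes from FusedLinearCrossEntropy.

Additionally, you can enable TiledMLP.

FusedLinearCrossEntropy

Token Distribution Across Ranks

You don't need to balance tokens across SP ranks manually. The loss aggregation code handles uneven distributions gracefully, including ranks with zero valid tokens. With random batching over a reasonably sized dataset, the distribution evens out statistically over training.

To quantify the benefits of Ulysses SP, we trained Qwen3-4B on the Gutenberg English streaming dataset using TRL's SFTTrainer. All experiments ran on H100 80GB GPUs with DeepSpeed ZeRO-3, CPU optimizer offloading, gradient checkpointing, and flash-attn2 as the attention backend.

The benchmark runs in the table above use the same global batch size (8 micro-batches), cosine learning-rate schedule, and seed, so those benchmark loss curves are directly comparable.

Loss Curve Matching Diagnostics (4 GPU)

To verify SP-vs-DP loss equivalence, we ran controlled 4-GPU A/B experiments with identical seed, model, optimizer, learning-rate schedule, and data order.

Methodology for Fair DP vs SP Comparison

Compared setups:

DP=4, SP=1, GAS=1

DP=1, SP=4, GAS=4

For a fair comparison, GAS (gradient accumulation steps) is adjusted so that the tokens per optimizer step match, because Ulysses SP splits each sequence across the SP ranks:

DP tokens/step: dp_world_size * micro_batch * seq_len * GAS = 4 * B * L * 1

SP tokens/step: dp_world_size * micro_batch * (L/SP) * GAS * SP_ranks = 1 * B * (L/4) * 4 * 4 = 4 * B * L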
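The two formulas can be checked with a few lines of arithmetic. The micro-batch size B=1 and sequence length L=8192 are assumed example values, and `tokens_per_optimizer_step` is a hypothetical helper that follows the expressions above term by term.

```python
def tokens_per_optimizer_step(dp_world_size, micro_batch, seq_len, gas, sp_size=1):
    """dp_world_size * micro_batch * (seq_len / sp_size) * GAS * number of SP ranks."""
    tokens_per_rank = micro_batch * (seq_len // sp_size)
    return dp_world_size * tokens_per_rank * gas * sp_size


if __name__ == "__main__":
    B, L = 1, 8_192  # assumed example values
    dp_run = tokens_per_optimizer_step(dp_world_size=4, micro_batch=B, seq_len=L, gas=1)
    sp_run = tokens_per_optimizer_step(dp_world_size=1, micro_batch=B, seq_len=L, gas=4, sp_size=4)
    print(f"DP=4, SP=1, GAS=1: {dp_run} tokens per optimizer step")
    print(f"DP=1, SP=4, GAS=4: {sp_run} tokens per optimizer step")
```

Without raising GAS to 4, the SP run would see only a quarter of the tokens per optimizer step, and the loss curves would not be comparable.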

Measured over 20 steps on 4 GPUs in controlled equivalence harnesses:

DP vs SP setting: DP=4, SP=1 vs DP=1, SP=4

Takeaway: under a matched token budget, SP and non-SP match on canonical token-normalized loss. The remaining difference lies in how the trainer reports the logged loss.

Memory Reduction

Baseline, no SP

Similar memory at same seq length

4x longer than DP baseline

8x longer than DP baseline

12x longer than DP baseline

Exceeds 80 GB limit

At 8K tokens, DP=4 and SP=4 use nearly the same memory per GPU (about 22 GB with ZeRO-3). The advantage of SP is that it enables scaling to much longer sequences: at 96K tokens (12x longer), peak memory is 66 GB, still within the H100's 80 GB capacity. At 128K, the model runs out of memory (OOM), which establishes the practical limit for this configuration. DP=4 without SP cannot scale beyond 8K for this model.

Baseline (1 GPU)

At the same sequence length (8K), SP=4 has throughput comparable to the single-GPU baseline, because the all-to-all communication overhead is minimal on NVLink-connected GPUs. The real benefit comes from longer sequences: as sequence length grows, the quadratic attention computation dominates over communication and other overheads, making each training step increasingly compute-efficient. Each step also processes proportionally more tokens, so throughput scales with sequence length. At 64K, SP=4 processes 13,396 tokens/second, 3.7x the baseline.

These results use only 4 GPUs with SP=4. With 8 GPUs (SP=8), you can push to even longer sequences, up to 256K+ tokens, or use 2D parallelism (SP=4, DP=2) to combine long-context training with data-parallel throughput.

HF Accelerate: deepspeed>=0.18.1 accelerate>=1.12

HF Trainer: deepspeed>=0.18.1 accelerate>=1.12 transformers>=5.0

HF TRL: deepspeed>=0.18.1 accelerate>=1.12 transformers>=5.0 trl>=0.18.0

Use flash_attention_2

flash_attention_3

flash_attention_4

Accelerate: Context Parallelism Guide

TRL: Distributing Training

DeepSpeed Sequence Parallelism

Accelerate ALST Example

TRL Accelerate Configs

Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Related Blog Posts

Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training

Understanding Ulysses and Ring Attention

Enabling Long-Context Training with Sequence Parallelism in Axolotl

Figure: Ulysses splits input sequences along the sequence dimension and uses all-to-all communication to exchange key-value pairs, enabling each GPU to compute a subset of attention heads. (Source: Snowflake Engineering Blog)

Figure: On Gutenberg text (20 steps), canonical loss matches within logging precision between DP=4, SP=1, GAS=1 and DP=1, SP=4, GAS=4.

Figure: SP=4 reduces per-GPU memory by 3.3x at the same sequence length, enabling training at up to 96K tokens on 4× H100 80GB. At 128K, the model OOMs.

Figure: Longer sequences with SP process dramatically more tokens per second. SP=4 at 64K achieves 3.7x the throughput of the baseline.


この記事をシェア

Hugging Face Blog2026年7月1日 09:00

Hugging Face と Cerebras が Gemma 4 をリアルタイム音声 AI に導入

Hugging Face Blog重要度42026年7月1日 03:32

ScarfBench：エンタープライズ向け Java フレームワーク移行における AI エージェントのベンチマーク

Hugging Face Blog重要度42026年6月30日 23:39

専門化が不可避である理由

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Hugging Face Blog·2026年3月9日 09:00·約17分

ユリシーズ・シーケンス並列処理：100万トークンのコンテキストでのトレーニング

#長文脈訓練 #分散学習 #大規模言語モデル #Hugging Face #GPU最適化 #注意機構

TL;DR

AI深層分析2026年3月10日 04:44

重要/ 5段階

深度40%

キーポイント

長文脈訓練の課題

トランスフォーマーの注意機構は系列長の二乗で計算量・メモリ使用量が増大し、単一GPUでは数万トークンを超える文脈の訓練が困難である。

Ulyssesの仕組み

注意ヘッド並列化を通じて注意計算を複数GPUに分散させ、従来のデータ並列化では解決できなかった長系列処理のメモリ課題を解決する。

Hugging Faceエコシステムへの統合

Accelerate、Transformers Trainer、TRLのSFTTrainerにUlyssesが統合され、研究者・開発者が容易に長文脈訓練を実装できるようになった。

長文脈訓練の応用分野

書籍・法律文書の理解、大規模コードベース分析、複数ステップの推論タスク、検索拡張生成など、現実的なAIタスクで必要とされる。

影響分析・編集コメントを表示

影響分析

編集コメント

記事一覧に戻る

Ulysses シーケンス並列処理：百万トークンのコンテキストでのトレーニング

アップvote - Kashif Rasul kashif フォロー Stas Bekman stas フォロー

長期シーケンストレーニングの課題

Ulysses の仕組み

Accelerate との統合

Transformers Trainer との統合

TRL の SFTTrainer との統合

Ulysses と Ring Attention の比較

長期シーケンストレーニングの課題

長期コンテキストトレーニングが不可欠となるこれらのシナリオを考えてみましょう：

ドキュメント理解：書籍全体、法的文書、または研究論文の処理

コード分析：複数の相互接続されたファイルを持つ大規模なコードベースの理解

推論タスク：ステップバイステップで「思考」するモデルは、推論中に数千のトークンを生成する可能性があります

検索拡張生成（Retrieval-Augmented Generation）：文脈に多数の検索結果を統合する

Ulysses の仕組み

動作の詳細は以下の通りです：

シーケンスのシャード化：入力シーケンスを P 個の GPU に沿ってシーケンス次元で分割します。各 GPU i はトークン [i⋅n/P, (i+1)⋅n/P) を保持します。

QKV プロジェクション：各 GPU は、ローカルのシーケンスチャンクに対するクエリ、キー、バリューのプロジェクションを計算します。

All-to-All 通信：もう一つの All-to-All 操作により再配布が逆転し、シーケンスシャード形式に戻ります。

出力プロジェクション：各 GPU はローカルのシーケンスチャンクに対する出力プロジェクションを計算します。

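Communication Complexity

Ulysses requires two all-to-all operations per attention layer, with a total communication volume of O(n⋅d/P) per GPU, where:

n is the sequence length

d is the hidden dimension

P is the parallelism degree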
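To get a feel for the O(n⋅d/P) scaling, here is a small back-of-the-envelope helper. It is a sketch under stated assumptions: the function name is hypothetical, and the multiplier of 4 tensors per layer (Q, K, V, and the attention output) is an assumption for plain multi-head attention, not a figure from the article. Only the n⋅d/P term itself comes from the text above.

```python
def ulysses_comm_elements_per_gpu(n, d, P, tensors_per_layer=4):
    """Rough count of elements one GPU exchanges per attention layer.

    Captures only the n*d/P scaling; constant factors are assumptions.
    """
    if n % P != 0:
        raise ValueError("sequence length n must be divisible by P")
    per_tensor = (n // P) * d  # each rank holds n/P tokens of width d
    return tensors_per_layer * per_tensor


# Doubling the parallelism degree halves the per-GPU volume.
volumes = {P: ulysses_comm_elements_per_gpu(n=131072, d=4096, P=P) for P in (2, 4, 8)}
```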

Integration with Accelerate

Accelerate provides the foundation for Ulysses sequence parallelism through its ParallelismConfig.

from accelerate import Accelerator
from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,  # Split across 4 GPUs
    dp_shard_size=1,  # Must satisfy: dp_replicate × dp_shard × sp_size = num_processes
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length=None,  # None for variable-length sequences
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",  # or "sdpa"
    ),
)

accelerator = Accelerator(parallelism_config=parallelism_config)

Number of GPUs for sequence parallelism

Must be "deepspeed"

sp_seq_length_is_variable

sp_attn_implementation

"flash_attention_2"

"flash_attention_3"

Using the Accelerator

When you call accelerator.prepare()

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# This registers the model with Ulysses and wraps the dataloader
# (`dataloader` is assumed to be a DataLoader created earlier)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Registers the model with DeepSpeed's UlyssesSPAttentionHF

Wraps the dataloader with UlyssesSPDataLoaderAdapter

Automatically injects shift_labels

Loss Aggregation

Both Ulysses and Ring Attention use position_ids

disable_in_eval

Integration with Transformers Trainer

The Transformers Trainer

TrainingArguments.parallelism_config

Just pass the same parallelism_config

TrainingArguments

from transformers import TrainingArguments

training_args = TrainingArguments(
    parallelism_config=parallelism_config,  # same ParallelismConfig as above
    per_device_train_batch_size=1,
)

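What the Trainer Handles Automatically

Dataloader Wrapping: After model preparation, the Trainer wraps the dataloader with UlyssesSPDataLoaderAdapter.

Loss Computation: The compute_loss function (_deepspeed_sp_compute_loss) is used.

Gathering losses across SP ranks

Computing valid token counts per rank

Weighted loss aggregation

Batch Size Calculation: The effective data parallel world size accounts for SP:

dp_world_size = world_size // sp_size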
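The consequence of this formula for the global batch size can be sketched as follows. This is an illustrative helper, not the Trainer's internal code: the function name and the way gradient accumulation is folded in are assumptions, and only the dp_world_size = world_size // sp_size relation comes from the text.

```python
def effective_global_batch(world_size, sp_size, per_device_batch, grad_accum=1):
    """Sequences consumed per optimizer step when SP ranks share one sample."""
    if world_size % sp_size != 0:
        raise ValueError("world_size must be divisible by sp_size")
    dp_world_size = world_size // sp_size  # SP ranks do not add data-parallel replicas
    return dp_world_size * per_device_batch * grad_accum


pure_dp = effective_global_batch(world_size=4, sp_size=1, per_device_batch=1)
pure_sp = effective_global_batch(world_size=4, sp_size=4, per_device_batch=1)
```

With 4 GPUs, pure data parallelism consumes 4 sequences per step while SP=4 consumes 1, so gradient accumulation has to make up the difference if the two runs are meant to be comparable.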

Dataloader Length Adjustment: Training step calculations are adjusted for SP's effect on iteration count.

Use an accelerate config file or command-line arguments:

accelerate launch \
    --config_file deepspeed_ulysses.yaml \
    train.py \
    --per_device_train_batch_size 1

Integration with TRL SFTTrainer

TRL's SFTTrainer

from trl import SFTConfig, SFTTrainer
from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=2,
    dp_shard_size=2,  # 2D parallelism: SP × DP = 4 GPUs
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",
    ),
)

training_args = SFTConfig(
    ...,
    parallelism_config=parallelism_config,
    max_length=32768,
    pad_to_multiple_of=2,  # Must equal sp_size
    per_device_train_batch_size=1,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

trainer.train()

Key SFTConfig Parameters for Ulysses

pad_to_multiple_of

Must equal sp_size

Global sequence length (before splitting across GPUs)

Works well with SP: packing reduces padding waste, especially for variable-length sequences

Accelerate Config File

Create alst_ulysses_4gpu.yaml

Complete Training Command

accelerate launch --config_file alst_ulysses_4gpu.yaml \
    trl/scripts/sft.py \
    --model_name_or_path meta-llama/Llama-3.1-8B \
    --dataset_name trl-lib/Capybara \
    --max_length 32768 \
    --packing \
    --pad_to_multiple_of 2 \
    --per_device_train_batch_size 1

Shift Labels Handling

The SFTTrainer automatically handles pre-shifted labels when Ulysses is enabled:

# When using SP, labels are pre-shifted by the dataloader adapter
# The trainer detects this and uses shift_labels directly
labels = inputs["labels"] if "shift_labels" not in inputs else None

# Loss computation uses the pre-shifted labels
if "shift_labels" in inputs:
    shift_logits = outputs.logits.contiguous()
    shift_labels = inputs["shift_labels"]
else:
    shift_logits = outputs.logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
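Why the labels have to be shifted before the sequence is split can be seen with a toy example. The sketch below is illustrative only: the function names are hypothetical, plain lists stand in for tensors, and padding the final position with -100 (the usual ignore index for cross-entropy in PyTorch) is an assumption about how the last target is filled, not a description of UlyssesSPDataLoaderAdapter.

```python
IGNORE_INDEX = -100  # conventional "ignore" label for cross-entropy


def shift_then_shard(labels, sp_size):
    """Shift on the full sequence first, then split across SP ranks."""
    shifted = labels[1:] + [IGNORE_INDEX]
    chunk = len(shifted) // sp_size
    return [shifted[r * chunk:(r + 1) * chunk] for r in range(sp_size)]


def shard_then_shift(labels, sp_size):
    """Naive order: split first, then shift inside each shard."""
    chunk = len(labels) // sp_size
    shards = [labels[r * chunk:(r + 1) * chunk] for r in range(sp_size)]
    return [shard[1:] + [IGNORE_INDEX] for shard in shards]


labels = [10, 11, 12, 13, 14, 15, 16, 17]
correct = shift_then_shard(labels, sp_size=2)
naive = shard_then_shift(labels, sp_size=2)
```

In the naive order, the last token on rank 0 loses its target (token 14 lives on rank 1), so one training signal per shard boundary is silently dropped. Shifting globally first keeps every next-token target.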

Comparing Ulysses and Ring Attention

Both Ulysses and Ring Attention enable long-context training, but they have different characteristics:

Ulysses (DeepSpeed)

Ring Attention (FSDP2)

Parallelism Method

Attention head partitioning

Ring-based KV exchange

Attention Support

FlashAttention 2/3, SDPA

P2P ring communication

Comm volume per GPU

O(total_seq x hidden / sp_size)

O(total_seq x hidden)

Sequence Divisibility

Num Head Constraint

num_heads >= sp_size

When to Choose Ulysses vs Ring Attention

Sequence Length Divisibility

Always ensure your sequence length is divisible by sp_size.

training_args = SFTConfig(
    pad_to_multiple_of=4,  # For sp_size=4
    max_length=32768,  # Must be divisible by 4
)
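The effect of padding to a multiple can be checked with a one-line rounding helper. This is a generic sketch with a hypothetical function name, not TRL's collator code; it only shows what length a sequence ends up with once it is padded so that it splits evenly across SP ranks.

```python
def padded_length(seq_len, multiple):
    """Smallest length >= seq_len that is divisible by `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((seq_len + multiple - 1) // multiple) * multiple


sp_size = 4
lengths = [padded_length(n, sp_size) for n in (1001, 4096, 32768)]
per_rank = [n // sp_size for n in lengths]  # tokens each SP rank receives
```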

Use Flash Attention

Flash Attention 2 provides cleaner output and better performance than SDPA:

parallelism_config = ParallelismConfig(
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_attn_implementation="flash_attention_2",
    ),
)

Combine with DeepSpeed ZeRO

For very large models, combine Ulysses with ZeRO Stage 3:

deepspeed_config:
  zero_stage: 3
  offload_optimizer:
    device: cpu

If the model is huge, you can offload the params as well by adding to the above:

  offload_param:
    device: cpu

Use memory fragmentation-friendly PyTorch allocator

This environment variable will allow for a longer sequence length:

export PYTORCH_ALLOC_CONF=expandable_segments:True

2D Parallelism Configuration

Balance SP and DP for your GPU count:

Balanced throughput and sequence length

Maximum sequence length

Higher throughput with moderate sequence length

Longer sequences with moderate throughput

Remember: dp_replicate_size × dp_shard_size × sp_size = num_processes
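One way to explore the options for a given GPU count is to enumerate every split that satisfies this product rule. The helper below is a sketch with hypothetical names: it merges dp_replicate_size and dp_shard_size into a single dp_size for brevity, and it applies the num_heads >= sp_size constraint from the comparison table. It does not consult Accelerate.

```python
def valid_layouts(num_processes, num_heads):
    """List (sp_size, dp_size) pairs with sp_size * dp_size == num_processes."""
    layouts = []
    for sp_size in range(1, num_processes + 1):
        if num_processes % sp_size != 0:
            continue  # product rule cannot be satisfied
        if sp_size > num_heads:
            continue  # Ulysses needs num_heads >= sp_size
        layouts.append({"sp_size": sp_size, "dp_size": num_processes // sp_size})
    return layouts


four_gpu = valid_layouts(num_processes=4, num_heads=32)
```

Larger sp_size values favor longer sequences, while larger dp_size values favor throughput at moderate sequence lengths.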

Liger-Kernel

If your desired model architecture is supported by Liger-Kernel, it is fully compatible with Ulysses SP and can be enabled with a single flag:

training_args = SFTConfig(
    use_liger_kernel=True,
)

The main memory saving comes from FusedLinearCrossEntropy.

Additionally, you can enable TiledMLP.

FusedLinearCrossEntropy

Token Distribution Across Ranks

Loss Curve Matching Diagnostics (4 GPU)

Methodology for Fair DP vs SP Comparison

Compared setups:

DP=4, SP=1, GAS=1

DP=1, SP=4, GAS=4

For a fair comparison, GAS (gradient accumulation steps) is adjusted, since Ulysses SP splits the sequence across SP ranks.

DP tokens/step: dp_world_size * micro_batch * seq_len * GAS = 4 * B * L * 1

SP tokens/step: dp_world_size * micro_batch * (L/SP) * GAS * SP_ranks = 1 * B * (L/4) * 4 * 4 = 4 * B * L
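The two token budgets can be verified numerically. The function below simply encodes the formulas above; its name and the example values of B and L are chosen for illustration.

```python
def tokens_per_step(dp_world_size, micro_batch, seq_len, gas, sp_size=1):
    """Tokens contributing to one optimizer step across all ranks."""
    per_rank_tokens = seq_len // sp_size  # each SP rank sees L/SP tokens
    return dp_world_size * micro_batch * per_rank_tokens * gas * sp_size


B, L = 1, 8192
dp_tokens = tokens_per_step(dp_world_size=4, micro_batch=B, seq_len=L, gas=1)
sp_tokens = tokens_per_step(dp_world_size=1, micro_batch=B, seq_len=L, gas=4, sp_size=4)
```

Both configurations process 4 * B * L tokens per optimizer step, which is what makes the loss curves comparable.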

Measured over 20 steps on 4 GPUs in controlled equivalence harnesses:

DP vs SP setting

DP=4, SP=1 vs DP=1, SP=4

Memory Reduction

Baseline, no SP

Similar memory at same seq length

4x longer than DP baseline

8x longer than DP baseline

12x longer than DP baseline

Exceeds 80 GB limit

Baseline (1 GPU)

HF Accelerate: deepspeed>=0.18.1 accelerate>=1.12

HF Trainer: deepspeed>=0.18.1 accelerate>=1.12 transformers>=5.0

HF TRL: deepspeed>=0.18.1 accelerate>=1.12 transformers>=5.0 trl>=0.18.0

Use flash_attention_2

flash_attention_3

flash_attention_4

Accelerate: Context Parallelism Guide

TRL: Distributing Training

DeepSpeed Sequence Parallelism

Accelerate ALST Example

TRL Accelerate Configs

Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Related Blog Posts

Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training

Understanding Ulysses and Ring Attention

Enabling Long-Context Training with Sequence Parallelism in Axolotl

On Gutenberg text (20 steps), the canonical loss for DP=4, SP=1, GAS=1 and DP=1, SP=4, GAS=4 matches within logging precision.


Ulysses Sequence Parallelism: Training with Million-Token Contexts

The Challenge of Long Sequence Training

How Ulysses Works

Integration with Accelerate

Integration with Transformers Trainer

Integration with TRL's SFTTrainer

Comparing Ulysses and Ring Attention

The Challenge of Long Sequence Training

Consider these scenarios where long-context training is essential:

Document understanding: Processing entire books, legal documents, or research papers

Code analysis: Understanding large codebases with multiple interconnected files

Reasoning tasks: Models that "think" step-by-step may generate thousands of tokens during inference

Retrieval-augmented generation: Incorporating many retrieved passages into the context

Traditional data parallelism doesn't help here—each GPU still needs to process the full sequence inside the attention block. We need a way to split the sequence itself across multiple devices.

How Ulysses Works

Here's how it works:

Sequence Sharding: The input sequence is split along the sequence dimension across P GPUs. Each GPU i holds tokens [i⋅n/P, (i+1)⋅n/P).

QKV Projection: Each GPU computes the query, key, and value projections for its local sequence chunk.

All-to-All Communication: An all-to-all collective operation redistributes the data so that each GPU holds all sequence positions after the projections, but only for a subset of attention heads.

Local Attention: Each GPU computes attention for its assigned heads using standard attention mechanisms (FlashAttention or SDPA).

All-to-All Communication: Another all-to-all operation reverses the redistribution, returning to sequence-sharded format.

Output Projection: Each GPU computes the output projection for its local sequence chunk.

Communication Complexity

Ulysses requires two all-to-all operations per attention layer, with total communication volume of O(n⋅d/P) per GPU, where:

n is the sequence length

d is the hidden dimension

P is the parallelism degree

Integration with Accelerate

Accelerate provides the foundation for Ulysses sequence parallelism through its ParallelismConfig

Number of GPUs for sequence parallelism

Must be "deepspeed"

sp_seq_length_is_variable

sp_attn_implementation

"flash_attention_2"

"flash_attention_3"

Using the Accelerator

When you call accelerator.prepare()

Registers the model with DeepSpeed's UlyssesSPAttentionHF

Wraps the dataloader with UlyssesSPDataLoaderAdapter

Automatically injects shift_labels

Loss Aggregation

The weighted loss aggregation ensures correct gradients when tokens are unevenly distributed across ranks (e.g., when some ranks contain only padding or masked out prompt tokens).
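A token-weighted average can be sketched in a few lines. This is a simplified illustration, not the Trainer's _deepspeed_sp_compute_loss: the function name is hypothetical, and plain Python lists stand in for values that would be gathered across SP ranks with a collective.

```python
def aggregate_loss(per_rank_mean_loss, per_rank_valid_tokens):
    """Combine per-rank mean losses, weighting each by its valid token count."""
    total_tokens = sum(per_rank_valid_tokens)
    if total_tokens == 0:
        return 0.0  # nothing to learn from in this step
    weighted_sum = sum(
        loss * count for loss, count in zip(per_rank_mean_loss, per_rank_valid_tokens)
    )
    return weighted_sum / total_tokens


# Rank 1 holds only padding or masked prompt tokens, so it has no valid targets.
losses = [2.0, 0.0]
valid_tokens = [4, 0]
weighted = aggregate_loss(losses, valid_tokens)
unweighted = sum(losses) / len(losses)
```

Here the unweighted mean halves the loss (1.0 instead of 2.0) because the empty rank is counted as a full participant. Weighting by valid tokens avoids that distortion.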

Both Ulysses and Ring Attention use position_ids

disable_in_eval

Integration with Transformers Trainer

The Transformers Trainer

TrainingArguments.parallelism_config

Just pass the same parallelism_config

TrainingArguments

from transformers import TrainingArguments

training_args = TrainingArguments(
    parallelism_config=parallelism_config,  # same ParallelismConfig as above
    per_device_train_batch_size=1,
)

What the Trainer Handles Automatically

Dataloader Wrapping: After model preparation, the Trainer wraps the dataloader with UlyssesSPDataLoaderAdapter

Loss Computation: The compute_loss

_deepspeed_sp_compute_loss

Gathering losses across SP ranks

Computing valid token counts per rank

Weighted loss aggregation

Batch Size Calculation: The effective data parallel world size accounts for SP:

dp_world_size = world_size // sp_size

Dataloader Length Adjustment: Training step calculations are adjusted for SP's effect on iteration count

Use an accelerate config file or command-line arguments:

accelerate launch \
    --config_file deepspeed_ulysses.yaml \
    train.py \
    --per_device_train_batch_size 1

Integration with TRL SFTTrainer

TRL's SFTTrainer

Key SFTConfig Parameters for Ulysses

pad_to_multiple_of

Must equal sp_size

Global sequence length (before splitting across GPUs)

Works well with SP — packing reduces padding waste, especially for variable-length sequences

Accelerate Config File

Create alst_ulysses_4gpu.yaml

Complete Training Command

Shift Labels Handling

The SFTTrainer automatically handles pre-shifted labels when Ulysses is enabled:

# When using SP, labels are pre-shifted by the dataloader adapter
# The trainer detects this and uses shift_labels directly
labels = inputs["labels"] if "shift_labels" not in inputs else None

# Loss computation uses the pre-shifted labels
if "shift_labels" in inputs:
    shift_logits = outputs.logits.contiguous()
    shift_labels = inputs["shift_labels"]
else:
    shift_logits = outputs.logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

Comparing Ulysses and Ring Attention

Both Ulysses and Ring Attention enable long-context training, but they have different characteristics:

Ulysses (DeepSpeed)

Ring Attention (FSDP2)

Parallelism Method

Attention head partitioning

Ring-based KV exchange

Attention Support

FlashAttention 2/3, SDPA

P2P ring communication

Comm volume per GPU

O(total_seq x hidden / sp_size)

O(total_seq x hidden)

Sequence Divisibility

Num Head Constraint

num_heads >= sp_size

When to Choose Ulysses vs Ring Attention

Sequence Length Divisibility

Always ensure your sequence length is divisible by sp_size

training_args = SFTConfig(
    pad_to_multiple_of=4,  # For sp_size=4
    max_length=32768,  # Must be divisible by 4
)

Use Flash Attention

Flash Attention 2 provides cleaner output and better performance than SDPA:

parallelism_config = ParallelismConfig(
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_attn_implementation="flash_attention_2",
    ),
)

Use Flash Attention 3 for Hopper and look out for Flash Attention 4 release for Blackwell (FA2 on Blackwell is quite slow).

Combine with DeepSpeed ZeRO

For very large models, combine Ulysses with ZeRO Stage 3:

deepspeed_config:
  zero_stage: 3
  offload_optimizer:
    device: cpu

If the model is huge, you can offload the params as well by adding to the above:

  offload_param:
    device: cpu

Use memory fragmentation-friendly PyTorch allocator

This environment variable will allow for a longer sequence length:

export PYTORCH_ALLOC_CONF=expandable_segments:True

2D Parallelism Configuration

Balance SP and DP for your GPU count:

Balanced throughput and sequence length

Maximum sequence length

Higher throughput with moderate sequence length

Longer sequences with moderate throughput

Remember: dp_replicate_size × dp_shard_size × sp_size = num_processes

Liger-Kernel

If your desired model architecture is supported by Liger-Kernel, it is fully compatible with Ulysses SP and can be enabled with a single flag:

training_args = SFTConfig(
    use_liger_kernel=True,
)

The main memory saving comes from FusedLinearCrossEntropy

Additionally, you can enable TiledMLP

FusedLinearCrossEntropy

Token Distribution Across Ranks

The benchmark runs in the table above use the same global batch size (8 micro-batches), cosine learning-rate schedule, and seed, so those benchmark loss curves are directly comparable.

Loss Curve Matching Diagnostics (4 GPU)

To verify SP-vs-DP loss equivalence, we ran controlled 4-GPU A/B experiments with identical seed, model, optimizer, learning-rate schedule, and data order.

Methodology for Fair DP vs SP Comparison

Compared setups:

DP=4, SP=1, GAS=1

DP=1, SP=4, GAS=4

For fair comparison, GAS

Ulysses SP splits the sequence across SP

DP tokens/step: dp_world_size * micro_batch * seq_len * GAS = 4 * B * L * 1

SP tokens/step: dp_world_size * micro_batch * (L/SP) * GAS * SP_ranks = 1 * B * (L/4) * 4 * 4 = 4 * B * L

Measured over 20 steps on 4 GPUs in controlled equivalence harnesses:

DP vs SP setting

DP=4, SP=1 vs DP=1, SP=4

Takeaway: under matched token budget, SP and non-SP match on canonical token-normalized loss. The remaining difference is in trainer-reported logging (loss

Memory Reduction

Baseline — no SP

Similar memory at same seq length

4x longer than DP baseline

8x longer than DP baseline

12x longer than DP baseline

Exceeds 80 GB limit

Baseline (1 GPU)

HF Accelerate: deepspeed>=0.18.1 accelerate>=1.12

HF Trainer: deepspeed>=0.18.1 accelerate>=1.12 transformers>=5.0

HF TRL: deepspeed>=0.18.1 accelerate>=1.12 transformers>=5.0 trl>=0.18.0

Use flash_attention_2

flash_attention_3

flash_attention_4

Accelerate: Context Parallelism Guide

TRL: Distributing Training

DeepSpeed Sequence Parallelism

Accelerate ALST Example

TRL Accelerate Configs

Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Related Blog Posts

Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training

Understanding Ulysses and Ring Attention

Enabling Long-Context Training with Sequence Parallelism in Axolotl
