TLDR AI · May 19, 2026, 09:00 · approx. 11 min read

HRM-Text (GitHub Repo): A Text Generation Model That Sharply Reduces Compute and Data Requirements

#LLM #Efficient Training #Open Source #Model Architecture #Cost Reduction

TL;DR

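HRM-Text adopts a new architecture that needs far less compute and data than conventional foundation models, making it possible to train a capable text generation model at low cost.

AI in-depth analysis · May 20, 2026, 00:05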

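Key Points

Dramatic resource reduction

Compared with conventional foundation models, pretraining needs 130-600x less compute and 150-900x less data, so it does not require large-scale infrastructure.

Very low training cost

The 0.6B model trains on 8 H100s in about 50 hours (about $800), and the 1B model on 16 H100s in about 46 hours (about $1,472), a budget within reach of individuals and small teams.

The HRM architecture

HRM-Text is built on the HRM architecture, which the README describes as a hierarchical recurrent architecture without expanding the acronym; this architecture is what enables the efficient pretraining.

Democratizing pretraining

The approach makes foundation model pretraining accessible to more researchers and developers and could broaden participation in AI research.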

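Impact Analysis

By sharply lowering the cost of developing language models, this work could contribute substantially to democratizing AI research and development. If small companies and individual researchers can train their own capable models, the AI ecosystem stands to become more diverse and competitive.

Editor's Comment

Easing the compute constraint would change who can attempt pretraining. The concrete cost and time figures are especially useful when deciding whether to attempt a run.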

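HRM-Text: Efficient Pretraining Beyond Scaling

🌟 Pretrain a foundation model from scratch with ~$1000. 🌠

HRM-Text is a 1B text generation model based on the HRM architecture, strengthened by task completion and latent space reasoning. It provides a full pretraining framework that makes foundation model pretraining accessible with 130-600x less compute and 150-900x less data. It is built on a hierarchical recurrent architecture, PrefixLM sequence packing, FlashAttention 3 kernels, and PyTorch FSDP2 training, and it ships with evaluation and checkpoint conversion tooling.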

Launch the Pretraining 🚀

Required Resources

Choose a target size and prepare the corresponding GPU nodes.

L, 0.6B parameters: 8 H100s, single node, about 50 hours (~$800).

XL, 1B parameters: 16 H100s, two nodes, about 46 hours (~$1472).

*Price estimates assume $2 per H100-hour.*
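The quoted prices follow from GPUs × wall-clock hours × hourly rate. A minimal sketch of that arithmetic, using only the figures stated above (the function name is illustrative):

```python
# Cost estimate: number of GPUs x wall-clock hours x hourly price per GPU.
# The $2/H100-hour rate and the GPU/hour figures are the ones quoted above.
HOURLY_RATE_USD = 2.0


def training_cost(num_gpus: int, hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Return the estimated cost in USD for one pretraining run."""
    return num_gpus * hours * rate


runs = {"L (0.6B)": (8, 50), "XL (1B)": (16, 46)}
for name, (gpus, hours) in runs.items():
    print(f"{name}: {gpus * hours} GPU-hours, ${training_cost(gpus, hours):,.0f}")
```

The XL run finishes sooner in wall-clock time but uses 736 GPU-hours against 400 for L, which is why it costs more.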

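The following are benchmark results from the reference runs.

Size | GPUs | Time | GSM8k | MATH | DROP | MMLU | ARC-C | HellaSwag | Winogrande | BoolQ
---|---|---|---|---|---|---|---|---|---|---
L (0.6B) | 8 | 50 hrs | 77.6% | 51.2% | 78.6% | 56.6% | 75.9% | 52.7% | 67.6% | 85.0%
XL (1B) | 16 | 46 hrs | 84.7% | 56.5% | 82.3% | 60.7% | 81.9% | 63.4% | 72.4% | 86.2%

Hopper-class GPUs are the expected training target because the attention path depends on FlashAttention 3.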

1. Prepare Data

HRM-Text trains on sampled, tokenized data produced by the companion data_io pipeline. Use data_io to clean, tokenize, and stratified-sample the pretraining corpus, then point HRM-Text at the sampled output.

Recommended setups:

Single node: run the data pipeline and pretraining on the same node. After tokenization, stratified-sample into that node's shared memory at /dev/shm/sampled.

Multi-node: keep data_io and the tokenized data on shared storage. Mount or expose that directory on every pretraining node, then run stratified sampling independently on each node. Sampling is fast and deterministic, so every node produces the same in-memory training data.

First set up data_io, then run the pipeline. After tokenization, run stratified sampling on each training node.

cd <DATA_IO_PATH>
python sample_tokenized.py epochs=4 output_path=/dev/shm/sampled > show_analytics.md

HRM-Text uses 4 training epochs by default. If you change epochs in the training config, change the sampling command to match.
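Because each node samples independently in the multi-node setup, it can be worth confirming that every node really ended up with identical data before launching. The sketch below is not part of data_io or HRM-Text; it hashes every file under the sampled directory so the digests can be compared across nodes. The function name is an assumption for illustration.

```python
import hashlib
from pathlib import Path


def directory_digest(root: str) -> str:
    """Hash relative file names and file contents under root, in sorted order."""
    digest = hashlib.sha256()
    base = Path(root)
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        # Include the relative name so renamed files change the digest.
        digest.update(str(path.relative_to(base)).encode())
        with path.open("rb") as handle:
            # Read in 1 MiB chunks to keep memory use flat on large token files.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


# Usage on each node (path taken from the sampling command above):
#   print(directory_digest("/dev/shm/sampled"))
```

If the printed digests differ between nodes, the sampling inputs or the epochs setting probably differ as well.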

2. Start the Environment

Set up the same environment on every pretraining node.

Recommended: Docker

We recommend running through the published Docker image that contains the full environment. Make sure Docker can see your GPUs, for example through NVIDIA Container Toolkit.

From the repo's directory:

docker run --gpus all --ipc=host --network=host -it \
  -v "$PWD":/workspace \
  sapientai/hrm-text:latest

For multi-node runs, mount the same shared workspace on every node. Keeping the code, tokenized data, and checkpoint directory at identical paths avoids version drift between ranks and makes FSDP2 checkpointing straightforward. A common layout is:

/shared/
|-- HRM-Text/
|   |-- checkpoints/
|-- data_io/

Alternative: Install from Source

If you are not using Docker, first install PyTorch, CUDA, and FlashAttention 3. The tested versions are documented in docker/Dockerfile.

Then install the Python dependencies:

pip install -r requirements.txt

Check Distributed Communication

For multi-node runs, verify NCCL before starting a long job. At minimum, confirm that torchrun can initialize across the intended nodes. If your cluster provides nccl-tests, run both intra-node and inter-node bandwidth checks.
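One way to confirm that torchrun can initialize across nodes is a tiny all-reduce script. This is a sketch rather than something shipped with HRM-Text: the file name nccl_smoke_test.py and the helper names are assumptions, and it relies only on standard torch.distributed calls plus the LOCAL_RANK variable that torchrun sets.

```python
import os


def expected_sum(world_size: int) -> int:
    """Each rank contributes rank + 1, so the reduced sum is 1 + 2 + ... + world_size."""
    return world_size * (world_size + 1) // 2


def main() -> None:
    # Imported here so the helper above is usable without PyTorch installed.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")  # rank and world size come from torchrun's env vars
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    value = torch.tensor([dist.get_rank() + 1], device="cuda")
    dist.all_reduce(value)  # the default reduction is SUM
    world_size = dist.get_world_size()
    assert value.item() == expected_sum(world_size), "all_reduce returned an unexpected value"
    if dist.get_rank() == 0:
        print(f"NCCL all_reduce OK across {world_size} ranks")
    dist.destroy_process_group()


# Only run under torchrun, which sets LOCAL_RANK for every worker.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

Launch it with the same torchrun flags you plan to use for pretraining, substituting the script name for pretrain.py. A hang or timeout here usually points at networking rather than at the training code.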

Set Up W&B Tracking

HRM-Text logs training metrics to Weights & Biases. Log in before launching training:

wandb login

For headless runs, get an API key from https://wandb.ai/authorize and run:

wandb login <API_KEY>

3. Launch Pretraining

For the L-size reference run on one 8xH100 node:

OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 \
torchrun --nproc_per_node=8 pretrain.py arch/size@arch=L lr=2.5e-4 global_batch_size=172032

For the XL-size reference run on two 8xH100 nodes, run this on each node:

OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 \
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=<NODE_RANK> \
  --master_addr=<MASTER_ADDR> \
  --master_port=<MASTER_PORT> \
  pretrain.py

Checkpoints are saved every epoch under checkpoints/. In multi-node runs each node saves only its own shard, so we recommend mounting shared storage.
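In the two-node launch, the only value that differs between nodes is node_rank; the address and port must be identical everywhere. The helper below is a hypothetical convenience, not part of the repository, that prints the command for each node from one set of inputs. The address 10.0.0.1 and port 29500 are placeholders.

```python
import shlex


def torchrun_command(node_rank: int, master_addr: str, master_port: int,
                     nnodes: int = 2, nproc_per_node: int = 8,
                     overrides: tuple = ()) -> list:
    """Build the torchrun argv for one node; only node_rank varies between nodes."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    return [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        "pretrain.py",
        *overrides,  # optional Hydra-style key=value overrides
    ]


# Placeholder address and port; substitute the rank-0 node's reachable address.
for rank in range(2):
    print(shlex.join(torchrun_command(rank, "10.0.0.1", 29500)))
```

Generating both commands from a single place makes it harder to launch the nodes with mismatched master settings.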

4. Evaluate

Evaluation loads the latest checkpoint epoch automatically when ckpt_epoch is not provided:

python -m evaluation.main ckpt_path="checkpoints/..."

To run a specific set of benchmarks, append run_only=[MATH,DROP,ARC,MMLU] to the command.

Evaluation typically needs one 80 GB GPU. If it runs out of memory, lower the batch size by adding generation_config.batch_size=16.

The evaluation scripts use Hugging Face datasets, so benchmark data is downloaded on demand.
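The "latest checkpoint epoch" lookup can be pictured as a scan over the checkpoint directory. The sketch below is purely illustrative: it assumes sub-directories named epoch_<N>, which is a hypothetical layout and may not match how HRM-Text actually names its checkpoints.

```python
import re
from pathlib import Path
from typing import Optional

# ASSUMPTION: checkpoints live in sub-directories named "epoch_<N>".
# The real naming scheme used by HRM-Text may differ.
EPOCH_PATTERN = re.compile(r"^epoch_(\d+)$")


def latest_epoch(ckpt_path: str) -> Optional[int]:
    """Return the highest epoch number found under ckpt_path, or None if there is none."""
    epochs = []
    for entry in Path(ckpt_path).iterdir():
        match = EPOCH_PATTERN.match(entry.name)
        if entry.is_dir() and match:
            epochs.append(int(match.group(1)))
    # Compare numerically so that epoch_10 sorts after epoch_2.
    return max(epochs) if epochs else None
```

Passing ckpt_epoch explicitly remains the way to evaluate anything other than the most recent epoch.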

5. Export to Transformers Format

python -m conversion.convert_to_hf \
  --ckpt_path "checkpoints/..." \
  --out_dir "<OUTPUT_PATH>"

For evaluation and export, EMA weights are used by default when EMA is present in the checkpoint.
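For readers unfamiliar with EMA weights: an exponential moving average keeps a smoothed copy of the parameters alongside the trained ones. The sketch below shows the standard update rule and the "prefer EMA when present" selection using plain floats. The decay value and the dictionary keys "ema" and "model" are assumptions for illustration, not the repository's actual checkpoint format.

```python
def ema_update(ema: dict, weights: dict, decay: float = 0.999) -> dict:
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * weights."""
    return {name: decay * ema[name] + (1.0 - decay) * weights[name] for name in weights}


def select_weights(checkpoint: dict) -> dict:
    """Prefer the EMA copy when the checkpoint carries one, else the raw weights."""
    # ASSUMPTION: "ema" and "model" are hypothetical key names.
    return checkpoint["ema"] if "ema" in checkpoint else checkpoint["model"]


ema = {"w": 1.0}
for step_weights in ({"w": 0.0}, {"w": 0.0}):
    ema = ema_update(ema, step_weights, decay=0.9)
print(ema)  # the average decays toward the new weights gradually
```

The smoothed copy changes slowly, which is why it is commonly preferred for evaluation and export.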

Status

Training, checkpointing, and evaluation are implemented in this repository.

Transformers-format export is implemented in conversion/convert_to_hf.py.

Native Transformers model support is merged and scheduled for the next release.

Native vLLM support for HRM-Text checkpoints is in progress.

Training Overrides

The default pretraining config is config/cfg_pretrain.yaml.

If project_name, run_name, or checkpoint_path is omitted, rank 0 derives it from the dataset path, architecture name, and a generated slug.

Hydra overrides can be passed directly on the command line:

# Train a vanilla Transformer architecture, size L
torchrun --nproc_per_node=8 pretrain.py \
  arch/net@arch=transformer \
  arch/size@arch=L

Model Configurations

Architectures live under config/arch/net:

Config | Model
---|---
hrm | HRM-Text
transformer | Standard Transformer wrapper
trm | Tiny Recursive Model baseline
trm_match_recurrence | TRM configured to match HRM recurrence with half the parameters
rins | Recursive Inference Scaling (RINS) baseline
ut | Universal Transformer baseline

Sizes live under config/arch/size:

Config | Layers | Hidden | Heads
---|---|---|---
B | 12 | 1024 | 8
L | 24 | 1280 | 10
XL | 32 | 1536 | 12
XXL | 72 | 1792 | 14
XXL_wide | 32 | 2560 | 20

For HRM and RINS, half_layers: true splits the configured layer count evenly between the H and L modules.
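A quick way to read the size table is to derive the per-head dimension and a rough parameter count from it. The sketch below uses the table values as given. The parameter estimate is an assumption: it uses the common rule of thumb of about 12·d² parameters per Transformer layer (roughly 4·d² for attention projections and 8·d² for the MLP), ignores embeddings and any gating parameters, and is not taken from the repository.

```python
# (layers, hidden, heads) copied from the size table above.
SIZES = {
    "B": (12, 1024, 8),
    "L": (24, 1280, 10),
    "XL": (32, 1536, 12),
    "XXL": (72, 1792, 14),
    "XXL_wide": (32, 2560, 20),
}


def head_dim(hidden: int, heads: int) -> int:
    """Per-head width; hidden must divide evenly across heads."""
    assert hidden % heads == 0
    return hidden // heads


def approx_block_params(layers: int, hidden: int) -> int:
    """Rule-of-thumb estimate (assumption): about 12 * d^2 parameters per layer."""
    return 12 * layers * hidden * hidden


def split_half_layers(layers: int) -> tuple:
    """half_layers: true divides the configured layers evenly between the H and L modules."""
    assert layers % 2 == 0
    return layers // 2, layers // 2


for name, (layers, hidden, heads) in SIZES.items():
    estimate = approx_block_params(layers, hidden) / 1e9
    print(name, head_dim(hidden, heads), f"~{estimate:.2f}B", split_half_layers(layers))
```

Every size works out to a head dimension of 128. The rough estimate gives about 0.47B for L and 0.91B for XL before embeddings, which is in the same range as the quoted 0.6B and 1B.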

Repository Layout

HRM-Text/
|-- config/                        # Hydra configs for model, data, and training
|-- conversion/convert_to_hf.py    # FSDP2 checkpoint -> HF-style export
|-- evaluation/                    # Evaluation engines, benchmark wrappers, configs
|-- models/                        # HRM, recurrent baselines, Transformer blocks, LM head
|-- docker/                        # Tested CUDA/PyTorch/FlashAttention environment
|-- dataset_new.py                 # PrefixLM packed dataset loader
|-- multipack_sampler.py           # Distributed multipack batch sampler
|-- pretrain.py                    # FSDP2 pretraining entrypoint
|-- simple_inference_engine.py     # Checkpoint loader and compiled generation engine
`-- requirements.txt

Technical Notes

dataset_new.py loads sampled tokens.npy and per-epoch index arrays, builds PrefixLM batches, masks instruction tokens by default, and emits FlashAttention sequence metadata.

multipack_sampler.py implements distributed multipack batching with LPT allocation to improve token-slot utilization and balance quadratic attention work.

models/flash_attention_prefixlm_v2.py implements the two-pass PrefixLM attention path: one bidirectional pass over the prefix region and one causal pass over the response region.

models/layers.py contains RoPE, gated multi-head attention, SwiGLU MLPs, static KV cache helpers, and initialization utilities.

models/baselines/hrm_nocarry_bp_warmup.py contains the main HRM-Text architecture.

models/lm_head.py attaches scaled embeddings, the output head, cross-entropy loss, token accuracy, and sequence exact accuracy.

pretrain.py handles FSDP2 wrapping, optimizer creation, LR schedule, W&B logging, code/config snapshots, and distributed checkpointing.
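The LPT allocation mentioned for multipack_sampler.py refers to the longest-processing-time-first heuristic: sort items by cost, then repeatedly hand the next one to the least-loaded worker. The sketch below is a generic illustration, not the repository's sampler. It assumes a cost of length squared as a stand-in for quadratic attention work and ignores the per-rank token capacity that a real multipack sampler would also enforce.

```python
import heapq


def lpt_assign(lengths: list, num_ranks: int) -> list:
    """Greedy LPT: place each sequence, longest first, on the currently least-loaded rank."""
    # Cost model (assumption): attention work grows with sequence length squared.
    heap = [(0, rank) for rank in range(num_ranks)]  # entries are (accumulated cost, rank)
    heapq.heapify(heap)
    buckets = [[] for _ in range(num_ranks)]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        load, rank = heapq.heappop(heap)
        buckets[rank].append(idx)
        heapq.heappush(heap, (load + lengths[idx] ** 2, rank))
    return buckets


lengths = [4096, 512, 2048, 1024, 3072, 256, 2048, 128]
for rank, bucket in enumerate(lpt_assign(lengths, 2)):
    print(rank, bucket, sum(lengths[i] ** 2 for i in bucket))
```

Balancing on squared length rather than raw token count keeps one rank from being stuck with all the long sequences, which would otherwise stall the others at each synchronization point.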

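Contributions

We welcome contributions that make HRM-Text faster, stronger, or easier to use.

Please send data-pipeline changes to the companion data_io project. Send model, training, inference, evaluation, conversion, infrastructure, and documentation changes here.

Recommended PR categories:

Docs and tutorials: clarify setup, data prep, launch recipes, evaluation, or checkpoint conversion.

Evaluation and inference: add benchmark wrappers, improve generation throughput, reduce VRAM, or improve result reporting.

Training infrastructure: improve FSDP2 stability, efficiency, checkpointing, launch ergonomics, logging, or cluster portability.

Model and optimizer changes: improve the architecture, recurrence schedule, initialization, attention path, optimizer, or training hyperparameters.

For changes that alter pretraining behavior, we strongly recommend running pretraining at an appropriate scale and including downstream benchmark comparisons against the reference.

For infrastructure changes intended to be behavior-preserving, include before/after speed, memory, or stability measurements and show that benchmark quality does not regress.

For model-quality changes, we evaluate whether the change improves the Pareto frontier of training compute versus performance. Strict improvements and high-ROI changes are good candidates for defaults; valuable tradeoffs with higher cost or lower performance may belong in separate configs.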
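To make the Pareto criterion concrete: a configuration is on the frontier if no other configuration reaches an equal or better score with equal or less compute. The sketch below is a generic illustration. The function name is an assumption, and the numbers in the example are arbitrary placeholders, not measurements from HRM-Text.

```python
def pareto_frontier(points: list) -> list:
    """Keep (compute, score) points that no cheaper-or-equal point matches or beats."""
    frontier = []
    best_score = float("-inf")
    # Sort by compute ascending; on ties, look at the higher score first.
    for compute, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:
            frontier.append((compute, score))
            best_score = score
    return frontier


# Arbitrary placeholder values: (relative compute, benchmark score).
candidates = [(1, 10), (2, 9), (3, 12), (4, 11)]
print(pareto_frontier(candidates))
```

In this toy example the second and fourth candidates cost more than a neighbor while scoring lower, so they fall off the frontier. Under the policy above, a change like that would belong in a separate config rather than the defaults.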

Citation

Citation information will be added with the accompanying paper.

License

Apache License 2.0

Original text (English)

HRM-Text: Efficient Pretraining Beyond Scaling

🌟 Pretrain a foundation model from scratch with ~$1000. 🌠

HRM-Text is a 1B text generation model based on the HRM architecture, strengthened by task completion and latent space reasoning. It offers a full pretraining framework, making foundation model pretraining accessible with 130-600x less compute and 150-900x less data. It is built upon a hierarchical recurrent architecture, PrefixLM sequence packing, FlashAttention 3 kernels, PyTorch FSDP2 training, evaluation, and checkpoint conversion tooling.

Launch the Pretraining 🚀

Required Resources

Choose a target size and prepare the corresponding GPU nodes.

L, 0.6B parameters: 8 H100s, single node, about 50 hours (~$800).

XL, 1B parameters: 16 H100s, two nodes, about 46 hours (~$1472).

*Price estimation based on $2/H100 hour.*

The following are benchmark results from the reference runs.

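Size | GPUs | Time | GSM8k | MATH | DROP | MMLU | ARC-C | HellaSwag | Winogrande | BoolQ
---|---|---|---|---|---|---|---|---|---|---
L (0.6B) | 8 | 50 hrs | 77.6% | 51.2% | 78.6% | 56.6% | 75.9% | 52.7% | 67.6% | 85.0%
XL (1B) | 16 | 46 hrs | 84.7% | 56.5% | 82.3% | 60.7% | 81.9% | 63.4% | 72.4% | 86.2%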

Hopper-class GPUs are the expected training target because the attention path depends on FlashAttention 3.

1. Prepare Data

HRM-Text trains from sampled, tokenized data produced by the companion data_io pipeline. Use data_io to clean, tokenize, and stratified-sample the pretraining corpus, then point HRM-Text at the sampled output.

Recommended setups:

Single node: run the data pipeline and pretraining on the same node. After tokenization, stratified-sample into that node's shared memory at /dev/shm/sampled.

Multi-node: keep data_io and the tokenized data on shared storage. Mount or expose that directory on every pretraining node, then run stratified sampling independently on each node. Sampling is fast and deterministic, so every node produces the same in-memory training data.

First set up data_io, then run the pipeline. After tokenization, run stratified sampling on each training node.

cd <DATA_IO_PATH>
python sample_tokenized.py epochs=4 output_path=/dev/shm/sampled > show_analytics.md

HRM-Text uses 4 training epochs by default. If you change epochs in the training config, change the sampling command to match.

2. Start the Environment

Set up the same environment on every pretraining node.

Recommended: Docker

We recommend running through the published Docker image that contains the full environment. Make sure Docker can see your GPUs, for example through NVIDIA Container Toolkit.

From the repo's directory:

docker run --gpus all --ipc=host --network=host -it \
  -v "$PWD":/workspace \
  sapientai/hrm-text:latest

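For multi-node runs, mount the same shared workspace on every node. Keeping the code, tokenized data, and checkpoint directory at identical paths avoids version drift between ranks and makes FSDP2 checkpointing straightforward. A common layout is:

/shared/
|-- HRM-Text/
|   |-- checkpoints/
|-- data_io/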

Alternative: Install from Source

If you are not using Docker, first install PyTorch, CUDA, and FlashAttention 3. The tested versions are documented in docker/Dockerfile.

Then install the Python dependencies:

pip install -r requirements.txt

Check Distributed Communication

For multi-node runs, verify NCCL before starting a long job. At minimum, confirm that torchrun can initialize across the intended nodes. If your cluster provides nccl-tests, run both intra-node and inter-node bandwidth checks.

Set Up W&B Tracking

HRM-Text logs training metrics to Weights & Biases. Log in before launching training:

wandb login

For headless runs, get an API key from https://wandb.ai/authorize and run:

wandb login <API_KEY>

3. Launch Pretraining

For the L-size reference run on one 8xH100 node:

OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 \
torchrun --nproc_per_node=8 pretrain.py arch/size@arch=L lr=2.5e-4 global_batch_size=172032

For the XL-size reference run on two 8xH100 nodes, run this on each node:

OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 \
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=<NODE_RANK> \
  --master_addr=<MASTER_ADDR> \
  --master_port=<MASTER_PORT> \
  pretrain.py

Checkpoints are saved every epoch under checkpoints/. In multi-node runs each node saves only its own shard, so we recommend mounting shared storage.

4. Evaluate

Evaluation loads the latest checkpoint epoch automatically when ckpt_epoch is not provided:

python -m evaluation.main ckpt_path="checkpoints/..."

To run a specific set of benchmarks, append run_only=[MATH,DROP,ARC,MMLU] to the command.

Evaluation typically needs one 80 GB GPU. If it runs out of memory, lower the batch size by adding generation_config.batch_size=16.

The evaluation scripts use Hugging Face datasets, so benchmark data is downloaded on demand.

5. Export to Transformers Format

python -m conversion.convert_to_hf \
  --ckpt_path "checkpoints/..." \
  --out_dir "<OUTPUT_PATH>"

For evaluation and export, EMA weights are used by default when EMA is present in the checkpoint.

Status

Training, checkpointing, and evaluation are implemented in this repository.

Transformers-format export is implemented in conversion/convert_to_hf.py.

Native Transformers model support is merged and scheduled for the next release.

Native vLLM support for HRM-Text checkpoints is in progress.

Training Overrides

The default pretraining config is config/cfg_pretrain.yaml:

If project_name, run_name, or checkpoint_path are omitted, rank 0 derives them from the dataset path, architecture name, and a generated slug.

Hydra overrides can be passed directly on the command line:

# Train a vanilla Transformer architecture, size L
torchrun --nproc_per_node=8 pretrain.py \
  arch/net@arch=transformer \
  arch/size@arch=L

Model Configurations

Architectures live under config/arch/net:

Config | Model
---|---
hrm | HRM-Text
transformer | Standard Transformer wrapper
trm | Tiny Recursive Model baseline
trm_match_recurrence | TRM configured to match HRM recurrence with half the parameters
rins | Recursive Inference Scaling (RINS) baseline
ut | Universal Transformer baseline

Sizes live under config/arch/size:

Config | Layers | Hidden | Heads
---|---|---|---
B | 12 | 1024 | 8
L | 24 | 1280 | 10
XL | 32 | 1536 | 12
XXL | 72 | 1792 | 14
XXL_wide | 32 | 2560 | 20

For HRM and RINS, half_layers: true splits the configured layer count evenly between the H and L modules.

Repository Layout

HRM-Text/
|-- config/                       # Hydra configs for model, data, and training
|-- conversion/convert_to_hf.py    # FSDP2 checkpoint -> HF-style export
|-- evaluation/                    # Evaluation engines, benchmark wrappers, configs
|-- models/                        # HRM, recurrent baselines, Transformer blocks, LM head
|-- docker/                        # Tested CUDA/PyTorch/FlashAttention environment
|-- dataset_new.py                 # PrefixLM packed dataset loader
|-- multipack_sampler.py           # Distributed multipack batch sampler
|-- pretrain.py                    # FSDP2 pretraining entrypoint
|-- simple_inference_engine.py     # Checkpoint loader and compiled generation engine
`-- requirements.txt

Technical Notes

dataset_new.py loads sampled tokens.npy and per-epoch index arrays, builds PrefixLM batches, masks instruction tokens by default, and emits FlashAttention sequence metadata.

multipack_sampler.py implements distributed multipack batching with LPT allocation to improve token-slot utilization and balance quadratic attention work.

models/flash_attention_prefixlm_v2.py implements the two-pass PrefixLM attention path: one bidirectional pass over the prefix region and one causal pass over the response region.

models/layers.py contains RoPE, gated multi-head attention, SwiGLU MLPs, static KV cache helpers, and initialization utilities.

models/baselines/hrm_nocarry_bp_warmup.py contains the main HRM-Text architecture.

models/lm_head.py attaches scaled embeddings, the output head, cross-entropy loss, token accuracy, and sequence exact accuracy.

pretrain.py handles FSDP2 wrapping, optimizer creation, LR schedule, W&B logging, code/config snapshots, and distributed checkpointing.
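The PrefixLM attention pattern described for models/flash_attention_prefixlm_v2.py can be visualized as a dense mask. The sketch below is not the repository's two-pass FlashAttention implementation. It builds the standard prefix-LM mask under the assumption that prefix tokens see the whole prefix and response tokens see the prefix plus earlier response tokens.

```python
def prefix_lm_mask(prefix_len: int, total_len: int) -> list:
    """mask[q][k] is True when query position q may attend to key position k."""
    mask = []
    for q in range(total_len):
        row = []
        for k in range(total_len):
            if q < prefix_len:
                row.append(k < prefix_len)  # prefix: bidirectional within the prefix
            else:
                row.append(k <= q)          # response: causal over everything up to q
        mask.append(row)
    return mask


# Two prefix tokens followed by two response tokens.
for row in prefix_lm_mask(prefix_len=2, total_len=4):
    print("".join("x" if allowed else "." for allowed in row))
```

The top-left block is fully filled (bidirectional prefix) and the remaining rows are lower-triangular (causal response), the same two regions that the two-pass kernel handles as separate passes.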

Contributions

We welcome contributions that make HRM-Text faster, stronger, or easier to use.

Please send data-pipeline changes to the companion data_io project. Send model, training, inference, evaluation, conversion, infrastructure, and documentation changes here.

Recommended PR categories:

Docs and tutorials: clarify setup, data prep, launch recipes, evaluation, or checkpoint conversion.

Evaluation and inference: add benchmark wrappers, improve generation throughput, reduce VRAM, or improve result reporting.

Training infrastructure: improve FSDP2 stability, efficiency, checkpointing, launch ergonomics, logging, or cluster portability.

Model and optimizer changes: improve the architecture, recurrence schedule, initialization, attention path, optimizer, or training hyperparameters.

For changes that alter pretraining behavior, we strongly recommend running pretraining at an appropriate scale and including downstream benchmark comparisons against the reference.

For infrastructure changes intended to be behavior-preserving, include before/after speed, memory, or stability measurements and show that benchmark quality does not regress.

For model-quality changes, we evaluate whether the change improves the Pareto frontier of training compute versus performance. Strict improvements and high-ROI changes are good candidates for defaults; valuable tradeoffs with higher cost or lower performance may belong in separate configs.

Citation

Citation information will be added with the accompanying paper.

License

Apache License 2.0


