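Hugging Face Blog · March 10, 2026 09:00 · approx. 12 min read

Introducing Storage Buckets on the Hugging Face Hub

#MLOps #Storage Optimization #Deduplication #Hugging Face Hub #Production ML #Data Management

TL;DR

The Hugging Face Hub has introduced Storage Buckets, non-versioned storage for efficiently managing the intermediate files of machine learning workloads, with deduplication and fast transfers provided by the Xet backend.

AI Deep Analysis · March 11, 2026 02:43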

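Key points

Purpose of Storage Buckets

Storage Buckets are designed to manage the intermediate files that production ML frequently generates and updates (checkpoints, optimizer states, logs, and so on) as mutable, fast object storage rather than through a Git-style version control system.

Technical advantages of the Xet backend

Xet splits files into chunks and deduplicates them, which cuts the bandwidth needed to transfer similar data and improves both storage efficiency and transfer speed. It is especially effective for successive checkpoints and processed datasets.

Practicality and cost efficiency

For Enterprise customers, billing is based on storage usage after deduplication, which lowers cost while improving performance.

Optimized for machine learning workloads

Buckets are positioned as a storage solution for a range of production ML scenarios, including training clusters, data pipelines, and agent trace storage.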

Pre-warming for Performance Optimization

Pre-warming allows users to bring data closer to their compute resources by declaring the needed region, ensuring data is already there when jobs start, which is crucial for distributed training and multi-region setups.

Multi-Platform and Language Support

Buckets support integration via CLI, Python (huggingface_hub), and JavaScript (@huggingface/hub), with filesystem access through HfFileSystem, enabling programmatic management in various environments.

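Integration with existing workflows

Through the standard fsspec interface, libraries such as pandas, Polars, and Dask can read from and write to Buckets directly using the hf:// protocol, so Buckets can be integrated without changing existing data workflows.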

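Impact analysis

This announcement directly addresses practical data-management problems in production ML and is an important step in moving the Hugging Face platform from a model and dataset repository toward broader MLOps infrastructure. The efficiency gained from deduplication in particular could deliver significant cost and performance value to organizations that run large-scale training or data processing.

Editor's comment

A practical update that puts a spotlight on an unglamorous but important infrastructure problem in real-world ML. As Hugging Face keeps extending the platform, this is a strategic move that strengthens both developer experience and enterprise adoption.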

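Hugging Face Models and Datasets repos are great for publishing final artifacts. But production ML generates a constant stream of intermediate files (checkpoints, optimizer states, processed shards, logs, traces, etc.) that change often, arrive from many jobs at once, and rarely need version control.

Storage Buckets are built exactly for this: mutable, S3-like object storage you can browse on the Hub, script from Python, or manage with the hf CLI.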

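Why we built Buckets

Git starts to feel like the wrong abstraction pretty quickly when you're dealing with:

Training clusters writing checkpoints and optimizer states throughout a run
Data pipelines processing raw datasets iteratively
Agents storing traces, memory, and shared knowledge graphs

The storage need in all these cases is the same: write fast, overwrite when needed, sync directories, remove stale files, and keep things moving.

A Bucket is a non-versioned storage container on the Hub. It lives under a user or organization namespace, has standard Hugging Face permissions, can be private or public, has a page you can open in your browser, and can be addressed programmatically with a handle like hf://buckets/username/my-training-bucket.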
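
To make the handle layout concrete, here is a minimal sketch that splits such a handle into its parts. The `BucketHandle` type and `parse_bucket_handle` helper are hypothetical names used for illustration only and are not part of huggingface_hub; the layout `hf://buckets/<namespace>/<bucket>/<path>` is inferred from the example handles in this post.

```python
from typing import NamedTuple


class BucketHandle(NamedTuple):
    """Hypothetical container for the parts of a bucket handle."""
    namespace: str  # user or organization
    bucket: str     # bucket name
    path: str       # path inside the bucket, "" for the bucket root


def parse_bucket_handle(handle: str) -> BucketHandle:
    """Split an hf://buckets/<namespace>/<bucket>[/<path>] handle into parts."""
    prefix = "hf://buckets/"
    if not handle.startswith(prefix):
        raise ValueError(f"not a bucket handle: {handle!r}")
    parts = handle[len(prefix):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected a namespace and bucket name in {handle!r}")
    path = parts[2] if len(parts) == 3 else ""
    return BucketHandle(parts[0], parts[1], path.strip("/"))


handle = parse_bucket_handle("hf://buckets/username/my-training-bucket/checkpoints")
print(handle.namespace, handle.bucket, handle.path)
```

A helper like this is only useful for logging or validation on the client side; the CLI and Python client shown later accept the handle string directly.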

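Why Xet matters

Buckets are built on Xet, Hugging Face's chunk-based storage backend, and this matters more than it might seem.

Instead of treating files as monolithic blobs, Xet breaks content into chunks and deduplicates across them. Upload a processed dataset that's mostly similar to the raw one? Many chunks already exist. Store successive checkpoints where large parts of the model are frozen? Same story. Buckets skip the bytes that are already there, which means less bandwidth, faster transfers, and more efficient storage.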
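
The idea of skipping bytes that already exist can be illustrated with a toy content-addressed store. This is a sketch under simplifying assumptions, not Xet's implementation: the post does not describe how Xet chooses chunk boundaries, so the example uses tiny fixed-size chunks and SHA-256 digests, and `ChunkStore` is a hypothetical class.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: each unique chunk is kept once."""

    def __init__(self, chunk_size: int = 4) -> None:
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}

    def upload(self, data: bytes) -> tuple[list[str], int]:
        """Store data; return (chunk digests, bytes actually transferred)."""
        manifest = []
        sent = 0
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # Only chunks the store has never seen cost bandwidth.
                self.chunks[digest] = chunk
                sent += len(chunk)
            manifest.append(digest)
        return manifest, sent

    def download(self, manifest: list[str]) -> bytes:
        """Rebuild a file from its list of chunk digests."""
        return b"".join(self.chunks[digest] for digest in manifest)


store = ChunkStore(chunk_size=4)
ckpt_1 = b"AAAABBBBCCCCDDDD"  # first checkpoint
ckpt_2 = b"AAAABBBBCCCCEEEE"  # next checkpoint, only the last chunk changed
manifest_1, sent_1 = store.upload(ckpt_1)
manifest_2, sent_2 = store.upload(ckpt_2)
print(sent_1, sent_2)  # 16 4
```

The second upload transfers only the one chunk that changed, while both checkpoints remain fully reconstructible from their manifests.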

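This is a natural fit for ML workloads. Training pipelines constantly produce families of related artifacts (raw and processed data, successive checkpoints, agent traces and derived summaries), and Xet is designed to take advantage of that overlap.

For Enterprise customers, billing is based on deduplicated storage, so shared chunks directly reduce the billed footprint. Deduplication helps with both speed and cost.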
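
As a rough worked example of how shared chunks shrink the billed footprint, the sketch below compares the logical size of a set of files with the size of their unique chunks. The chunk identifiers, the 100 MB chunk size, and the `footprint` helper are illustrative assumptions; no real prices or Xet chunk sizes are implied.

```python
def footprint(files: dict[str, list[tuple[str, int]]]) -> tuple[int, int]:
    """Return (logical_size, deduplicated_size) for files given as (chunk_id, size) lists."""
    logical = sum(size for chunks in files.values() for _, size in chunks)
    unique: dict[str, int] = {}
    for chunks in files.values():
        for chunk_id, size in chunks:
            unique[chunk_id] = size  # a chunk shared by several files is counted once
    return logical, sum(unique.values())


MB = 1
# Three successive checkpoints: chunks a, b, c are frozen, the last chunk changes.
checkpoints = {
    "step-100.bin": [("a", 100 * MB), ("b", 100 * MB), ("c", 100 * MB), ("d1", 100 * MB)],
    "step-200.bin": [("a", 100 * MB), ("b", 100 * MB), ("c", 100 * MB), ("d2", 100 * MB)],
    "step-300.bin": [("a", 100 * MB), ("b", 100 * MB), ("c", 100 * MB), ("d3", 100 * MB)],
}
logical_mb, dedup_mb = footprint(checkpoints)
print(logical_mb, dedup_mb)  # 1200 600
```

In this made-up case the three checkpoints add up to 1200 MB logically but occupy 600 MB after deduplication, which is the quantity deduplicated billing would be based on.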

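Pre-warming: bringing data close to compute

Buckets live on the Hub, which means global storage by default. But not every workload can afford to pull data from wherever it happens to live: for distributed training and large-scale pipelines, storage location directly affects throughput.

Pre-warming lets you bring hot data closer to the cloud provider and region where your compute runs. Instead of data traveling across regions on every read, you declare where you need it and Buckets make sure it's already there when your jobs start. This is especially useful for training clusters that need fast access to large datasets or checkpoints, and for multi-region setups where different parts of a pipeline run in different clouds.

We are partnering with AWS and GCP to start with, and more cloud providers will follow.

Getting started

You can get a bucket up and running in under 2 minutes with the hf CLI.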

code

curl -LsSf https://hf.co/cli/install.sh | bash
hf auth login

Create a bucket for your project:

code

hf buckets create my-training-bucket --private

Say your training job is writing checkpoints locally to ./checkpoints. Sync that directory into the Bucket:

code

hf buckets sync ./checkpoints hf://buckets/username/my-training-bucket/checkpoints

For large transfers, you might want to see what will happen before anything moves. Use the --dry-run option:

code

hf buckets sync ./checkpoints hf://buckets/username/my-training-bucket/checkpoints --dry-run

You can also save the plan to a file for review and apply it later:

code

hf buckets sync ./checkpoints hf://buckets/username/my-training-bucket/checkpoints --plan sync-plan.jsonl
hf buckets sync --apply sync-plan.jsonl
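
One way to picture the plan-then-apply flow is as a diff between the local directory and a listing of what the Bucket already holds. The sketch below is a toy planner, not the CLI's logic: the `plan_sync` and `write_plan` helpers, the `{"action", "path"}` record shape, and the use of SHA-256 to detect unchanged files are all assumptions, and the real sync-plan.jsonl format is not described in this post.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def plan_sync(local_dir: Path, remote: dict[str, str]) -> list[dict]:
    """Compare local files with a remote {path: sha256} listing and plan uploads."""
    plan = []
    for file in sorted(local_dir.rglob("*")):
        if not file.is_file():
            continue
        rel = file.relative_to(local_dir).as_posix()
        digest = hashlib.sha256(file.read_bytes()).hexdigest()
        action = "skip" if remote.get(rel) == digest else "upload"
        plan.append({"action": action, "path": rel})
    return plan


def write_plan(plan: list[dict], plan_file: Path) -> None:
    """Save one JSON object per line so the plan can be reviewed before applying."""
    with plan_file.open("w", encoding="utf-8") as f:
        for entry in plan:
            f.write(json.dumps(entry) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "checkpoints"
    root.mkdir()
    (root / "step-100.bin").write_bytes(b"old weights")
    (root / "step-200.bin").write_bytes(b"new weights")
    # Pretend step-100.bin is already in the bucket with identical content.
    remote = {"step-100.bin": hashlib.sha256(b"old weights").hexdigest()}
    plan = plan_sync(root, remote)
    plan_path = Path(tmp) / "sync-plan.jsonl"
    write_plan(plan, plan_path)
    plan_lines = plan_path.read_text(encoding="utf-8").splitlines()

print(plan)
```

Separating planning from applying means the expensive step (moving bytes) only happens after someone has had the chance to read the plan file.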

Once done, inspect the Bucket from the CLI:

code

hf buckets list username/my-training-bucket -h

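or browse it directly on the Hub at https://huggingface.co/buckets/username/my-training-bucket

That is the whole loop. Create a bucket, sync your working data into it, check on it when you need to, and save the versioned repo for when something is worth publishing. For one-off operations, hf buckets cp and hf buckets remove are also available.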
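
Before removing old files, a script has to decide which ones are stale. The sketch below shows one possible policy, keeping only the most recent checkpoints by step number. The `step-<n>.bin` naming scheme and the `stale_checkpoints` helper are hypothetical; the function only returns names and leaves the actual deletion to whichever tool you use.

```python
import re


def stale_checkpoints(names: list[str], keep_last: int = 2) -> list[str]:
    """Return checkpoint file names older than the newest `keep_last`, oldest first."""
    pattern = re.compile(r"step-(\d+)\.bin$")
    steps = []
    for name in names:
        match = pattern.search(name)
        if match:  # files that do not look like checkpoints are never selected
            steps.append((int(match.group(1)), name))
    steps.sort()
    if keep_last <= 0:
        return [name for _, name in steps]
    return [name for _, name in steps[:-keep_last]]


listing = ["step-300.bin", "step-100.bin", "notes.txt", "step-200.bin"]
print(stale_checkpoints(listing, keep_last=2))  # ['step-100.bin']
```

Sorting by the parsed step number rather than by file name avoids mistakes such as treating step-1000 as older than step-200.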

Using Buckets from Python

Everything above also works from Python via the huggingface_hub library.

python

from huggingface_hub import create_bucket, list_bucket_tree, sync_bucket

create_bucket("my-training-bucket", private=True, exist_ok=True)
sync_bucket(
    "./checkpoints",
    "hf://buckets/username/my-training-bucket/checkpoints",
)
for item in list_bucket_tree(
    "username/my-training-bucket",
    prefix="checkpoints",
    recursive=True,
):
    print(item.path, item.size)

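This makes it straightforward to integrate Buckets into training scripts, data pipelines, or any service that manages artifacts programmatically. The Python client also supports batch uploads, selective downloads, deletes, and bucket moves for when you need finer control.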
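
As a sketch of what that integration might look like in a training script, the loop below saves a checkpoint every few steps and then triggers a sync. The `run_training` function is hypothetical, and both the save and sync actions are passed in as callables so the example runs offline; in a real job the sync callable could wrap the `sync_bucket` call shown above.

```python
from typing import Callable


def run_training(
    num_steps: int,
    save_every: int,
    save_checkpoint: Callable[[int], None],
    sync: Callable[[], None],
) -> int:
    """Run a dummy training loop and sync after every saved checkpoint."""
    syncs = 0
    for step in range(1, num_steps + 1):
        # ... one optimization step would run here ...
        if step % save_every == 0:
            save_checkpoint(step)  # write the checkpoint to the local directory
            sync()                 # push the local directory to the bucket
            syncs += 1
    return syncs


# Stand-ins that record calls instead of touching disk or the network.
saved_steps: list[int] = []
sync_calls: list[str] = []
total_syncs = run_training(
    num_steps=10,
    save_every=4,
    save_checkpoint=saved_steps.append,
    sync=lambda: sync_calls.append("sync"),
)
print(saved_steps, total_syncs)  # [4, 8] 2
```

Injecting the sync step keeps the training loop testable and makes it easy to swap in a no-op when running locally.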

Bucket support is also available in JavaScript via @huggingface/hub.

Filesystem integration

Buckets also work through HfFileSystem (part of huggingface_hub).

python

from huggingface_hub import hffs

# List files in a bucket directory
hffs.ls("buckets/username/my-training-bucket/checkpoints", detail=False)
# Glob for specific files
hffs.glob("buckets/username/my-training-bucket/**/*.parquet")
# Read a file directly
with hffs.open("buckets/username/my-training-bucket/config.yaml", "r") as f:
    print(f.read())

Because fsspec is the standard Python interface for remote filesystems, libraries like pandas, Polars, and Dask can read from and write to Buckets using the hf:// protocol.

python

import pandas as pd

# Read a CSV directly from a Bucket
df = pd.read_csv("hf://buckets/username/my-training-bucket/results.csv")
# Write results back
df.to_csv("hf://buckets/username/my-training-bucket/summary.csv")

This makes it easy to plug Buckets into existing data workflows without changing how your code reads or writes files.

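From Buckets to versioned repos

Buckets are the fast, mutable place where artifacts live while they are still in motion. Once something becomes a stable deliverable, it usually belongs in a versioned model or dataset repo.

On the roadmap, we plan to support direct transfers between Buckets and repos in both directions: promote final checkpoint weights into a model repo, or commit processed shards into a dataset repo once a pipeline completes. The working layer and the publishing layer stay separate, but fit into one continuous Hub-native workflow.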

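Trusted by launch partners

Before opening Buckets to everyone, we ran a private beta with a small group of launch partners.

A huge thank you to Jasper, Arcee, IBM, and PixAI for testing early versions, surfacing bugs, and sharing feedback that directly shaped this feature.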

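Conclusion and resources

Storage Buckets bring a missing storage layer to the Hub. They give you a Hub-native place for the mutable, high-throughput side of ML: checkpoints, processed data, agent traces, logs, and everything else that is useful before it becomes final.

Because they are built on Xet, Buckets are not just easier to use than forcing everything through Git. They are also more efficient for the kinds of related artifacts AI systems produce all the time. That means faster transfers, better deduplication, and on Enterprise plans, billing that benefits from the deduplicated footprint.

If you already use the Hub, Buckets let you keep more of your workflow in one place. If you come from S3-style storage, they give you a familiar model with better alignment to AI artifacts and a clear path toward final publication on the Hub.

Buckets are included in existing Hub storage plans. Free accounts come with storage to get started, and PRO and Enterprise plans offer higher limits. See the storage page for details.

Read more and try it yourself:

Installation guide
CLI guide and reference
Example Bucket on the Hub
Storage pricing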

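Storage Buckets integrate seamlessly with the existing Hugging Face Hub infrastructure, so you can manage and share Buckets in the same way as other Hub resources such as models, datasets, and Spaces.

Buckets use the Hugging Face Hub access control system. You can make a Bucket public so that anyone can access it, or private so that only specific users or organizations can. You can also set fine-grained permissions, such as read or write, on individual files in a Bucket.

Buckets are also integrated with the Hub's search and discovery features. Adding metadata (title, description, tags, and so on) to a Bucket makes it easier for other users to find on the Hub. Buckets can also be linked from model cards and dataset cards, which keeps related resources together.

Storage Buckets are designed to work with the tools and services across the Hugging Face ecosystem. For example, you can upload files to and download files from a Bucket with the Hugging Face Python library (huggingface_hub), and you can load datasets stored in a Bucket directly with the datasets library.

Buckets are intended to strengthen existing Hub workflows. For example, you can store a large training dataset in a Bucket and use it with Hugging Face training resources to train a model, or store inference results and generated assets in a Bucket and access them from applications hosted on Hugging Face Spaces.

Storage Buckets are an important step in the evolution of the Hugging Face Hub. They let users manage and share data and models more effectively across the whole machine learning lifecycle. We look forward to seeing the community use this feature to build more collaborative and efficient ML workflows.

To start using Buckets, see the Hugging Face Hub documentation, which has detailed instructions for creating, managing, and using Buckets. The Hugging Face forum is also a place where community members can share their experience with Buckets and post questions.

If you have questions or feedback, please get in touch. The Hugging Face team welcomes input from the community as it continues to improve Storage Buckets and the overall Hub experience.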
