約30行のPythonとNVIDIA nvCOMPでチェックポイントコストを削減
NVIDIAの開発者ブログは、大規模言語モデルの学習におけるチェックポイントのストレージコストを削減するために、約30行のPythonコードとNVIDIA nvCOMPライブラリを活用する方法を紹介している。
キーポイント
LLM学習におけるチェックポイントの課題
大規模言語モデルの学習では、モデルの重み、オプティマイザの状態、勾配を含む完全なスナップショットを定期的に保存する必要があり、これがストレージコストの大きな要因となっている。
コスト削減ソリューション
約30行のPythonコードとNVIDIAのnvCOMP圧縮ライブラリを使用することで、チェックポイントのストレージ要件を大幅に削減できる。
実装の簡便性
比較的少ないコード行数で実装可能なため、既存の学習パイプラインへの統合が容易である。
NVIDIA技術の活用
ソリューションはNVIDIAのGPU向け圧縮ライブラリであるnvCOMPに依存しており、同社のハードウェアエコシステム内での最適化を図っている。
影響分析・編集コメントを表示
影響分析
この記事は、LLM学習の実運用コストを削減する実践的な技術ソリューションを提供しており、特に大規模モデルを訓練する組織にとって直接的な経済的メリットをもたらす可能性がある。ただし、NVIDIA固有の技術に依存するため、プラットフォームの汎用性には限界がある。
編集コメント
開発者向けの実践的なチュートリアル記事であり、技術的な深みよりも実装の簡便性をアピールしている点が特徴。NVIDIAエコシステムのユーザーには即戦力となる内容。

大規模言語モデル(LLM)のトレーニングには、定期的なチェックポイントの作成が必要です。モデルウェイト、オプティマイザ状態、勾配を含む完全なスナップショットをストレージに保存することで、トレーニングを中断した場所から再開できるようになります...
原文を表示
Training LLMs requires periodic checkpoints. These full snapshots of model weights, optimizer states, and gradients are saved to storage so training can resume after interruptions. At scale, these checkpoints become massive (782 GB for a 70B model) and frequent (every 15-30 minutes), generating one of the largest line items in a training budget. Most AI teams chase GPU utilization, training throughput, and model quality. Almost none look at what checkpointing is costing them.
This is an expensive oversight. The synchronous checkpoint overhead of a 405B model on 128 NVIDIA Blackwell GPUs alone can cost $200,000 a month. By introducing a lossless compression step implemented with about 30 lines of Python, we can reduce storage costs by $56,000 every month. Mixture of experts (MoE) models save even more. We’ll break down how we got to that calculation and how NVIDIA nvComp can improve checkpointing efficiency in this blog post.
Inside a single checkpoint
Hardware interruptions at 1000+ GPU scale aren’t rare. Meta reported 419 unexpected interruptions across 54 days of Llama 3 training on 16,384 NVIDIA H100 GPUs (~one every 3 hours). This is why most teams checkpoint every 15-30 minutes; it’s load-bearing infrastructure, not optional overhead.
Figure 1. Checkpoint composition
ComponentSizeContentsModel weights (BF16)130 GB70B params × 2 bytesOptimizer state (FP32 momentum + variance)521 GB70B params × (4 + 4) bytesGradients (BF16)130 GBSame shape as weightsTotal per checkpoint782 GBTable 1. Checkpoint composition
This breakdown surprises people who see it for the first time. The optimizer state—AdamW’s first and second moment estimates, both stored in FP32—is 4x larger than the model weights. It’s the bulk of every checkpoint, and why they are far larger than a deployed model.
By checkpointing every 30 minutes, standard practice for fault tolerance, that’s 48 checkpoints per day. Over a month of continuous training:
782 GB × 48/day × 30 days = 1.13 PB written to storage per month
And here’s the part most teams miss. During every single synchronous checkpoint write, all 8 GPUs sit completely idle. Nothing overlaps with a checkpoint save—the training loop blocks until the last byte hits storage.
At $4.40/GPU/hour (representative of on-demand Blackwell GPU cloud pricing) and 5 GB/s shared storage throughput (typical for Lustre or GPFS over InfiniBand), we do the math to figure out the cost of idle GPUs during these waiting periods:
Write time per checkpoint: 782 GB / 5 GB/s = 156.4 seconds (~2.6 minutes)
Total wait time per month: 156.4s × 48/day × 30 days = 225,216 seconds = 62.6 hours
Cost of idle GPUs during these waiting periods: 62.6 hours × 8 GPUs × $4.40/GPU/hour = $2,203/month
The wait time for synchronous checkpoints adds up to over $2,200/month—before counting storage fees.
Scale that to a 64-GPU cluster, and the monthly cost jumps to over $17,500. At 128 GPUs, training a 405B model, idle costs exceed $200,000/month. The idle-GPU cost dominates storage fees by an order of magnitude.
Asynchronous checkpointing eases part of the problem. However, framework support is still maturing and not widely adopted, plus managing memory watermarks remains a challenge. A complementary technique that can be easily used is checkpoint compression, which has the added benefit of reducing cold start time when the state is being restored, since cold start is serial in nature.
NVIDIA nvCOMP introduces GPU-accelerated compression
The core idea is simple. Compress the checkpoint before it leaves GPU memory. No CPU roundtrip. No extra data movement. The data is already on the GPU, so we can compress it there.
NVIDIA nvCOMP is a GPU-accelerated lossless compression library that does exactly this. By providing a single library with support for both standard algorithms, like Zstandard (ZSTD), and highly optimized, GPU-specific formats, like gANS, it tackles data bottlenecks natively on the device. Developers can easily integrate high-throughput compression directly into their Python workflows (such as PyTorch or TensorFlow).
Measured checkpoint compression ratios
We fine-tuned two model architectures (dense transformer and mixture of experts) for 50 steps, saved full training checkpoints (weights + AdamW optimizer state + gradients), and compressed every component with nvCOMP on NVIDIA H200 and Blackwell GPUs. The compression ratios depend on data, not hardware—they’re identical across all GPUs.
ZSTD is a widely adopted general-purpose compression algorithm developed by Meta that balances strong compression ratios with reasonable speed. It’s the same algorithm behind ZSTD on the Linux command line and is used extensively in databases, file systems, and data pipelines. Asymmetric numeral systems (ANS) is a modern entropy coding technique that nvCOMP implements as a GPU-native codec (gANS) optimized for raw throughput. Both are lossless and exploit statistical patterns in 1B/2B words distributions (entropy coding) rather than just matching repeated byte sequences.
The key difference is the trade-off. ZSTD squeezes slightly higher compression ratios (1-2% higher than ANS in our benchmarks) but compresses at ~16-19 GB/s on Blackwell GPUs. ANS achieves nearly the same ratios at 10x the throughput (181-190GB/s). As we’ll show below, which to pick depends on how fast your storage is.
Figure 2. Checkpoint compression ratios for dense and MoE models
Checkpoint component% of CkptZSTDgANSWhyBF16 model weights17%1.27-1.28×1.25-1.27×High-entropy trained floatsFP32 optimizer momentum33%1.25-1.39×1.23-1.37×Shared exponent bytes in moving averagesFP32 optimizer variance33%1.32-1.48×1.30-1.45×Non-negative, small valuesBF16 gradients17%1.25-1.46×1.23-1.44×Architecture-dependent sparsityFull checkpoint (dense)100%~1.27×~1.25×Dense: no natural sparsityFull checkpoint (MoE)100%~1.40×~1.39×MoE: expert routing creates gradient sparsityTable 2. Checkpoint compression ratios for dense and MoE models
The ranges reflect the key finding that compression depends on model architecture, not hardware. Not all compression algorithms work on floating-point tensors. Byte-level codecs like LZ4 and Bitcomp look for repeated byte sequences—like finding duplicate words in a document. But trained neural network parameters look essentially random at the byte level, so these codecs find almost nothing to compress (~1.00× on dense checkpoints in our benchmarks).
ZSTD and ANS use entropy coding, which exploits statistical patterns in how frequently certain byte values occur—even when no exact sequences repeat. This is why they achieve 1.25-1.40x with the same data, where byte-level codecs achieve nothing. Between the two, ZSTD delivers slightly stronger ratios (1-2% higher in our benchmarks), while ANS matches it closely at 10x the compression throughput. This trade-off matters when storage gets faster, as we’ll show.
Dense transformers (Llama, GPT, Qwen): All parameters participate in every forward pass. ~0% exact zeros → ~1.25× ANS, ~1.27× ZSTD.
MoE models (Mixtral, DeepSeek, OLMoE): Only a subset of experts activate per token. 12-14% exact zeros → ~1.39× ANS, ~1.40× ZSTD.
Our benchmarks use BF16 weights and FP32 optimizer state (AdamW), which is standard for most large-scale training today. Teams using FP8 training (e.g., with NVIDIA Transformer Engine) will see lower compression ratios, as reduced-precision data carries higher entropy and less statistical redundancy for lossless compression to exploit. At FP4 (NVFP4), quantization already removes most redundancy—lossless compression provides negligible additional benefit (~5%). The optimizer state, however, remains in FP32 regardless of weight precision, and that’s where the bulk of checkpoint size and compression savings come from.
The math: How nvCOMP saves money
Using our 70B dense model with measured ZSTD 1.27× ratio:
Without nvCOMP: Write 782 GB to disk at 5 GB/s → 156 seconds of GPU wait
With nvCOMP ZSTD (1.27x): Compress 782 GB at ~16 GB/s (~49s), write 616 GB at 5 GB/s (123s) → ~123 seconds of GPU wait
Why does the 49-second compression time disappear? Because compression and storage writes can be pipelined: while one chunk writes to disk, the next chunk compresses on the GPU. As long as the codec compresses faster than storage can absorb the output, the compression step is fully hidden behind the write — the GPU wait equals the write time alone. At 5 GB/s shared storage, ZSTD at 16 GB/s is 3× faster than the write, so compression overlaps completely. The wait drops from 156 to 123 seconds—21% smaller files, 33 seconds faster per checkpoint. Over a month: 48 × 30 = 47,520 fewer seconds of idle time — 13+ hours reclaimed. MoE at 1.40× does even better: 29% smaller, 44 seconds faster, 17+ hours reclaimed.
When storage gets faster, codec throughput matters
Your storage speed depends on infrastructure: 2-10 GB/s for shared network filesystems (Lustre, GPFS, NFS), or 15-50+ GB/s for GPUDirect Storage (GDS) with local NVMe.
At faster storage, codec throughput determines whether compression helps or hurts:
Figure 3. Storage speed and codec throughput
Storage speedNo compressionZSTD (1.27×, ~16 GB/s)ANS (1.25×, ~181 GB/s)Winner5 GB/s156s123s (−21%)125s (−20%)ZSTD15 GB/s52s49s (−6%)42s (−20%)ANS25 GB/s31s49s (+57%) ⚠️25s (−20%)ANSTable 3. Storage speed and codec throughput
To illustrate, we compare checkpoint write times for a 70B dense model (782 GB) on a Blackwell GPU across three storage speeds. Because compression and writing are pipelined — one chunk compresses while the previous chunk writes to disk — the total GPU wait time equals whichever stage is slower: pipelined wait = max(compress time, write time).
In table 3 ~16 GB/s on a Blackwell GPU, ZSTD’s compression step becomes the bottleneck, and wait time increases. ANS at 181-190 GB/s never hits this wall. At 25 GB/s with 64 Blackwell GPUs, ANS nets ~$2,800/month while ZSTD nets only ~$240/month. For shared file systems (5-10 GB/s), ZSTD remains the right default. For GDS/high-performance storage at 15+ GB/s, ANS is the clear winner.
Measured savings in Figure 4 assume GPU wait reduction + storage at $0.14/GB/month, 96 retained checkpoints, and 5 GB/s shared storage:
Figure 4. Measured monthly savings with nvCOMP checkpoint compression
Dense models (measured on Qwen2.5-3B):
ModelGPUsCkptZSTD (1.27×)ANS (1.25×)Llama 3 8B8× Blackwell GPUs89 GB70 GB / ~$300/mo71 GB / ~$290/moLlama 3 70B8× Blackwell GPUs782 GB616 GB / ~$2,700/mo626 GB / ~$2,500/moLlama 3 70B64× Blackwell GPUs782 GB616 GB / ~$6,000/mo626 GB / ~$5,600/moLlama 3 405B128× Blackwell GPUs4,529 GB3,566 GB / ~$56,000/mo3,623 GB / ~$53,000/moTable 4. Measured monthly savings with nvCOMP checkpoint compression for dense models
MoE models (measured on OLMoE-1B-7B):
ModelGPUsCkptZSTD (1.40×)ANS (1.39×)Mixtral 8x22B (141B)64× Blackwell GPUs1,575 GB1,125 GB / ~$16,000/mo1,133 GB / ~$16,000/moDeepSeek-V3 (671B)256× Blackwell GPUs7,490 GB5,350 GB / ~$222,000/mo5,389 GB / ~$218,000/moTable 5. Measured monthly savings with nvCOMP checkpoint compression for MoE models
The savings scale with model size (bigger checkpoints = more to compress) and GPU count (more GPUs idle during waits = higher cost per second of wait time). The second factor is what makes this particularly brutal at scale—256 idle Blackwell GPUs cost $1,126/hour. Every second you shave off a checkpoint write saves money.
The industry is moving toward MoE architectures (like DeepSeek-V3, Mixtral, and Grok), which produce larger and more compressible checkpoints. Compression isn’t a nice-to-have anymore, and adding a few lines of Python code can save costs.
Integration: ~30 Lines of Python
The integration effort is minimal. Here’s a drop-in replacement for torch.save / torch.load:
CUDA 13.x (Blackwell):pip install nvidia-nvcomp-cu13 cupy-cuda13x# CUDA 12.x (Hopper):pip install nvidia-nvcomp-cu12 cupy-cuda12x
import torchimport cupy as cpfrom nvidia import nvcompcodec = nvcomp.Codec(algorithm="Zstd") def _compress_state(obj): if isinstance(obj, torch.Tensor) and obj.is_cuda: flat = obj.contiguous().reshape(-1).view(torch.uint8) compressed = codec.encode(nvcomp.as_array(cp.from_dlpack(flat))) return {"__nvcomp__": True, "data": bytes(compressed.cpu()), "shape": list(obj.shape), "dtype": str(obj.dtype)} if isinstance(obj, dict): return {k: _compress_state(v) for k, v in obj.items()} if isinstance(obj, (list, tuple)): return type(obj)(_compress_state(v) for v in obj) return objdef _decompress_state(obj, device): if isinstance(obj, dict) and obj.get("__nvcomp__"): decompressed = codec.decode(obj["data"]) dtype = getattr(torch, obj["dtype"].replace("torch.", "")) return torch.from_dlpack(cp.asarray(decompressed)).view(dtype).reshape(obj["shape"]) if isinstance(obj, torch.Tensor): return obj.to(device) if isinstance(obj, dict): return {k: _decompress_state(v, device) for k, v in obj.items()} if isinstance(obj, (list, tuple)): return type(obj)(_decompress_state(v, device) for v in obj) return objdef save_compressed_checkpoint(model, optimizer, path): state = {"model": model.state_dict(), "optimizer": optimizer.state_dict()} torch.save(_compress_state(state), path)def load_compressed_checkpoint(path, device="cuda"): return _decompress_state( torch.load(path, map_location="cpu", weights_only=True), device)
Swap torch.save for save_compressed_checkpoint, and torch.load for load_compressed_checkpoint.
That’s it. No changes to your training loop, model code, or optimizer configuration.
If you’re using a training framework with custom checkpoint hooks (DeepSpeed, Megatron), the same pattern applies. Walk the state dict, compress GPU tensors, and serialize.
For teams using NVIDIA GPUDirect Storage (GDS), there’s an even faster path. nvCOMP can compress directly into a GDS buffer, writing compressed data straight from GPU memory to NVMe storage. There’s zero CPU involvement.
NVIDIA Blackwell Decompression Engine
NVIDIA Blackwell GPUs include a dedicated Blackwell Decompression Engine (DE) that decompresses LZ4, Snappy, and Deflate at up to 280 GB/s with zero SM overhead. However, the byte-level codecs achieve ~1.00x on floating-point tensors. For checkpoint compression, the entropy-based codecs (ANS and ZSTD) on SMs deliver the real savings. During restores, GPUs are idle and waiting for data before resuming training. SM availability isn’t a constraint, and ANS on SMs delivers comparable throughput (247-264 GB/s) while producing significantly smaller files.
Get started
Checkpoint compression is one of the highest-ROI optimizations you can add to your training pipeline and one of the easiest:
Install nvCOMP: pip install nvidia-nvcomp-cu13 (Blackwell) or pip install nvidia-nvcomp-cu12 (Hopper)
Try it now: Drop the save_compressed_checkpoint / load_compressed_checkpoint functions from this post into your training loop. No other changes needed.
Explore the code: Samples and benchmarks on GitHub.
Read the docs: nvCOMP documentation
Go deeper: For teams using GPUDirect Storage, nvCOMP can compress directly into GPU buffers that GDS writes to NVMe with zero CPU involvement. For details, see:
https://docs.nvidia.com/cuda/nvcomp/
https://docs.nvidia.com/gpudirect-storage/api-reference-guide/
Your checkpoints are the largest files in your training pipeline. Compress them.
関連記事
プロテオーム規模でのタンパク質構造予測を加速する方法
NVIDIAが、タンパク質が単体ではなく相互作用によって生物学的プロセスを制御する点に着目し、プロテオーム規模でのタンパク質構造予測を高速化する手法を提案している。
Kubernetes上でSlurmを使用した大規模GPUワークロードの実行
NVIDIAが、オープンソースのクラスタ管理システムSlurmをKubernetesと統合し、大規模GPUワークロードを効率的に管理・スケジューリングする方法を紹介している。SlurmはTOP500システムの65%以上で採用されている実績を持つ。
NVIDIA GPUに対する新たなRowhammer攻撃がシステム完全制御を可能に
セキュリティ研究者が、NVIDIA GPUを標的とした新たなRowhammer攻撃を実証し、メモリ破損からシステム全体の侵害にエスカレートできることを示した。