階層的で再現可能なレシピでKubernetesのGPUインフラを検証する
NVIDIAはGPUクラスターの構築・運用コストを削減するため、検証済み構成を標準化するオープンソースプロジェクト「AI Cluster Runtime」を発表した。
キーポイント
構成の自動化と検証
クラスタ設定をクリティカルパスから外し、バージョンロックされたYAMLレシピで環境依存による手動設定コストと失敗リスクを排除する。
レイヤー型アーキテクチャ
ベース、環境、意図(トレーニング等)、ハードウェアの4層で構成され、より具体的な値が優先順位に基づきマージされる柔軟な設計。
CLIツールによる運用
`aicr snapshot`で現行状態をスナップショットし、`aicr recipe`でターゲット環境に最適化されたレシピを生成・展開するワークフローを提供。
検証済み構成の公開
実際のGPUクラスターで検証されたドライバー、ランタイム、カーネルモジュールの組み合わせを標準化し、クラウドとオンプレ両方のAIファクトリに対応。
段階的検証プロセス
デプロイ前の環境準備確認と、デプロイ後のCNCF準拠チェック(DRAやガンスケジューリングなど)の2段階で、レシピの制約とコンポーネント健全性を検証する。
自動化されたパッケージ化と多様なデプロイ対応
レシピは依存関係に基づいて順序付けられたデプロイ可能パッケージに変換され、スクリプト実行、ArgoCDマニフェスト生成、またはオフライン環境向けのOCIイメージ公開などに対応する。
継続的更新とバージョン管理による安全な運用
レシピは内部検証パイプラインを通じて継続的に更新され、バージョン管理によりアップグレード前の差分確認が可能であり、コミュニティ貢献や組織固有のプライベート拡張もサポートしている。
影響分析・編集コメントを表示
影響分析
本プロジェクトは、GPUクラスターの構築に費やす莫大な手動設定コストと環境依存による失敗リスクを、標準化されたレシピ方式で解決する。これにより、AIエンジニアはインフラのセットアップに悩まされず、モデルトレーニングや推論の本質的な最適化に集中できる。業界全体としてKubernetes上でのGPU運用のハードルが下がり、AIインフラの標準化と再現性が加速すると期待される。
編集コメント
NVIDIAが公開したAI Cluster Runtimeは、GPUクラスターの「設定の呪い」を解消する実用的な解決策だ。オープンソース化によりコミュニティ主導での改善が見込まれるため、Kubernetes上でAI基盤を運用するチームは早期に検証・導入を検討すべきだろう。
imageKubernetes上で動作するあらゆるAIクラスターには、低レベルのドライバーやカーネル設定から高レベルのオペレーターまで、全体が連携して動作する完全なソフトウェアスタックが必要です。
原文を表示
Every AI cluster running on Kubernetes requires a full software stack that works together, from low-level driver and kernel settings to high-level operator and workload configurations. You get one cluster working, and spend days getting the next one to match. Upgrade a component, and something else breaks. Move to a new cloud and start over. AI Cluster Runtime is a new open-source project designed to remove cluster configuration from the critical path. It publishes optimized, validated, and reproducible Kubernetes configurations as recipes you can deploy onto your clusters.
How AI Cluster Runtime works
To support GPU clusters across cloud and on-premises AI factories, NVIDIA validates specific combinations of drivers, runtimes, operators, kernel modules, and system settings for AI workloads. AI Cluster Runtime publishes those results as recipes. These version-locked YAML files capture which components were tested, the versions, and the configuration values, for a given environment. Recipes also carry constraints (minimum Kubernetes version, required OS, kernel version) and a computed deployment order based on component dependencies. Every recipe is validated against real clusters and reproducible across environments.
You can browse recipes directly in the repository, query them through a REST API, or use the aicr CLI to generate one for your target environment and render it into Helm charts and manifests ready for deployment.
Capture your cluster state
If you have a running cluster, you can snapshot its state before generating a recipe. This captures OS release, kernel version, GPU hardware and driver, Kubernetes version, and installed operators.
aicr snapshot \ --node-selector nodeGroup=gpu-worker \ --output cm://gpu-operator/aicr-snapshot
This deploys a short-lived Job onto a target node, collects system measurements, and writes the results to a ConfigMap or local file. The snapshot becomes the baseline that the validation checks against.
Generate a recipe
The recipe command takes a description of your target environment and matches it against a library of validated overlays to produce a single recipe with exact component versions and settings.
aicr recipe \ --service eks \ --accelerator h100 \ --intent training \ --os ubuntu \ --platform kubeflow \ --output recipe.yaml
Recipes are composed from layers rather than maintained as monolithic configurations. These include:
Base layers, which define universal components and default versions.
Environment layers, which add Kubernetes-specific components—for example, the EBS CSI driver and EFA plugin on Amazon EKS.
Intent layers, which configure training-optimized component settings and NVIDIA Collective Communications Library (NCCL) tuning parameters.
Hardware layers, which pin driver versions and enable features such as CDI and GDRCopy for specific accelerators.
Each layer is added in order, with more specific values taking precedence over general ones.
A fully specialized recipe (such as NVIDIA Blackwell + EKS + Ubuntu + training + Kubeflow) carries up to 268 configuration values across 16 components. A generic EKS query returns 200. The delta between training and inference intent can swap 5 components and change 41 configuration values, producing completely different deployment stacks from the same base. This kind of variance is exactly why people end up hand-tuning clusters.
Validate
Validation runs in phases. Prior to deploying anything, a readiness check compares recipe constraints against your snapshot: Kubernetes version, OS, kernel, and GPU hardware.
aicr validate \ --recipe recipe.yaml \ --phase readiness
After deployment, subsequent phases validate component health and conformance. The conformance phase checks against standards like the CNCF’s Certified Kubernetes AI Conformance Program, verifying requirements for dynamic resource allocation (DRA), gang scheduling, and job-level networking.
Create a bundle
The bundler turns a recipe into deployable artifacts.
aicr bundle \ --recipe recipe.yaml \ --system-node-selector nodeGroup=system-pool \ --accelerated-node-selector nodeGroup=gpu-worker \ --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule \ --output ./bundles
The output is a directory with one folder per component, each containing values.yaml, integrity checksums, a README, and optional custom manifests.
Components are ordered by their dependency graph (for example, cert-manager before NVIDIA GPU Operator, and the NVIDIA GPU Operator before Kubeflow Trainer). Deploy using the included deploy.sh script, generate ArgoCD Application manifests with --deployer argocd, or publish bundles as OCI images for air-gapped environments.
Stay current with AI Cluster Runtime recipes
Recipes update as the NVIDIA internal validation pipelines run. New component releases, driver updates, and kernel parameter changes all flow into published recipes as they are tested. When a particular NCCL setting improves Blackwell throughput, that lands in the next recipe version.
Because every recipe is versioned, you can diff your current deployment against the latest validated configuration and see exactly what changed before upgrading.
Contributing recipes
Designed for collaboration from the start, the project enables CSPs, OEMs, platform teams, and individual operators to help validate diverse hardware, OS, and Kubernetes distribution combinations.
Contribute a recipe. Copy an existing overlay, update the criteria and configuration for your environment, run make test, and open a PR. The recipe development guide walks through the process.
Extend privately. The --data flag overlays external recipe directories at runtime, so you can maintain organization-specific configurations alongside public ones without forking.
File issues. Share which environments matter to you. That directly shapes what gets validated next.
Get started with AI Cluster Runtime
AI Cluster Runtime is available on GitHub as an alpha release. It includes the aicr CLI, an API server, a cluster agent, and validated recipes covering training and inference workloads on Kubernetes (e.g., Amazon EKS) with NVIDIA H100 and NVIDIA Blackwell accelerators running Ubuntu 24.04.
Training recipes target Kubeflow Trainer and inference recipes target NVIDIA Dynamo. Every release includes SLSA Level 3 provenance, signed SBOMs, and image attestations.
Projects are in development to expand AI Cluster Runtime across additional platforms, accelerators, and workload types. Tune in to the Operating Cloud AI Factories at Scale session at NVIDIA GTC 2026 to learn more about AI Cluster Runtime and other products that can scale AI operations.
関連記事
NVIDIA Cosmos World Foundation Modelsによる合成データのスケーリングと物理AI推論
ロボット動画生成のための NVIDIA Cosmos Predict 2.5 の LoRA/DoRA を用いたファインチューニング(9 分読了)
ロボット動画生成のための NVIDIA Cosmos Predict 2.5 の LoRA/DoRA を用いたファインチューニング
今日のまとめ
AI日報で今日の重要ニュースをまとめ読み