Kubernetes上でSlurmを使用した大規模GPUワークロードの実行
NVIDIAの開発者ブログは、大規模GPUワークロードをKubernetes上でSlurmを用いて管理・スケジューリングする方法について解説し、高性能計算とクラウドネイティブ技術の統合の実用性を示している。
キーポイント
SlurmとKubernetesの統合による大規模GPU管理
高性能計算で広く使われるオープンソースのジョブスケジューラSlurmを、コンテナオーケストレーションプラットフォームKubernetes上で動作させることで、大規模なGPUワークロードの効率的な管理を実現する。
TOP500システムの実績に基づく信頼性
Slurmは世界のスーパーコンピュータTOP500システムの65%以上で採用されており、大規模で複雑な計算ジョブのスケジューリングにおける実績と信頼性が高い。
クラウドネイティブ環境でのHPCワークロード実行
従来のオンプレミスHPCクラスタ環境だけでなく、Kubernetesを基盤とするクラウドネイティブ環境でも高性能計算ワークロードを柔軟に実行できる可能性を示している。
影響分析・編集コメントを表示
影響分析
この記事は、高性能計算とクラウドネイティブ技術の融合という重要なトレンドを示しており、AI研究開発における大規模計算リソースの柔軟な活用を促進する可能性がある。特に、既存のHPCインフラと現代のクラウド技術を橋渡しすることで、AIワークロードの効率化とスケーラビリティ向上に寄与する。
編集コメント
技術的な統合方法に焦点を当てた実践的な内容で、AIインフラエンジニアにとって有用な情報を提供している。ただし、具体的なベンチマークや導入事例の詳細が不足している点が惜しまれる。
imageSlurmは、Linux向けのオープンソースのクラスタ管理およびジョブスケジューリングシステムです。TOP500システムの65%以上でジョブスケジューリングを管理しています。多くの組織では...
原文を表示
Slurm is an open source cluster management and job scheduling system for Linux. It manages job scheduling for over 65% of TOP500 systems. Most organizations running large-scale AI training have years of investment in Slurm job scripts, fair-share policies, and accounting workflows. The challenge is getting Slurm scheduling capabilities onto Kubernetes—the standard platform for managing GPU infrastructure at scale—without maintaining two separate environments.
Slinky, an open source project developed by SchedMD (now part of NVIDIA), takes two approaches to this integration:
slurm-bridge brings Slurm scheduling to native Kubernetes workloads, allowing Slurm to act as a Kubernetes scheduler for pods
slurm-operator runs full Slurm clusters on Kubernetes infrastructure, managing the complete lifecycle of Slurm daemons as pods
This post focuses on the slurm-operator, which is how NVIDIA runs Slurm on Kubernetes for large-scale GPU training clusters. It walks through the architecture of the operator and how it maps Slurm daemons to Kubernetes primitives, then covers deployment—including how Slinky slurm-operator integrates with your existing infrastructure. It also covers the Kubernetes ecosystem integrations that make this model practical. Finally, we share lessons from running Slinky in production at NVIDIA on clusters with over 1,000 GPU worker nodes and 8,000+ GPUs.
How does Slinky slurm-operator work?
Slinky slurm-operator represents each Slurm component (slurmctld for scheduling, slurmdbd for accounting, slurmd for compute workers, slurmrestd for API access) as a Kubernetes Custom Resource Definition (CRD). A Slurm cluster is defined using Custom Resources, and Slinky creates containerized Slurm daemons running in their own pods, configured to belong to their respective cluster.
Figure 1. Slinky CRDs and their referential relationships
Slinky ensures high availability (HA) of the Slurm control plane (slurmctld) through pod regeneration, with no need for the Slurm native HA mechanism. Configuration changes propagate automatically: Kubernetes synchronizes mounted configuration (ConfigMaps and Secrets) into the control plane pod, which detects the changes and propagates the new configuration to its workers (slurmd) with zero scheduler downtime.
Using Slurm native OpenMetrics support and Prometheus monitoring (since Slurm v25.11), workers can be autoscaled through the HorizontalPodAutoscaler (HPA) based on cluster metrics and your desired scaling policy, from a single pod to every available worker node. On scale-in, Slinky fully drains Slurm nodes before terminating pods, ensuring running workloads complete first. Slinky prioritizes pods whose workloads will complete soonest for scale-in. The same drain-before-terminate process applies when rolling out new worker pod images (updated Slurm versions or OS patches, for example), so upgrades do not interrupt running jobs.
How to deploy Slinky slurm-operator
Slurm clusters deployed with Slinky on Kubernetes work similarly to noncontainerized Slurm deployments. Slinky slurm-operator automatically enables the Slurm features required for containerized operation:
configless mode for config distribution without shared filesystems
dynamic nodes so workers register on startup without being predefined in slurm.conf
auth/slurm with use_client_ids for cluster-wide user authentication without per-node identity services
Figure 2. A full Slurm cluster deployed on Kubernetes with Slinky, including login pods, worker pod autoscaling, job accounting, and integration with identity and database services
Slinky works with your existing database and identity infrastructure. Slurm accounting (slurmdbd) connects to any MySQL- or MariaDB-compatible database, whether running within the cluster or as an external managed service. Identity management for login pods works through SSSD, so Active Directory, LDAP, or any other supported backend integrates without changes. Slurm 25.11 allows slurmd to enforce resource isolation (cgroups v2) within its container. This means mult-user concurrent workloads on the same worker are fully isolated.
Slinky provides reference container images and Docker files for each Slurm component. You can use these directly, extend them with your own libraries and dependencies, or build entirely custom images as long as they include a functioning Slurm installation.
What is the benefit of running Slurm on Kubernetes?
The operational payoff of running Slurm on Kubernetes comes from the ecosystem. Rather than building and maintaining separate toolchains for GPU management, monitoring, networking, and node lifecycle, you can use the Kubernetes tooling that already exists for these problems. Platform teams manage clusters with declarative YAML, Helm deployments, rolling updates, and Prometheus or Grafana for observability.
Automate GPU management with the NVIDIA GPU Operator
The NVIDIA GPU Operator automates GPU driver installation, container runtime configuration, and device plugin deployment, giving worker pods automatic access to GPUs.
The NVIDIA GPU Operator also deploys DCGM Exporter, a GPU telemetry collector. With Slinky DCGM integration and HPC job mapping support, you can enable per-job GPU metrics labeled with Slurm job IDs, giving you workload-level GPU monitoring across your cluster.
To enable this, add the following to the NVIDIA GPU Operator Helm values:
dcgmExporter: hpcJobMapping: enabled: true directory: /var/lib/dcgm-exporter/job-mapping
Worker pods mount the same directory, allowing Slurm prolog/epilog scripts (which run at job start and completion) to write job mapping files that DCGM Exporter picks up automatically.
Multinode NVIDIA NVLink with ComputeDomains
For GPU architectures like NVIDIA GB200 NVL72, where GPUs communicate across nodes through multinode NVIDIA NVLink, Slinky enables access to ComputeDomains. This is a Kubernetes abstraction provided by the NVIDIA DRA driver that dynamically manages Internode Memory Exchange (IMEX) domains for high-bandwidth GPU-to-GPU connectivity.
In NVIDIA deployments, Slinky uses a cluster-wide ComputeDomain (numNodes: 0) that automatically adds and removes nodes from the IMEX domain as they join or leave the cluster. This provides:
Block topology awareness: Slinky integrates with Topograph, which automatically discovers the GPU block topology of your cluster using Dynamic Resource Allocation (DRA) and labels published by the NVIDIA GPU Operator.
IMEX plugin: Slurm is configured with SwitchType = switch/nvidia_imex to use NVIDIA IMEX for cross-node GPU communication.
Topology-aware scheduling: Slurm 25.11 supports TopologyParam=BlockAsNodeRank with TopologyPlugin=topology/block, ensuring allocations are sorted so applications can discover segments by node rank.
Distributed training jobs achieve full NVLink bandwidth across node boundaries while ComputeDomains automatically create and tear down IMEX domains as workloads run to completion.
Bidirectional state synchronization
Slinky synchronizes state between Kubernetes and Slurm in both directions. Slurm node states (Idle, Allocated, Mixed, Down, Drain, Completing, Not Responding) are reflected as pod conditions, giving operators visibility into Slurm state through standard Kubernetes tooling.
When you take a node out for maintenance using kubectl drain, the drain state and reason automatically sync to Slurm. Pods running active workloads are protected by PodDisruptionBudgets, so jobs in progress are not accidentally evicted.
Slinky slurm-operator at scale
NVIDIA runs Slinky slurm-operator in production across multiple clusters, with some deployments scaling to over 8,000 GPUs. These clusters run large-scale LLM training and multinode inference workloads daily. GPU communication benchmarks (NCCL all-reduce and all-gather) match the performance of noncontainerized Slurm deployments, with no measurable impact from the Kubernetes layer Slinky runs on. New clusters go from zero to running jobs in hours using Helm charts and declarative config, with no manual provisioning. At this scale, the operational benefits of running Slurm on Kubernetes compound with:
Unified monitoring: Prometheus scrapes Slurm metrics (jobs, nodes, partitions, scheduler latency) alongside standard Kubernetes metrics, with Grafana dashboards showing both in a single view. No separate monitoring stack for HPC.
Automated remediation: When health checks flag a node as unhealthy, the state syncs from Kubernetes to Slurm automatically, with the same reason visible in both systems. Slurm drains the node, jobs reschedule to healthy hardware, and teams no longer need to coordinate between K8s and Slurm tooling separately.
Nondisruptive rolling updates: Updating hundreds of worker pod images used to require downtime. With PodDisruptionBudgets protecting running jobs, we now roll out Slurm version updates and OS patches while training jobs continue uninterrupted on the remaining capacity.
Maintenance coordination: Standard Kubernetes node drain syncs to Slurm. Jobs complete gracefully, maintenance proceeds, and bringing the node back online restores it to the Slurm cluster automatically.
Familiar tooling: Platform teams use kubectl for day-to-day operations. Pod conditions expose Slurm states (Idle, Allocated, Drain) without needing Slurm CLI access, which means on-call engineers can triage Slurm issues with the same tools they use for everything else.
One constraint to be aware of: Slinky’s slurm-operator currently assumes one worker pod per node. If your workloads are exclusively single-node Slurm jobs, this over-provisions relative to what you need, and a lighter-weight integration (or slurm-bridge) may be a better fit. The Slinky deployment model pays off in multinode job scheduling.
Slinky slurm-operator v1.1.0 release highlights
The recently released slurm-operator v1.1.0 addresses several gaps that matter for production deployments:
Dynamic topology support: Prior to v1.1.0, Slurm topology awareness required static configuration. In v1.1.0, dynamic nodes will register with their topology based on the Kubernetes node the worker pod runs on, which means topology-aware scheduling works correctly as pods move across the cluster.
DaemonSet-style worker pod scaling: Prior to v1.1.0, worker pods scaled through replica count. DaemonSet-style scaling ties pods to their nodeSelector instead, making Slurm-to-Kubernetes node mapping statically 1:1. This simplifies operations for clusters where every GPU node should always run a Slurm worker.
Unregistered worker pod remediation: Worker pods that are healthy but not registered to their Slurm cluster are automatically recreated, closing a gap in self-healing behavior.
In the longer term, the team is working on graceful Slurm cluster upgrades, planned outage workflows, configuration rollback, and structured daemon logging. The goal is to make the Slinky operational model match what platform teams expect from any production Kubernetes workload: declarative, self-healing, and observable.
Get started with the Slinky slurm-operator
Slinky is open source and available today. Install the slurm-operator through Helm, define your Slurm cluster as a Custom Resource, and you can have jobs running on Kubernetes in under an hour. The slurm-operator Quick Start Guide walks you through the full setup.
Kubernetes is the right infrastructure substrate for GPU computing, and Slurm is an extremely powerful job scheduling layer for the workloads running on top of it. Slinky makes that combination work seamlessly in production. We plan to continue investing in making it simple to run large-scale AI training infrastructure. If you have questions or want to share what you’re building, visit SlinkyProject on GitHub.
関連記事
プロテオーム規模でのタンパク質構造予測を加速する方法
NVIDIAが、タンパク質が単体ではなく相互作用によって生物学的プロセスを制御する点に着目し、プロテオーム規模でのタンパク質構造予測を高速化する手法を提案している。
約30行のPythonとNVIDIA nvCOMPでチェックポイントコストを削減
NVIDIAが、LLM学習時のチェックポイント保存コストを削減するPythonスクリプトを公開した。約30行のコードでモデル重み・オプティマイザ状態・勾配の圧縮保存を実現し、ストレージコストとI/O負荷を低減できる。
NVIDIA GPUに対する新たなRowhammer攻撃がシステム完全制御を可能に
セキュリティ研究者が、NVIDIA GPUを標的とした新たなRowhammer攻撃を実証し、メモリ破損からシステム全体の侵害にエスカレートできることを示した。