NVIDIA CCCLにおける浮動小数点演算の決定性制御
NVIDIAは、AI/機械学習やHPC分野における再現性の高い計算を実現するため、CUDA C++ Core Libraries (CCCL) における浮動小数点演算の決定性制御機能を提供している。
キーポイント
決定性計算の重要性
同じ入力データに対して複数回実行しても同じビット単位の結果を生成する決定性計算は、AI/機械学習のトレーニングやデバッグ、科学的シミュレーションにおいて再現性を確保するために重要である。
CCCLにおける決定性制御機能
NVIDIAのCUDA C++ Core Libraries (CCCL) は、浮動小数点演算の決定性を制御する機能を提供しており、開発者が計算の再現性を管理できるようにしている。
実用的な利点と適用分野
この機能は、大規模な分散トレーニングや複雑なHPCワークロードにおいて、結果の一貫性を保証し、デバッグや検証プロセスを効率化する実用的な価値を持つ。
影響分析・編集コメントを表示
影響分析
この記事は、AI開発や科学計算における再現性の重要性を強調し、NVIDIAが提供する決定性制御機能がこれらの分野の信頼性と効率性を向上させる可能性を示している。特に、大規模な分散システムや複雑な計算ワークロードにおいて、結果の一貫性を確保する技術的基盤として注目される。
編集コメント
技術的な深みがあり実用性も高いが、NVIDIA開発者ブログからの情報提供という性質上、業界全体を変えるような重大ニュースとは言い難い。AI/HPC開発者にとっては有用なアップデート。
image同じ入力データを用いて複数回実行した際、ビット単位で同一の結果が得られる場合、その計算は決定性を持つと見なされます。これは一見単純な特性のように思えるかもしれませんが...
原文を表示
A computation is considered deterministic if multiple runs with the same input data produce the same bitwise result. While this may seem like a simple property to guarantee, it can be difficult to achieve in practice, especially in parallel programming and floating-point arithmetic. This is because floating-point addition and multiplication aren’t strictly associative—that is, (a + b) + c may not equal a + (b + c)—due to rounding that occurs when intermediate results are stored with finite precision.
With NVIDIA CUDA Core Compute Libraries (CCCL) 3.1, CUB—a low-level CUDA library for speed-of-light parallel device algorithms—added a new single-phase API that accepts an execution environment, enabling users to customize algorithm behavior. We can use this environment to configure the reduce algorithm’s determinism property. This can only be done through the new single-phase API, since the two-phase API doesn’t accept an execution environment.
The following code shows how to specify the determinism level in CUB (find the complete example online using compiler explorer).
auto input = thrust::device_vector<float>{0.0f, 1.0f, 2.0f, 3.0f}; auto output = thrust::device_vector<float>(1); auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed); auto error = cub::DeviceReduce::Sum(input.begin(), output.begin(), input.size(), env); if (error != cudaSuccess) { std::cerr << "cub::DeviceReduce::Sum failed with status: " << error << std::endl; } assert(output[0] == 6.0f);
We begin by specifying the input and output vectors. We then use cuda::execution::require() to construct a cuda::std::execution::env object, setting the determinism level to not_guaranteed.
There are three determinism levels available for reduction, which are:
not_guaranteed
run_to_run
gpu_to_gpu
Determinism not guaranteed
In floating-point reductions, the result can depend on the order in which elements are combined. If two runs apply the reduction operator in different orders, the final values may differ slightly. In many applications, these minor differences are acceptable. By relaxing the requirement for strict determinism, the reduction implementation can rearrange the operations in any order, which can improve runtime performance.
In CUB, not_guaranteed relaxes the determinism level. This enables atomic operations—whose unordered execution across threads results in a different order of operations between runs—to compute both the block-level partial aggregates and the final reduction value. The entire reduction can also be performed in a single kernel launch, since the atomic operations combine the block-level partial aggregates into the result.
The nondeterministic reduce variant is typically faster than the run-to-run deterministic version—particularly for smaller input arrays, where performing the reduction in a single kernel reduces latency from multiple kernel launches, minimizes extra data movement, and avoids additional synchronization. The tradeoff is that repeated runs may yield slightly different results due to the lack of deterministic behavior.
Run-to-run determinism
While nondeterministic reductions offer potential performance gains, CUB also provides a mode that guarantees consistent results across runs. By default, cub::DeviceReduce is run-to-run deterministic, which corresponds to setting the determinism level to run_to_run in the single-phase API. In this mode, multiple invocations with the same input, kernel launch configuration, and GPU will produce identical outputs.
This determinism is achieved by structuring the reduction as a fixed, hierarchical tree rather than relying on atomics, whose update order can vary across runs. At each stage of the reduction, elements are first combined within individual threads. The intermediate results are then reduced across threads within a warp using shuffle instructions, followed by a block-wide reduction using shared memory. Finally, a second kernel aggregates the per-block results to produce the final output. Because this sequence is predetermined and independent of the relative timing of thread execution, the same inputs, kernel configuration, and GPU yield the same bitwise result.
GPU-to-GPU determinism
For applications that require the highest level of reproducibility, CUB also provides GPU-to-GPU determinism, which guarantees identical results across multiple runs with the same input on different GPUs. This mode corresponds to setting the determinism level to gpu_to_gpu.
To achieve this level of determinism, CUB uses a Reproducible Floating-point Accumulator (RFA), a solution based on the NVIDIA GTC 2024 session, Restoring the Scientific Method to HPC: High Performance Reproducible Parallel Reductions. The RFA counters floating-point non-associativity—which arises when adding numbers with different exponents—by grouping all input values into a fixed number of exponent ranges (the default is three bins). This fixed, structured accumulation order ensures the final result is independent of GPU architecture.
The accuracy of the final result depends on the number of bins: more bins provide greater accuracy, but also increase the number of intermediate summations, which can reduce performance. The current implementation defaults the number of bins to three, an optimal default providing balanced performance and accuracy. It’s worth noting that this configuration is not just strictly deterministic, but also guarantees numerically correct results, providing tighter error bounds than the standard pairwise summation traditionally used in parallel reductions.
How results vary based on the determinism levels
The three determinism levels differ in the amount of variation they produce across multiple runs:
Not-guaranteed determinism produces slightly different summation values on each invocation.
Run-to-run determinism ensures the same value for every invocation on a single GPU, but the result may vary if a different GPU is used.
GPU-to-GPU determinism guarantees that the summation value is identical for every invocation, regardless of which GPU executes the reduction.
This is shown in Figure 1, with the summation of an array for each determinism level—represented by green, blue, and red circles—plotted against the run number. A flat horizontal line shows that the reduction produces the same result.
Figure 1. Summation value compared to run
Determinism performance comparison
The level of determinism selected affects the performance of cub::DeviceReduce. Not-guaranteed determinism, with its relaxed requirements, provides the highest performance. The default run-to-run determinism delivers good performance but is slightly slower than not-guaranteed determinism. GPU-to-GPU determinism, which enforces the strictest reproducibility across different GPUs, can significantly reduce performance, increasing execution time by 20% to 30% for large problem sizes.
Figure 2 compares the performance of the different determinism requirements for float32 and float64 inputs on an NVIDIA H200 GPU (lower is better). They clearly show how the choice of determinism level impacts execution time across different data types.
Figure 2. Elapsed time compared to the number of elements
Conclusion
With the introduction of the single-phase API and explicit determinism levels, CUB provides an enhanced toolbox for controlling both the behavior and performance of reduction algorithms. Users can choose the level of determinism that best suits their needs: from the high-performance and flexible, not-guaranteed mode, to the reliable run-to-run default, and up to the strictest GPU-to-GPU reproducibility.
Determinism in CUB isn’t limited to reductions. We plan to extend these capabilities to additional algorithms for developers to control reproducibility across a wider range of parallel CUDA primitives. For updates and discussion, see the ongoing GitHub issue on expanded determinism support, to follow our roadmap, and provide feedback on algorithms you’d like to see deterministic versions of.
関連記事
今日のまとめ
AI日報で今日の重要ニュースをまとめ読み