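Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark
NVIDIA developed DGX Spark as a solution for efficiently scaling the long-running, complex workloads of autonomous AI agents.
Key points
Challenges for autonomous AI agents
Autonomous AI agents must manage long-running tasks that use multiple communication channels, which makes them difficult to scale on conventional infrastructure.
The role of DGX Spark
NVIDIA DGX Spark is a solution designed to efficiently scale such autonomous AI agents and their workloads.
Technical approach
DGX Spark manages the complex workloads of AI agents and optimizes resources, enabling large-scale AI operations.
Contribution to next-generation AI innovation
This technology provides an important infrastructure foundation for autonomous AI agents to drive the next wave of AI innovation.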
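Impact analysis
This article focuses on a key infrastructure challenge in putting AI agents into practical use and presents a concrete solution from NVIDIA. It could remove operational bottlenecks that companies face as they roll out autonomous AI in earnest, and is expected to accelerate the AI industry's move into the practical-use phase.
Editorial comment
The information comes from the NVIDIA developer blog and reads strongly as promotion for the company's own product, but it clearly lays out the core challenges of running autonomous AI agents in production and their solutions, offering practical insight for industry practitioners.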
Autonomous AI agents are driving the next wave of AI innovation. These agents must often manage long-running tasks that use multiple communication channels and background subprocesses simultaneously to explore options, test solutions, and generate optimal results. This places extreme demands on local compute.
NVIDIA DGX Spark provides the performance necessary for autonomous agents to execute these complex workflows efficiently and locally. NVIDIA NemoClaw, part of the NVIDIA Agent Toolkit, now installs the NVIDIA OpenShell runtime (a secure environment for running autonomous agents) along with open source models such as NVIDIA Nemotron.
This post discusses several important aspects of system capabilities and performance that are necessary to power always-on autonomous agents and explains why NVIDIA DGX Spark is an ideal desktop platform for autonomous AI.
Inference for autonomous AI agents
Agentic tools often need to process massive context windows. OpenClaw, for example, is an AI agent runtime that requires these large context windows to comprehend requests and environments, and to think through the best approach to a problem.
Prompt processing (prefill) throughput can be thought of as the reading comprehension phase of inference and can easily become a bottleneck with a slow GPU. It’s common to see autonomous agents easily using contexts of 30K-120K tokens (100K tokens is equivalent to reading Harry Potter and the Philosopher’s Stone), with some agents processing 250K tokens for complex requests.
Table 1 shows how a potential agent or subagent performs with a large context window (128K input sequence length, or ISL, and 1K output sequence length, or OSL).
Model | End-to-end latency (s) | Prompt processing latency (s) | Prompt processing throughput (tok/s) | Token generation throughput (tok/s)
NVIDIA Nemotron 3 Super 120B NVFP4 with TensorRT LLM | 99 | 44 | 2,855 | 18
Qwen3.5 35B A3B FP8 with vLLM | 73 | 41 | 3,080 | 35.75
Qwen3 Coder Next 80B FP8 with vLLM | 89 | 54 | 2,390 | 28.95
Table 1. Performance representative of 128K tokens input prompt and response of 1K tokens, at batch size 1
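The figures in Table 1 can be cross-checked with a simple two-phase latency model: prefill time is roughly the input tokens divided by prompt processing throughput, and generation time is roughly the output tokens divided by token generation throughput. The sketch below is an approximation only. It assumes 128K means 128,000 tokens and ignores scheduling overhead, and the function name is illustrative.

```python
def estimate_latency(isl, osl, prefill_tps, decode_tps):
    """Return (prefill_s, decode_s, total_s) for one request at batch size 1."""
    prefill_s = isl / prefill_tps   # time to read the prompt
    decode_s = osl / decode_tps     # time to generate the response
    return prefill_s, decode_s, prefill_s + decode_s

# (prompt tok/s, generation tok/s, reported end-to-end seconds) from Table 1
TABLE_1 = {
    "Nemotron 3 Super 120B NVFP4": (2855, 18, 99),
    "Qwen3.5 35B A3B FP8": (3080, 35.75, 73),
    "Qwen3 Coder Next 80B FP8": (2390, 28.95, 89),
}

for name, (prefill_tps, decode_tps, reported) in TABLE_1.items():
    prefill_s, decode_s, total_s = estimate_latency(128_000, 1_000, prefill_tps, decode_tps)
    print(f"{name}: prefill {prefill_s:.1f}s + decode {decode_s:.1f}s "
          f"= {total_s:.1f}s (reported {reported}s)")
```

Under these assumptions the estimates land within a few seconds of the reported end-to-end latencies, and prefill accounts for roughly 45% to 60% of the total, which is why prompt processing throughput matters so much at this context size.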
When moving from a single subagent to multiple subagents, simultaneous workloads must scale without impacting performance significantly. NVIDIA DGX Spark effectively handles high concurrency in this scenario.
Thanks to the power of the NVIDIA Grace Blackwell Superchip, the GPU can parallelize multiple subagents. Two, four, or even eight subagents concurrently working through requests can make use of the strong concurrency capabilities in DGX Spark.
With support from frameworks that handle concurrency well (such as NVIDIA TensorRT LLM, vLLM, and SGLang), multiagent workloads run smoothly on NVIDIA DGX Spark. For tasks with 32K ISL and 1K OSL, completing four times as many tasks requires only 2.6x more time, while prompt processing throughput increases by about 3x (Table 2).
NVIDIA DGX Spark is an ideal platform for OpenClaw development. With NVIDIA OpenShell, you can run autonomous, self-evolving agents more safely. Get started running OpenClaw locally on NVIDIA DGX Spark.
Concurrency (# of simultaneous tasks) | End-to-end latency (s), lower is better | Median TTFT (s), lower is better | Prompt processing throughput (tok/s), higher is better | Token generation throughput (tok/s), higher is better
1 | 35 | 9 | 3,261 | 38
2 | 54 | 12 | 5,363 | 47
4 | 91 | 15 | 9,616 | 53
Table 2. Performance representative of Qwen3 Coder Next in FP8 in vLLM for a 32K tokens input prompt and response of 1K tokens at different concurrency levels
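The 2.6x and roughly 3x figures quoted for Table 2 follow directly from its rows. The short sketch below recomputes them and adds an amortized seconds-per-task figure, which is a derived number and not one reported in the post. The dictionary layout and function name are illustrative.

```python
# concurrency -> (end-to-end latency in s, prompt processing throughput in tok/s), from Table 2
TABLE_2 = {1: (35, 3261), 2: (54, 5363), 4: (91, 9616)}

def scaling_vs_single(concurrency):
    """Compare a concurrency level against the single-task baseline."""
    base_latency, base_tps = TABLE_2[1]
    latency, tps = TABLE_2[concurrency]
    return {
        "time_ratio": latency / base_latency,        # how much longer the whole batch takes
        "throughput_ratio": tps / base_tps,          # how much more prompt throughput
        "seconds_per_task": latency / concurrency,   # amortized wall time per task
    }

for level in (2, 4):
    print(level, scaling_vs_single(level))
```

At concurrency 4 the batch takes about 2.6x as long as a single task while finishing four tasks, so the amortized time per task drops from 35 s to under 23 s.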
Scale inference and fine-tuning on up to four NVIDIA DGX Spark nodes
Larger models and multiple subagents require more memory to load and execute. Until now, NVIDIA DGX Spark has supported scaling up to two nodes, increasing the available memory from 128 GB on one node to 256 GB on two nodes. This capability has now been increased to up to four DGX Spark nodes.
DGX Spark also now supports several execution topologies, each tailored to different goals through the low latency of RoCE communication enabled by ConnectX-7 NICs.
One DGX Spark node: Ideal for low latency, large context size inference, fine-tuning up to 120B parameters, and local agentic workloads
Two DGX Spark nodes: Balanced scaling for faster fine-tuning and larger models, as well as support for up to 400B-parameter inference
Three DGX Spark nodes in a ring: Ideal for fine-tuning larger models or small training jobs
Four DGX Spark nodes with RoCE 200 GbE switch: Local inference server ideal for state-of-the-art models up to 700B parameters, communication intensive workloads, and local AI factory operations
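A rough weights-only estimate shows why the parameter limits above track the node count. The sketch below assumes 128 GB of unified memory per node (from the text) and, as an illustrative assumption not stated in the post, 4-bit weights. It ignores KV cache, activations, and memory reserved by the system, so real headroom is smaller. The function names are hypothetical.

```python
NODE_MEMORY_GB = 128  # unified memory per DGX Spark node, from the text

def weight_footprint_gb(params_billion, bits_per_weight):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

def fits(params_billion, bits_per_weight, nodes):
    """True if the weights alone fit in the combined memory of the given nodes."""
    return weight_footprint_gb(params_billion, bits_per_weight) <= nodes * NODE_MEMORY_GB

# Assumed 4-bit weights; the precision is an illustration, not a figure from the post.
for params, nodes in [(120, 1), (400, 2), (700, 4)]:
    gb = weight_footprint_gb(params, 4)
    print(f"{params}B at 4-bit: {gb:.0f} GB of {nodes * NODE_MEMORY_GB} GB, fits={fits(params, 4, nodes)}")
```

With these assumptions a 700B-parameter model needs about 350 GB for weights, which fits in the 512 GB of four nodes, while the same model at 8 bits per weight would not.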
Inference can scale up linearly on DGX Spark when internode communication is minimal. When work is largely independent per GPU, the results are aggregated once at the end rather than continuously. In this case, DGX Spark nodes can run in parallel with low synchronization overhead.
For example, a reinforcement learning (RL) workload in NVIDIA Isaac Lab can run many simulations independently on each node. Results are collected in a single step, yielding near-linear scaling across multiple DGX Spark nodes.
Inference scaling is less than linear when the workload requires frequent, fine-grained communication between nodes. During LLM inference, model execution occurs layer by layer, with continuous synchronization required across nodes. Partial results from different DGX Spark nodes must be exchanged and merged repeatedly, which introduces significant communication overhead. As additional nodes are added, this overhead becomes increasingly dominant, limiting scaling efficiency.
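The difference between the two regimes can be illustrated with a toy cost model: compute divides evenly across nodes, while every synchronization adds a fixed cost. All numbers below are hypothetical and chosen only to show the shape of the effect. They are not measurements of DGX Spark.

```python
def step_time(nodes, compute_s, syncs_per_step, sync_cost_s):
    """Toy model: compute splits evenly; each sync costs a fixed time once there is more than one node."""
    comm_s = syncs_per_step * sync_cost_s if nodes > 1 else 0.0
    return compute_s / nodes + comm_s

def speedup(nodes, compute_s, syncs_per_step, sync_cost_s):
    """Speedup over a single node under the toy model."""
    single = step_time(1, compute_s, syncs_per_step, sync_cost_s)
    return single / step_time(nodes, compute_s, syncs_per_step, sync_cost_s)

# Hypothetical workload: 10 s of compute per step, 10 ms per synchronization.
for label, syncs in [("one sync per step", 1), ("80 syncs per step", 80)]:
    row = [round(speedup(n, 10.0, syncs, 0.01), 2) for n in (1, 2, 4)]
    print(label, row)
```

With one synchronization per step the model stays close to linear. With many fine-grained synchronizations the fixed communication cost becomes a growing share of each step as the per-node compute shrinks, which is the behavior described above for layer-by-layer LLM inference.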
Parallelism for AI agents: Inference at scale
Tensor parallelism shards a model across multiple nodes so that it fits in memory, while keeping communication overhead low. Scaling from two to four DGX Spark nodes parallelizes well thanks to the low-latency ConnectX-7 NICs: in inference use cases, time per output token (TPOT) improves almost linearly, by ~2x with TP2 (two nodes) and nearly 4x with TP4 (four nodes).
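As a minimal illustration of the idea, the sketch below simulates one common way to shard a layer (row-parallel splitting of the input dimension) in plain Python. Each simulated node multiplies its slice of the input by its slice of the weights, and the partial outputs are summed, which stands in for the all-reduce that would travel over the network. This is an assumption-laden toy, not a description of how TensorRT LLM partitions any particular model.

```python
def matvec(weights, x):
    """y[j] = sum_i x[i] * weights[i][j] for a weight matrix stored row by row."""
    cols = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(cols)]

def tensor_parallel_matvec(weights, x, nodes):
    """Split the input dimension across nodes, then sum the partial outputs."""
    shard = len(weights) // nodes  # assumes the row count divides evenly
    partials = []
    for n in range(nodes):
        lo, hi = n * shard, (n + 1) * shard
        partials.append(matvec(weights[lo:hi], x[lo:hi]))  # local work on node n
    # The element-wise sum stands in for the all-reduce between nodes.
    return [sum(p[j] for p in partials) for j in range(len(partials[0]))]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 0, 2, 1]
print(matvec(W, x), tensor_parallel_matvec(W, x, 2), tensor_parallel_matvec(W, x, 4))
```

Every node holds only a fraction of the weights, which is what lets a larger model fit, and the price is one merge of partial results per sharded layer.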
Table 3 shows how a single agent performs an inference job shared across multiple nodes.
Metric | 1 DGX Spark node TP1 (ms) | 2 DGX Spark nodes TP2 (ms) | 4 DGX Spark nodes TP4 (ms)
TTFT (lower is better) | 33,415 | 21,384 | 15,552
TPOT (lower is better) | 269 | 133 | 72
Table 3. Scaling Llama 3.3 70B Instruct NVFP4 on TensorRT LLM with one, two, and four DGX Spark nodes (32K input, 1K output, batch size 1)
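The speedups implied by Table 3 can be recomputed directly from its values. The sketch below also derives a parallel efficiency (speedup divided by node count), which is a derived figure and not one reported in the post. The names are illustrative.

```python
# Milliseconds per node count, from Table 3
TABLE_3_MS = {
    "TTFT": {1: 33415, 2: 21384, 4: 15552},
    "TPOT": {1: 269, 2: 133, 4: 72},
}

def speedup_and_efficiency(metric, nodes):
    """Return (speedup over one node, speedup divided by node count)."""
    base = TABLE_3_MS[metric][1]
    s = base / TABLE_3_MS[metric][nodes]
    return s, s / nodes

for metric in TABLE_3_MS:
    for nodes in (2, 4):
        s, eff = speedup_and_efficiency(metric, nodes)
        print(f"{metric} on {nodes} nodes: {s:.2f}x speedup, {eff:.0%} efficiency")
```

TPOT improves by about 2.0x on two nodes and about 3.7x on four, while TTFT improves by about 1.6x and 2.1x, so in this benchmark token generation scales closer to linearly than prompt processing does.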
Several models that are popular in the context of OpenClaw—including Qwen3.5 397B, GLM 5, and MiniMax M2.5 230B—can benefit from stacking multiple DGX Spark units, increasing the available memory.
Near-linear fine-tuning
Fine-tuning and similar workloads can be significantly parallelized with close-to-linear performance scaling when the model instance can fit on one GPU. This reduces the communication overhead to only gradient synchronization at the end of each step.
An RL workload in NVIDIA Isaac Lab or Nanochat can benefit from this performance scaling. Isaac Lab can accommodate several copies of each environment on each DGX Spark. For each step, Isaac Lab communicates to the other nodes to synchronize the training, achieving linear speedup through clustering.
Metric | 1 DGX Spark node TP1 | 2 DGX Spark nodes TP2 | 4 DGX Spark nodes TP4
Collection time | 12.1 s | 11.4 s | 10.4 s
Learning time | 40.9 s | 41.4 s | 42.3 s
# environments | 1,024 | 1,024 | 1,024
FPS | 630 | 1,241 | 2,520
Table 4. Scaling of Isaac Lab reinforcement learning performance on one, two, and four DGX Spark nodes
HW configuration | Total token throughput (tok/s) | Speedup versus 1 DGX Spark node
1 DGX Spark node | ~18,400 | 1
2 DGX Spark nodes | ~35,900 | 2
4 DGX Spark nodes | ~74,600 | 4
Table 5. Scaling of Nanochat fine-tuning performance from one to four DGX Spark nodes (model depth of 20 layers, batch size of 32 per node, full context attention)
When using distributed data parallel (DDP), fine-tuning can similarly benefit from the low communication overhead. In this case, each node can fully host a copy of the model and communicate with the other nodes once per step.
Nodes | Samples/step | Batch size | Samples/s | Speedup
1 DGX Spark node | 15.73 | 32 | 2.03 | –
3 DGX Spark nodes | 15.69 | 96 | 6.12 | 3x
Table 6. Scaling one DGX Spark to three DGX Spark nodes, each node has the full model of Qwen3 4B (batch size of four samples per device, BF16 quantization)
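The once-per-step communication pattern of DDP can be simulated in a few lines of plain Python. In the sketch below each simulated node computes a gradient on its own shard of the batch, the gradients are averaged in a single step (standing in for the all-reduce), and every replica applies the same update. The one-parameter model, data, and learning rate are hypothetical, and this is a simulation of the idea, not PyTorch's DistributedDataParallel.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def ddp_step(w, shards, lr):
    """One data-parallel step: local gradients, one averaging sync, same update everywhere."""
    local_grads = [grad_mse(w, shard) for shard in shards]  # computed independently per node
    avg_grad = sum(local_grads) / len(local_grads)          # stands in for the once-per-step all-reduce
    return w - lr * avg_grad

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0), (6.0, 12.0)]
shards = [data[0:2], data[2:4], data[4:6]]  # three equal shards, one per simulated node

w_ddp = ddp_step(0.0, shards, lr=0.01)
w_single = 0.0 - 0.01 * grad_mse(0.0, data)  # same step computed on the full batch
print(w_ddp, w_single)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, so adding nodes increases the samples processed per step without changing the update, and the only cross-node traffic is the single gradient exchange.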
Develop on DGX Spark, deploy to the cloud: Cross-architecture workflows
Cloud solutions are required when moving from prototyping to large-scale production deployment. This section explains how workloads developed on DGX Spark can be deployed in the cloud.
Tile IR and cuTile Python enable seamless kernel portability from DGX Spark development environments to cloud deployment on NVIDIA Blackwell data center GPUs, with minimal code changes. Using TileGym, developers can:
Write kernels once using cuTile Python DSL
Test and validate on DGX Spark
Deploy to NVIDIA Blackwell B300/B200, NVIDIA Hopper, or NVIDIA Ampere with minimal code changes
Leverage TileGym preoptimized transformer kernels as drop-in replacements
End-to-end inference performance
Beyond kernel-level analysis, we benchmarked complete Qwen2 7B inference using cuTile kernels on both platforms to demonstrate cross-architecture performance portability. Table 7 shows the configuration; Table 8 shows the platform specification.
Parameter | Value
Model | Qwen2 7B
Input length | 2,189 tokens
Output length | 128 tokens
Batch sizes | 1, 2, 4, 8, 16, 32, 64, 128
Table 7. Model and parameter specifications showing Tile IR usage
Specification | NVIDIA DGX Spark (Dev) | NVIDIA Blackwell B200 (Cloud)
Compute capability | SM 12.1 | SM 10.0
SM count | 48 | 148
SM frequency | 2.14 GHz | ~1.0 GHz
Memory type | LPDDR5X (Unified) | HBM3e
Memory bandwidth | 273 GB/s | ~8 TB/s
Table 8. Platform specifications of NVIDIA DGX Spark and NVIDIA B200 as local and cloud examples
Platform-specific configuration
While the kernel source code remains identical across platforms, optimal performance is achieved through platform-specific configurations (Tile and Occupancy). For the FMHA kernel example, Table 9 shows how these configurations adapt to different hardware characteristics. Tile IR compiles to architecture-specific PTX/SASS at JIT, automatically leveraging platform-specific features like Tensor Memory Accelerator (TMA) using the appropriate configuration.
Platform | TILE_M | TILE_N | Occupancy | Rationale
NVIDIA DGX Spark (SM 12.1) | 64 | 64 | 2 | Smaller tiles for 48 SMs, unified memory
NVIDIA B200 (SM 10.0) | 256 | 128 | 1 | Large tiles maximize HBM3e throughput
NVIDIA B200 (alt) | 128 | 128 | 2 | Higher occupancy, balanced parallelism
Table 9. Platform-specific cuTile configuration across NVIDIA DGX Spark and NVIDIA B200
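One way to keep a single kernel source while varying only the configuration is a lookup keyed by compute capability. The sketch below encodes the Table 9 values in plain Python. The class, dictionary, function, and fallback choice are hypothetical, and no cuTile API is shown because how these values are passed to a kernel launch is not covered in this post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FmhaConfig:
    tile_m: int
    tile_n: int
    occupancy: int

# Values from Table 9, keyed by (major, minor) compute capability.
FMHA_CONFIGS = {
    (12, 1): FmhaConfig(64, 64, 2),    # NVIDIA DGX Spark
    (10, 0): FmhaConfig(256, 128, 1),  # NVIDIA B200
}
FALLBACK = FmhaConfig(64, 64, 2)  # assumption: conservative default for unlisted GPUs

def select_fmha_config(compute_capability):
    """Pick the tile and occupancy settings for a given (major, minor) capability."""
    return FMHA_CONFIGS.get(tuple(compute_capability), FALLBACK)

print(select_fmha_config((12, 1)))
print(select_fmha_config((10, 0)))
```

The capability tuple could come from a runtime query such as `torch.cuda.get_device_capability()`, which returns a `(major, minor)` pair. It is passed in as an argument here so the sketch runs without a GPU.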
Roofline analysis and comparison of Tile IR kernel performance
Roofline analysis in NVIDIA Nsight Compute is a powerful visual performance framework used to determine how well an application is utilizing hardware capabilities. As a developer, roofline analysis helps you figure out whether your code is “slow” and shows why it may be hitting a performance ceiling.
Analysis of the roofline model suggests that the kernel scales effectively relative to each platform's roofline, demonstrating that Tile IR is a viable option to scale workloads. The kernel considered is the attention decode kernel, optimized using Tile IR.
Figure 1. Roofline analysis in NVIDIA Nsight Compute shows how Tile IR kernel performance scales on NVIDIA B200 and NVIDIA DGX Spark relative to the theoretical peak roofline of each GPU
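The roofline itself is a simple bound: attainable performance is the lower of the compute peak and the arithmetic intensity multiplied by memory bandwidth. The sketch below uses the memory bandwidths from Table 8. The peak compute value is a hypothetical placeholder, since no peak FLOP/s figures are given in this post, so the printed numbers illustrate the formula and not the real devices.

```python
def attainable_flops(arithmetic_intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline: performance is capped by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, arithmetic_intensity * bandwidth_bytes_per_s)

def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Arithmetic intensity (FLOP/byte) above which a kernel becomes compute-bound."""
    return peak_flops / bandwidth_bytes_per_s

SPARK_BW = 273e9           # bytes/s, from Table 8
B200_BW = 8e12             # bytes/s, approximate, from Table 8
PLACEHOLDER_PEAK = 100e12  # FLOP/s, hypothetical value for illustration only

for name, bw in [("DGX Spark", SPARK_BW), ("B200", B200_BW)]:
    bound = attainable_flops(2.0, PLACEHOLDER_PEAK, bw)
    print(f"{name}: memory-bound ceiling at 2 FLOP/byte = {bound / 1e12:.2f} TFLOP/s")
```

A kernel with low arithmetic intensity, such as attention decode, sits on the sloped memory-bound part of the roofline, so its ceiling scales with memory bandwidth and rises as arithmetic intensity increases.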
Performance scaling and optimization headroom
In Figure 1, the vertical position of the data points confirms that the kernel achieves higher hardware utilization on NVIDIA B200: the blue dot (B200) sits closer to the B200 memory roofline than the green dot (DGX Spark) does to the Spark roofline.
The roofline analysis indicates additional opportunities for optimization and suggests that algorithmic or memory optimizations made on NVIDIA DGX Spark will also benefit NVIDIA B200 GPUs.
Cache utilization and arithmetic intensity
Analysis of the x-axis reveals that the blue dot is positioned to the right of the green dot, signifying that the B200 achieves superior Hardware Arithmetic Intensity.
Cache efficiency: While the larger cache capacity of NVIDIA B200 GPU provides the theoretical foundation for reducing DRAM traffic, hardware alone is insufficient. The software must be architected to exploit these resources.
Kernel portability: The rightward shift indicates that Tile IR kernels successfully leverage the NVIDIA B200 expanded cache hierarchy on migration.
Future Tile IR kernel optimizations aimed at increasing arithmetic intensity on Spark—moving the data point further right along the x-axis—will inherently result in compounded performance benefits when running on various cloud GPUs.
Automated cross-platform autotuning
Currently, optimal configurations are selected based on platform characteristics. Future releases of cuTile will support fully automated cross-platform autotuning. The autotuner will discover optimal tile sizes and occupancy settings for each target architecture automatically, enabling transparent performance portability without any manual configuration.
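Until such an autotuner ships, the general idea can be sketched as an exhaustive search over candidate configurations. The code below is a generic illustration and not the cuTile autotuner or its API. The `measure` callable is a hypothetical hook that would run the kernel and return a cost such as elapsed seconds, and `fake_measure` is a synthetic stand-in so the sketch runs without a GPU.

```python
import itertools

def autotune(measure, tile_ms, tile_ns, occupancies):
    """Return the (tile_m, tile_n, occupancy) candidate with the lowest measured cost."""
    candidates = itertools.product(tile_ms, tile_ns, occupancies)
    return min(candidates, key=lambda cfg: measure(*cfg))

def fake_measure(tile_m, tile_n, occupancy):
    # Synthetic cost with its minimum at (128, 128, 2); not real kernel timings.
    return abs(tile_m - 128) + abs(tile_n - 128) + abs(occupancy - 2)

best = autotune(fake_measure, tile_ms=[64, 128, 256], tile_ns=[64, 128], occupancies=[1, 2])
print(best)
```

In practice the measurement would be repeated per target architecture, which is how the same kernel source could end up with different tile sizes and occupancy settings on each platform.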
Get started with NVIDIA DGX Spark
As AI systems become more sophisticated, NVIDIA DGX Spark provides the flexible, multitopology execution environment required to deploy them efficiently. From multiagent inference to trillion-parameter serving, from fine-tuning to Tile IR cross-cloud pipelines, DGX Spark delivers both scalability and efficiency.
The result is a unified platform where enterprises can deploy and scale AI workloads—without rewriting infrastructure for every model or runtime.
Learn more with the following playbooks:
Connect Three DGX Spark in a Ring Topology
Connect Multiple DGX Spark through a Switch
Start building on NVIDIA DGX Spark.