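Building an Agent That Thinks Like a Data Scientist: How Reusable Tool Generation Took 1st Place on DABStep
NeMo Data Explorer, a data analysis agent developed by an NVIDIA team, reached 1st place on the DABStep benchmark through reusable tool generation and a multi-phase design, delivering automated analysis of tabular data and fast inference.
Key Points
1st place on the DABStep benchmark and a 30x speedup
The agent substantially outperformed the existing Claude Code baseline and set a new state of the art on the data agent benchmark.
A multi-phase design that separates knowledge building from inference
Building foundational knowledge upfront and keeping inference fast delivers both accuracy and speed on complex multi-step reasoning.
Autonomous analysis through reusable tool generation
The architecture mimics a data scientist's thought process and links automatic code generation, execution, and visualization seamlessly.
Agent loops optimized per use case
Exploratory data analysis uses a ReAct agent with Jupyter tools, while tabular QA uses a Tool Calling agent combined with a dedicated interpreter and a retriever.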
Open-ended EDA Workflow
A ReAct Agent translates user inputs into notebook tool calls, while a VLM-integrated handler converts visual outputs into textual analysis for informed agent responses.
DABStep Benchmark Design
The benchmark features 450 financial tasks, with 84% classified as hard multi-step reasoning requiring documentation reading, code generation, and cross-referencing.
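Separating learning and inference across three phases
Heavy preprocessing (learning) is separated from fast execution (inference and offline reflection), mirroring the workflow of a human data scientist to improve efficiency.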
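Impact Analysis
This result shows that LLM agents can move beyond plain text search into structured data analysis, a core area of practical work. In particular, the 1st-place DABStep result and the speedup could dramatically improve the automation of data exploration and the speed of decision-making in finance, research, and business analytics.
Editorial Comment
Automated analysis of tabular data is an area of high practical demand, and the multi-phase design and tool-generation mechanism presented here could become an industry standard. However, further validation is needed on how well the approach copes with real-world data quality and domain-specific constraints.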





Build an Agent That Thinks Like a Data Scientist: How We Hit #1 on DABStep with Reusable Tool Generation
The world of data is vast, but quantitative information is often sparse or unavailable in text form online, presenting a significant challenge for deep research agents. This post shares an architecture, NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer, for building autonomous data analysis agents, developed by the NVIDIA Kaggle Grandmasters (KGMON) LLM Agent Research Team. The project introduces an agent specialized for dataset exploration and analysis, designed to handle the complexities of multi-step reasoning, tool calling, and iterative data analysis. Notably, our approach establishes new state-of-the-art (SOTA) performance on the Data Agent Benchmark for Multi-step Reasoning (DABStep), ranking 1st with a 30x speedup over the Claude Code baseline.
The success of the multi-phase approach on the challenging DABStep benchmark validates the strategy of separating foundational knowledge building from rapid inference.
Motivation: Bridging the Gap in Data Analysis
Deep research agents, especially those relying on internet text search, fall short when dealing with structured, tabular data that requires complex, multi-step queries.
Our core motivation is to create an agent that helps users:
- Iterate faster on analysis through automatic code generation and execution.
- Crack complex tabular questions with multi-step reasoning and tool use.
- Make sense of large unstructured contexts using semantic search.
- Stay oriented in experiments by generating and interpreting visualizations automatically.
NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer aims to deliver capabilities including automatic open-ended exploratory data analysis, tabular data Q&A, predictive modeling, and forecasting.
The NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer Architecture
In NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer, we implement different agent loops for different use cases. The architecture leverages the NVIDIA NeMo Agent Toolkit to drive these loops, utilizing tools designed specifically from a data scientist's perspective. For open-ended exploratory data analysis, the system pairs a ReAct agent with a Jupyter Notebook tool, allowing for continuous, bi-directional interaction. Alternatively, for multi-step rule-based tabular data QA, the architecture utilizes a Tool Calling Agent. This agent interacts with a distinct, multi-part suite of specialized tools to accomplish its structured tasks: a stateful Python interpreter, a retriever, and a file structure detector.

Open-ended Exploration and Tabular Data QA
Currently the NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer focuses on two primary applications:
- Open-ended Exploratory Data Analysis (EDA)
The figure below illustrates the architecture for open-ended exploratory data analysis driven by a ReAct Agent. The workflow begins with the user mounting a dataset and sending questions or instructions to the ReAct Agent, which translates these inputs into specific tool calls. These calls are sent to the Notebook Manipulation Tools, a suite capable of standard operations like creating notebooks, adding code, and running cells. Once the tools execute the commands, the raw output flows into the Tool Output Handler. A critical feature of this handler is its integration with a Vision-Language Model (VLM); if the tool output includes a visual plot, the handler sends it to the VLM to generate a textual description and suggestions for improving the plot's aesthetics and information richness. The handler then replaces the visual plot with this text-based analysis, sending the processed tool output back to the ReAct Agent so it can formulate an informed response to the user.
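The Tool Output Handler step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `TextOutput` and `PlotOutput` types, the `handle_tool_outputs` function, and the `fake_vlm` stand-in are all hypothetical names, and a real handler would call an actual Vision-Language Model instead of the stub.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class TextOutput:
    text: str


@dataclass
class PlotOutput:
    image_bytes: bytes  # e.g. a PNG rendered by a notebook cell


ToolOutput = Union[TextOutput, PlotOutput]


def handle_tool_outputs(
    outputs: List[ToolOutput],
    describe_plot: Callable[[bytes], str],
) -> str:
    """Replace every plot with a text description and return plain text."""
    parts = []
    for out in outputs:
        if isinstance(out, PlotOutput):
            # The image itself never reaches the agent, only its description.
            parts.append("[plot] " + describe_plot(out.image_bytes))
        else:
            parts.append(out.text)
    return "\n".join(parts)


def fake_vlm(image_bytes: bytes) -> str:
    # Hypothetical stand-in for a VLM call; the post does not specify the API.
    return f"image of {len(image_bytes)} bytes; consider adding axis labels"


observation = handle_tool_outputs(
    [TextOutput("rows: 1000"), PlotOutput(b"\x89PNG")], fake_vlm
)
print(observation)
```

Passing the describer in as a callable keeps the handler independent of any particular VLM client, so the text-only ReAct agent sees the same interface whichever model produces the description.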

- Multi-Step Rule-based Tabular Data QA
This addresses hard questions that require multi-step reasoning and tool calling against a tabular dataset. We focus on the Data Agent Benchmark for Multi-step Reasoning (DABStep), which comprises 450 tasks in the financial payments sector. The benchmark is structured into three main components:

The Context & Query component includes questions and heterogeneous data sources (such as CSV and JSON files), alongside a markdown manual detailing domain logic and rules. The Benchmark Tasks component categorizes the workload into Easy Tasks (16%), which are basic single-dataset queries, and Hard Tasks (84%), which require complex, multi-step, tool-augmented reasoning. These hard tasks involve reading documentation, generating code (such as SQL or Pandas), and cross-referencing data to calculate an answer, where web search offers little to no useful help. Finally, the Evaluation component measures success using an Exact Text Match with strict formatting requirements, expecting a JSONL output that includes both the agent_answer and the reasoning_trace.
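To make the evaluation format concrete, here is a small sketch of writing a JSONL submission and scoring it by exact text match. The `agent_answer` and `reasoning_trace` fields come from the description above. The `task_id` key, the function names, and the use of plain string equality are assumptions for illustration; the official scorer may normalize answers differently.

```python
import json


def to_jsonl(records):
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


def exact_match_accuracy(jsonl_text, expected_by_task):
    """Fraction of tasks whose agent_answer equals the expected string exactly."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    hits = sum(
        1 for r in records if r["agent_answer"] == expected_by_task[r["task_id"]]
    )
    return hits / len(records)


submission = to_jsonl([
    {"task_id": "1", "agent_answer": "12.50", "reasoning_trace": "summed fees"},
    {"task_id": "2", "agent_answer": "A,B", "reasoning_trace": "listed ids"},
])
# Hypothetical reference answers; note the space in "A, B".
expected = {"1": "12.50", "2": "A, B"}
accuracy = exact_match_accuracy(submission, expected)
print(accuracy)
```

In this toy run the second answer is logically right but formatted differently, so it counts as a miss. That is why strict formatting requirements matter under exact matching.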
Cracking DABStep: A Multi-Phase Approach
To achieve state-of-the-art (SOTA) results on DABStep, we need to separate the heavy lifting from the fast execution. The system is split into three distinct phases: a Learning phase where the agent uses general skills and ground truth data to forge reusable, specialized tools; an Inference phase that applies these tools to solve new questions rapidly; and an Offline Reflection phase that reviews the outputs to generate deeper insights. This mimics how a human data scientist operates: spending significant effort upfront to build a robust toolkit so that future tasks become efficient and scalable.

Phase 1: The Learning Loop
In the Learning phase, we deploy a heavyweight model (like Opus 4.5/4.6) in a multi-pass loop equipped with a full arsenal of tools, including a stateful Python interpreter, bash tools, and file structure detectors. By tackling a batch of representative tasks (e.g., Tasks 1 through 10) and validating them against ground truth answers, the agent builds a comprehensive mental model of the dataset. It then synthesizes the individual Python scripts into one master solution, ultimately distilling it down to a highly optimized library of reusable functions (helper.py) and a concise set of few-shot examples that demonstrate how the helper functions solve the questions in the dev split (training set).

Recognizing Interconnected Tasks & Optimizing Sub-Solutions Across the Board
The core insight driving this approach is that complex data questions rarely exist in isolation. As shown in the merchant fee examples, different tasks often share the exact same foundational data operations. For instance, computing a specific transaction fee for a specific month (Task 2) requires the exact same initial steps—fetching merchant info and finding fee data—as simply listing the applicable fee IDs (Task 1). Recognizing and mapping this overlap is the key to building a modular, DRY (Don't Repeat Yourself) system.
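The overlap between Task 1 and Task 2 can be sketched as two thin task functions built on the same two helpers. Everything here is illustrative: the data, the field names, and the fee formula are invented for the example and are not DABStep's actual rules or the contents of the real helper.py.

```python
# Hypothetical toy data; the real benchmark uses CSV/JSON files and a manual.
MERCHANTS = {"acme": {"category": "retail"}}
FEES = [
    {"id": 1, "category": "retail", "fixed": 0.10, "rate": 0.02},
    {"id": 2, "category": "travel", "fixed": 0.30, "rate": 0.01},
    {"id": 3, "category": "retail", "fixed": 0.05, "rate": 0.03},
]


def get_merchant_info(name):
    """Shared step 1: fetch the merchant record."""
    return MERCHANTS[name]


def find_applicable_fees(merchant):
    """Shared step 2: find the fee rules that apply to this merchant."""
    return [fee for fee in FEES if fee["category"] == merchant["category"]]


def task1_list_fee_ids(name):
    """Task 1: list the applicable fee IDs."""
    return [fee["id"] for fee in find_applicable_fees(get_merchant_info(name))]


def task2_total_fee(name, amount):
    """Task 2: reuse the same two steps, then compute a fee (invented formula)."""
    fees = find_applicable_fees(get_merchant_info(name))
    return sum(fee["fixed"] + fee["rate"] * amount for fee in fees)


print(task1_list_fee_ids("acme"))
print(round(task2_total_fee("acme", 100.0), 2))
```

Because both tasks go through `get_merchant_info` and `find_applicable_fees`, a fix to either helper benefits every task that depends on it, which is the point of the DRY structure.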

Instead of writing isolated, brittle scripts for every new question, the agent actively searches for the most robust logic. If "Version 1" of a function works perfectly for Task 1 but fails when applied to the slightly different constraints of Task 2, the agent recognizes the flaw. By actively testing candidate functions via the Python interpreter against the ground truth of multiple interconnected tasks, the agent iteratively discovers a "Version 2" that successfully generalizes across the entire batch.
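A minimal sketch of this "Version 1 versus Version 2" selection is shown below. It assumes a hypothetical convention in which a rule whose category is `None` applies to every category. The task tuples, function names, and selection helper are invented for illustration; in the described system an LLM agent proposes the candidates and tests them through the Python interpreter.

```python
def matching_ids_v1(rules, category):
    """Version 1: strict equality, so wildcard rules are ignored."""
    return [r["id"] for r in rules if r["category"] == category]


def matching_ids_v2(rules, category):
    """Version 2: treats a category of None as a wildcard (assumed convention)."""
    return [r["id"] for r in rules if r["category"] in (None, category)]


# Each hypothetical task: (rules, queried category, ground-truth ids).
TASKS = [
    ([{"id": 1, "category": "retail"}, {"id": 2, "category": "travel"}],
     "retail", [1]),
    ([{"id": 1, "category": "retail"}, {"id": 2, "category": None}],
     "retail", [1, 2]),
]


def passes_all(fn, tasks):
    """True when the candidate reproduces the ground truth of every task."""
    return all(fn(rules, cat) == expected for rules, cat, expected in tasks)


def first_generalizing(candidates, tasks):
    """Return the first candidate that is correct across the whole batch."""
    for fn in candidates:
        if passes_all(fn, tasks):
            return fn
    return None


chosen = first_generalizing([matching_ids_v1, matching_ids_v2], TASKS)
print(chosen.__name__)
```

Version 1 is correct on the first task alone but fails once the wildcard rule appears, so only validation against several interconnected tasks exposes the flaw.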
Refactoring and Packaging

Once the optimal, generalized logic is found, the agent refactors the bulky independent scripts into a clean, unified architecture. The complex data extraction and computation steps are packaged into the centralized helper.py library. Consequently, the actual code needed to answer any specific question shrinks dramatically. The final task solutions transform from long, complex scripts into lightweight instructions that simply import and execute the right tools from the helper library.
Phase 2: Fast and Lean Inference

With the foundational code written, the Inference phase shifts to a smaller, faster model (like Haiku 4.5) running a single-pass loop. Because the complex domain logic is already securely housed in helper.py, the inference agent only needs a basic Python interpreter to do its job. To keep token costs and latency to an absolute minimum, the context window is aggressively pruned: the agent is fed only the function signatures (not the underlying code) alongside a streamlined system prompt, allowing it to efficiently orchestrate the pre-built tools to solve unseen tasks.
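One way to hand the inference agent only the function signatures is Python's standard `inspect` module. The sketch below is an assumption about how such pruning could be done, not the project's code; the two helper functions are hypothetical placeholders standing in for the contents of helper.py.

```python
import inspect


def get_merchant_info(name: str) -> dict:
    # Placeholder body; the real logic would live in helper.py.
    return {"name": name}


def total_fee(name: str, amount: float) -> float:
    # Placeholder body; the implementation is never shown to the agent.
    return 0.0


def signature_digest(functions):
    """Render one 'def name(signature)' line per function, without bodies."""
    return "\n".join(
        f"def {fn.__name__}{inspect.signature(fn)}" for fn in functions
    )


digest = signature_digest([get_merchant_info, total_fee])
print(digest)
```

The digest grows with the number of functions rather than with the size of their implementations, which keeps the context window small however complex the domain logic inside the library becomes.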
Phase 3: Unsupervised Offline Reflection

To ensure high quality without bottlenecking the live inference loop, we move critical quality control entirely offline. This phase relies on two powerful LLM evaluation techniques—reflection and group-consistency—driven by a heavyweight model (like Opus or Sonnet 4.6) acting as an unsupervised reviewer.
Reflection is the process where the model looks back at the agent's generated code and reasoning to audit its performance. It asks the tough questions: Did the agent effectively utilize the helper.py library? Did it follow the prompt faithfully? Are there any obvious mistakes in the code?
Group-consistency, on the other hand, involves analyzing multiple candidate solutions across groups of similar test questions to ensure the agent's logic remains stable. If the agent solves the exact same type of question using conflicting methods, the offline model flags the discrepancy and reasons through which approach is actually correct. By moving these computationally heavy checks offline, we can deeply analyze the data without sacrificing the speed of the Inference phase.
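The flagging half of a group-consistency check can be sketched deterministically, as below. This is only an illustration under stated assumptions: how questions are grouped, what counts as a "method", and the record fields `task_id`, `group`, and `method` are all hypothetical. In the described system, a heavyweight LLM then reasons about which of the conflicting approaches is correct.

```python
from collections import defaultdict


def flag_inconsistent_groups(solutions):
    """Return groups of similar questions that were solved with differing methods."""
    methods_by_group = defaultdict(set)
    for solution in solutions:
        methods_by_group[solution["group"]].add(solution["method"])
    return {
        group: sorted(methods)
        for group, methods in methods_by_group.items()
        if len(methods) > 1
    }


# Hypothetical inference logs: which helper each solution relied on.
solutions = [
    {"task_id": "11", "group": "monthly_fee", "method": "total_fee"},
    {"task_id": "12", "group": "monthly_fee", "method": "manual_sum"},
    {"task_id": "13", "group": "fee_ids", "method": "list_fee_ids"},
    {"task_id": "14", "group": "fee_ids", "method": "list_fee_ids"},
]
flagged = flag_inconsistent_groups(solutions)
print(flagged)
```

Only the flagged groups need to be escalated to the expensive reviewer model, which keeps the offline pass focused on genuine disagreements.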
Closing the Loop: Injecting Insights for Faster Inference
The insights generated during this offline reflection aren't just for analytics—they are actively fed back into the architecture to close the learning loop. By extracting key patterns, edge cases, and potential pitfalls from the test data, the heavy model compiles these learnings and injects them directly into the system prompt for future Inference phases. Because the lightweight inference agent already holds these pre-calculated insights in its starting prompt, we completely eliminate the need for slow, computationally expensive online reflection or consistency checks. The result is an Inference phase that remains blazingly fast and token-efficient, while continuously compounding its accuracy with every offline review.
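Injecting the reviewed insights into the next system prompt could look like the following sketch. The base prompt text, the `build_system_prompt` function, the section heading, and the cap on the number of insights are assumptions chosen for illustration; the post does not describe the exact prompt format.

```python
BASE_PROMPT = "You solve tabular questions by calling functions from helper.py."


def build_system_prompt(base, insights, max_insights=5):
    """Append de-duplicated offline insights to the base prompt as bullets."""
    # dict.fromkeys keeps first-seen order while dropping repeats.
    unique = list(dict.fromkeys(i.strip() for i in insights if i.strip()))
    kept = unique[:max_insights]
    if not kept:
        return base
    bullets = "\n".join(f"- {insight}" for insight in kept)
    return f"{base}\n\nLessons from offline review:\n{bullets}"


prompt = build_system_prompt(
    BASE_PROMPT,
    [
        "Treat empty rule fields as wildcards.",
        "Treat empty rule fields as wildcards.",
        "  ",
        "Round fees to 2 decimals.",
    ],
)
print(prompt)
```

Capping and de-duplicating the list is one way to keep the prompt short as reviews accumulate, so the accuracy gains do not erode the token efficiency of the inference phase.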
[Figure: DABStep results for NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer + Haiku 4.5, Claude Code + Opus 4.5, DataPilot from AntGroup, and DS-STAR from Google AI]
To validate this architecture, we benchmarked our three-phase "NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer" approach (using the lightweight Haiku 4.5 for inference) against a standard baseline using "Claude Code" with the heavyweight Opus 4.5, which attempts to solve every task from scratch. The results highlight the efficiency gains of our methodology. Because our inference agent relies on the pre-built helper.py library, it solves tasks quickly, taking only 20 seconds per task and generating a concise 1,870 characters of code. In contrast, the from-scratch approach takes 10 minutes per task and bloats the code length to 5,011 characters. Most impressively, this 30x speedup doesn't compromise complex reasoning. While the heavy Opus model edged ahead slightly on "Easy" tasks (90.2 vs. 87.5), our approach dominated the "Hard" tasks, scoring 89.95 compared to the baseline's 66.93. This shows that investing time in upfront learning and code abstraction allows even smaller, faster models to outperform heavier models on complex, multi-step problems.
This performance secured our architecture 1st place on the official DABStep leaderboard. The NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer approach significantly outperformed AntGroup's DataPilot and Google AI's DS-STAR on complex problems. With a score of 89.95 on "Hard" tasks, our system surpassed DataPilot (87.57) and nearly doubled DS-STAR's score (45.24). Given that 84% of the benchmark consists of hard-level tasks, our dominance in this category directly secures our position as the best overall solution. These results establish our three-phase methodology as the current state of the art for both efficient and rigorous tabular reasoning.
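The headline figures can be checked with a few lines of arithmetic. The speedup and ratios follow directly from the numbers reported above. The overall score is an assumption: it is computed as a mean weighted by the 16%/84% easy/hard split, which the post does not state as the leaderboard's formula.

```python
EASY_SHARE, HARD_SHARE = 0.16, 0.84  # task mix reported for DABStep


def speedup(baseline_seconds, ours_seconds):
    """How many times faster our per-task time is than the baseline's."""
    return baseline_seconds / ours_seconds


def weighted_score(easy, hard):
    """Assumed overall score: easy and hard scores weighted by task share."""
    return EASY_SHARE * easy + HARD_SHARE * hard


time_speedup = speedup(10 * 60, 20)              # 10 minutes vs 20 seconds
code_length_ratio = 5011 / 1870                  # characters of generated code
ours_overall = weighted_score(87.5, 89.95)       # Data Explorer + Haiku 4.5
baseline_overall = weighted_score(90.2, 66.93)   # Claude Code + Opus 4.5
hard_ratio_vs_ds_star = 89.95 / 45.24

print(time_speedup)
print(round(code_length_ratio, 2))
print(round(ours_overall, 2), round(baseline_overall, 2))
print(round(hard_ratio_vs_ds_star, 2))
```

Under that weighting assumption the overall scores come out at roughly 89.6 versus 70.7, which illustrates why a small deficit on easy tasks is outweighed by the lead on the hard tasks that make up most of the benchmark.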
Conclusion: A New Paradigm for Data-Intensive Research
Building on top of NVIDIA NeMo Agent Toolkit, the Data Explorer agent represents a significant step forward in automated data analysis for structured tabular data. By employing flexible agent loops—a ReAct loop for open-ended exploratory data analysis and a multi-phase system for rule-based tabular QA—the agent is uniquely positioned to handle complex, multi-step reasoning tasks. The success of the multi-phase approach on the challenging DABStep benchmark, particularly the proactive learning loop that generates reusable, generalized functions, validates the strategy of separating foundational knowledge building from rapid inference. Data Explorer moves beyond simple query-answering to embody the operational workflow of a seasoned data scientist, delivering scalable, high-quality insights and establishing a new paradigm for data-intensive research driven by LLM-powered agents.
Ready to build your own data exploration agent? Get started with NVIDIA Launchable. Examples will be released soon!




