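Hugging Face Blog · March 12, 2026, 12:53 · about 3 min read

NVIDIA's AI-Q Takes First Place on DeepResearch Bench I and II

#AIAgents #ResearchAgents #MultiAgent #Benchmark #OpenArchitecture #NVIDIA

TL;DR

NVIDIA's AI-Q deep research agent took first place on both DeepResearch Bench I and II, delivering state-of-the-art agentic research on an open, modular architecture that enterprises can own and customize.

AI in-depth analysis · March 12, 2026, 13:41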

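Key points

First place on both benchmarks

The NVIDIA AI-Q deep research agent ranked first on both of the two primary benchmarks for evaluating deep research agents: DeepResearch Bench I (55.95 points) and DeepResearch Bench II (54.50 points).

Open, modular architecture

AI-Q is a blueprint that provides a fully open, modular architecture that enterprises can own, inspect, customize, and configure per use case.

Multi-agent architecture

The deep research agent uses a multi-agent architecture consisting of a planner, a researcher, and an orchestrator, built on NVIDIA NeMo Agent Toolkit and fine-tuned NVIDIA Nemotron 3 Super models.

What leading on both benchmarks means

DeepResearch Bench I evaluates report quality (comprehensiveness, depth of insight, instruction-following, and readability), while DeepResearch Bench II applies 70+ fine-grained rubrics covering information recall, analysis, and presentation. Excelling on both indicates polished reports as well as accurate underlying retrieval and reasoning.

State-of-the-art research accessible to developers

A single configurable stack leading on both benchmarks shows that developer-accessible models and tooling can power state-of-the-art agentic research.

AI-Q core stack

The stack consists of NVIDIA NeMo Agent Toolkit for workflow construction, LangChain DeepAgents for the multi-phase planning, research, and orchestration flow, and NVIDIA Nemotron 3 LLMs powering the agent pipeline.

Four ingredients behind AI-Q's result

A multi-agent architecture, a fine-tuned Nemotron 3 Super model, custom middleware for long-horizon reliability, and an optional ensemble researcher and report refiner layer.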

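Impact analysis

This result marks a meaningful step toward open, portable deep research and opens a path for enterprises to build and operate customizable AI research agents grounded in their own data. Demonstrating state-of-the-art performance with developer-accessible tooling could accelerate the democratization and practical adoption of AI agent technology.

Editorial comment

Topping both benchmarks is impressive, but the more important point is the design philosophy of an open, modular architecture. Showing enterprises a path to owning and customizing their own AI research agents is a major boost toward practical adoption.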

How NVIDIA AI-Q Reached #1 on DeepResearch Bench I and II

Contributors: Raja Biswas, Divyansh Jain, Ivan Sorokin, Alessio Devoto, Chantal D Gama Rose, Ajay Thorve, David Austin, Jean-Francois Puget

The NVIDIA AI-Q deep research agent recently achieved first place on both DeepResearch Bench (55.95) and DeepResearch Bench II (54.50), the two primary benchmarks for evaluating deep research agents. This marks a meaningful step for open, portable deep research. One configurable stack leading on both shows that developer-accessible models and tooling can power state-of-the-art agentic research.

What Sets AI-Q Apart?

AI-Q is an open blueprint for building AI agents that reason over enterprise and web data to deliver well-cited responses. AI-Q provides a fully open and modular architecture that enterprises can own, inspect, customize, and configure per use case. The deep researcher is one workflow within the larger AI-Q blueprint, which also includes intent routing, query clarification, and shallow research. The deep researcher adopts a multi-agent architecture consisting of a planner, a researcher, and an orchestrator, built on NVIDIA NeMo Agent Toolkit and fine-tuned NVIDIA Nemotron 3 Super models, with an optional ensemble and report refiner for maximum report quality. One stack - flexible by design, tunable to your needs.

Why Winning Both Benchmarks Matters

DeepResearch Bench I and II evaluate research agents in complementary ways.

DeepResearch Bench scores report quality against a reference report along comprehensiveness, depth of insight, instruction-following, and readability dimensions. Doing well here rewards polished, well-structured narratives and strong synthesis.

DeepResearch Bench II uses 70+ fine-grained, binary rubrics per task to check whether an agent retrieves the right information (Information Recall), synthesizes it into higher-level analysis (Analysis), and presents findings clearly (Presentation). Doing well here rewards granular factual correctness and analytical rigor.

Leading on both benchmarks means the AI-Q deep researcher produces polished, well-cited reports and gets the underlying retrieval and reasoning right.
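To make the rubric-style evaluation concrete, the binary checks used by DeepResearch Bench II can be pictured as a pass rate per dimension. The sketch below is illustrative only: the example outcomes, the `rubric_pass_rates` helper, and the unweighted averaging are assumptions, not the benchmark's published scoring code.

```python
from collections import defaultdict

def rubric_pass_rates(results):
    """Return the pass rate per dimension for binary rubric results.

    `results` is a list of (dimension, passed) pairs; this shape is a
    hypothetical stand-in for the benchmark's real rubric format.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for dimension, ok in results:
        total[dimension] += 1
        passed[dimension] += 1 if ok else 0
    return {dim: passed[dim] / total[dim] for dim in total}

# Made-up rubric outcomes for a single task.
example_results = [
    ("Information Recall", True),
    ("Information Recall", False),
    ("Analysis", True),
    ("Analysis", True),
    ("Presentation", True),
]
rates = rubric_pass_rates(example_results)
```

Because each rubric is pass/fail, a report cannot earn partial credit through fluent writing alone; a missing fact simply fails its check.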

Architecture at a Glance

The AI-Q deep researcher architecture behind both results centers on three components: an orchestrator that coordinates the research loop, a planner that maps the information landscape and designs an evidence-grounded research plan, and a researcher that dispatches parallel specialists to gather and synthesize evidence across multiple analytical lenses. Each agent can be powered by a different LLM. An optional ensemble runs multiple agents in parallel and merges their outputs for maximum report quality and coverage of information. Figure 1 shows the full architecture.

Figure 1. AI-Q deep researcher: orchestrator, planner, and researcher pipeline (right) with optional ensemble (left).

Core Stack: NVIDIA and Deep Research

The same underlying stack powers both leaderboard submissions. It is open, reproducible, and built on:

NVIDIA NeMo Agent Toolkit for workflow wiring, function registration, and evaluation. The NeMo Agent Toolkit open source library provides config-driven composition of LLMs and tools and the ability to plug in different agent graphs.

LangChain DeepAgents for the multi-phase planner–researcher–orchestrator flow with subagent middleware where applicable.

NVIDIA Nemotron 3 LLMs powering the agent pipeline. Nemotron models can be fine-tuned to excel at research synthesis and long-horizon tool calling, and can be served via NVIDIA Build or NVIDIA NIM for model inference.

The core is always multi-step research (plan → gather → synthesize), web search (Tavily) and academic paper search (Serper), and citation-backed reports. Optionally, an ensemble layer and report refiner can be added on top for maximum report quality.

Key Ingredients in AI-Q

Four ingredients were central to the result:

Multi-agent architecture with evidence-grounded planning and specialist researchers, built on NVIDIA NeMo Agent Toolkit and LangChain DeepAgents.

Fine-tuned NVIDIA Nemotron 3 Super: Roughly 67k SFT trajectories generated from a few seed datasets of research questions and filtered with a principle-based judge. This model powers the researcher and its sub-agents.

Custom middleware for long-horizon reliability. NeMo Agent Toolkit and LangChain middleware are extended with components that improve reliability and robustness.

Ensemble researcher and report refiner (optional): parallel pipeline outputs merged by an LLM, with a post-hoc refiner for maximum report quality.

Each is detailed in the sections that follow.

Fine-Tuned NVIDIA Nemotron 3 Super: Data and Training

A major factor in the results is a custom fine-tuned NVIDIA Nemotron-3-Super-120B-A12B model. We chose it for this workflow because it aligns well with multi-step agentic reasoning, tool use, and citation-grounded reporting; fine-tuning on real search-and-synthesis trajectories makes it effective for planner, researcher, and orchestrator roles at scale.

Trajectory generation

We collected research questions from multiple open-source datasets: about 17k questions from OpenScholar, 21k from ResearchQA, and 2,457 from Fathom-DeepResearch-SFT.

Then we generated ~80k trajectories for the full workflow using the open-sourced GPT-OSS-120B model. Each trajectory covers planner, researcher, and orchestrator behavior.

It's worth noting that these trajectories include real web search results from the Tavily and Serper APIs so the model learns to navigate and perform multi-step searches and synthesis on real data.

Principle-based filtering

Most of the trajectories did not complete on time or were stopped due to exceeding the tool call limit, but for those that did produce expected results, we additionally applied filtering using the judge model.

The completed trajectories were scored with the nvidia/Qwen3-Nemotron-32B-GenRM-Principle judge model, which predicts quality along dimensions such as comprehensiveness, readability, accuracy, and relevance.

After filtering, ~67k trajectories were retained for training.
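The two-stage filtering described above (drop incomplete trajectories, then keep those the judge rates highly) can be sketched as follows. This is a minimal illustration under stated assumptions: the `Trajectory` fields, the single scalar `judge_score`, and the `min_score` threshold are hypothetical, since the article does not state how the judge's per-dimension predictions were combined or what cutoff was used.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    question: str
    completed: bool      # False if it timed out or hit the tool-call limit
    judge_score: float   # hypothetical scalar quality score from a judge model

def filter_trajectories(trajectories, min_score):
    """Keep completed trajectories whose judge score meets the threshold."""
    return [t for t in trajectories if t.completed and t.judge_score >= min_score]

# Made-up candidates: one good, one incomplete, one low-quality.
candidates = [
    Trajectory("q1", completed=True, judge_score=0.9),
    Trajectory("q2", completed=False, judge_score=0.0),
    Trajectory("q3", completed=True, judge_score=0.3),
]
kept = filter_trajectories(candidates, min_score=0.5)
```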

Model: NVIDIA Nemotron-3-Super-120B-A12B

Setup: One epoch, 5,615 steps, approximately 25 hours on 16×8 NVIDIA H100 GPUs.
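For a sense of scale, the setup figures above imply the following rough compute budget. The arithmetic assumes that "16×8" means 16 nodes with 8 GPUs each and that the 25 hours are wall-clock time; both are readings of the text, not stated facts.

```python
nodes, gpus_per_node = 16, 8      # "16×8 NVIDIA H100 GPUs" from the setup line
hours = 25                        # approximate wall-clock training time
steps = 5615                      # steps in the single epoch

total_gpus = nodes * gpus_per_node          # GPUs used in parallel
gpu_hours = total_gpus * hours              # approximate total GPU-hours
seconds_per_step = hours * 3600 / steps     # average wall-clock time per step
```

Under these assumptions the run used 128 GPUs for roughly 3,200 GPU-hours, at about 16 seconds per step.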

AI-Q Deep Researcher

AI-Q deep researcher adopts a multi-agent architecture (Orchestrator, Planner, and Researcher) with iterative plan → gather → synthesize loops, citation management, and custom middleware for long-horizon reliability. An optional ensemble and report refiner layer can be enabled for maximum report quality. The multi-agent design also serves as a long-context strategy: each subagent works within its own context window and returns only its synthesized output, so the orchestrator never sees the raw tool responses. This keeps the orchestrator's context focused and prevents long, noisy search results from degrading its reasoning.
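The long-context strategy described above can be sketched in a few lines: each subagent accumulates raw tool output in its own local context and hands back only a short brief. The functions `run_subagent`, `orchestrate`, and `fake_search` are hypothetical stand-ins, and the string formatting replaces what would be LLM synthesis in a real system.

```python
def run_subagent(task, search):
    """Run one subagent in its own context and return only a short brief.

    `search` is a hypothetical tool function returning raw result strings.
    """
    local_context = [f"task: {task}"]
    local_context.extend(search(task))          # raw tool output stays local
    # Stand-in for LLM synthesis: report how much evidence was gathered.
    return f"brief for '{task}' based on {len(local_context) - 1} sources"

def orchestrate(tasks, search):
    """Collect subagent briefs; raw search results never enter this context."""
    orchestrator_context = []
    for task in tasks:
        orchestrator_context.append(run_subagent(task, search))
    return orchestrator_context

def fake_search(query):
    # Placeholder for a real web search call; returns long, noisy text.
    return [f"raw page {i} about {query} " + "x" * 1000 for i in range(3)]

context = orchestrate(["market size", "key risks"], fake_search)
```

Each fake search returns several thousand characters, yet the orchestrator's context holds only two one-line briefs.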

Orchestrator: Coordinates the full research loop. Calls the Planner to produce an evidence-grounded research plan, then the Researcher multiple times with focused research tasks derived from that plan. After research completes, the orchestrator reviews the plan's quality constraints, dispatches targeted gap-filling research, and writes the long-form report. An optional refiner step edits the report using the raw researcher briefs in a fresh context window, providing a second evidence recovery point.

Planner: Runs in two phases. A Scout subagent first maps the information landscape through broad searches. An Architect subagent then designs the research plan including report outline, targeted search queries, and quality constraints, while running its own searches to validate structural choices.

Evidence-grounded planning is key to producing reliable, high-quality reports. Our planner knows the information landscape before it commits to a structure. It decides where to go deep and broad based on what it actually found, not assumptions.

Researcher: Dispatches multiple specialist subagents in parallel, each with a distinct lens:

Evidence Gatherer: facts, statistics, specific numbers from authoritative sources

Mechanism Explorer: causal explanations, theoretical frameworks

Comparator: benchmarks, head-to-head data, trade-off analyses

Critic: counterarguments, limitations, failure cases

Horizon Scanner: recent developments, emerging trends

They share the same search tools, but with different analytical framing. Diverse specialists researching the same topic often surface evidence that a single generalist would miss.

The researcher synthesizes specialist findings into a unified, cited brief. An LLM then cross-checks this synthesis against the raw specialist outputs in a fresh context window, recovering any relevant information.
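The researcher's fan-out and cross-check steps can be sketched as below. The lens names come from the list above; everything else is an assumption for illustration: `run_specialist` stands in for an LLM with search tools, and `missing_from_brief` replaces the LLM cross-check with a naive substring test.

```python
from concurrent.futures import ThreadPoolExecutor

# Lens names follow the article; the focus strings are illustrative.
LENSES = {
    "Evidence Gatherer": "facts, statistics, specific numbers",
    "Mechanism Explorer": "causal explanations, theoretical frameworks",
    "Comparator": "benchmarks, head-to-head data, trade-offs",
    "Critic": "counterarguments, limitations, failure cases",
    "Horizon Scanner": "recent developments, emerging trends",
}

def run_specialist(lens, focus, topic):
    """Hypothetical specialist: a real one would call an LLM with search tools."""
    return f"[{lens}] findings on {topic} ({focus})"

def research(topic):
    """Dispatch all specialists in parallel and return their raw outputs."""
    with ThreadPoolExecutor(max_workers=len(LENSES)) as pool:
        futures = {
            lens: pool.submit(run_specialist, lens, focus, topic)
            for lens, focus in LENSES.items()
        }
        return {lens: future.result() for lens, future in futures.items()}

def missing_from_brief(brief, raw_outputs):
    """Cross-check stand-in: list lenses whose output the brief never mentions."""
    return [lens for lens in raw_outputs if lens not in brief]

raw = research("solid-state batteries")
brief = "Synthesis citing Evidence Gatherer, Comparator, and Critic."
gaps = missing_from_brief(brief, raw)
```

Here `gaps` names the two lenses the brief dropped, which is the kind of omission the fresh-context cross-check is meant to recover.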

Config-Driven Flexibility

Every component is swappable. LLMs, tools, and agent graphs can be configured through YAML. Planner, researcher, and orchestrator can each be powered by a different LLM. For the benchmark submission, a fine-tuned Nemotron 3 drives the researcher, which processes 4x more tokens than the planner and orchestrator combined.
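The idea of assigning a different LLM to each role through configuration can be illustrated with a plain dictionary. This is only a sketch of the concept: the keys, the placeholder model names, and the `resolve_model` and `swap_llm` helpers are hypothetical and do not reproduce the actual NeMo Agent Toolkit YAML schema.

```python
# Hypothetical configuration shape; keys and model names are placeholders,
# not the real NeMo Agent Toolkit schema.
config = {
    "llms": {
        "base_llm": {"model": "base-model-placeholder"},
        "research_llm": {"model": "fine-tuned-model-placeholder"},
    },
    "agents": {
        "planner": {"llm": "base_llm"},
        "orchestrator": {"llm": "base_llm"},
        "researcher": {"llm": "research_llm"},
    },
}

def resolve_model(config, role):
    """Look up which model a given agent role is configured to use."""
    llm_name = config["agents"][role]["llm"]
    return config["llms"][llm_name]["model"]

def swap_llm(config, role, llm_name):
    """Point a role at a different registered LLM without touching agent code."""
    if llm_name not in config["llms"]:
        raise KeyError(f"unknown LLM: {llm_name}")
    config["agents"][role]["llm"] = llm_name

# Reassign the planner to the fine-tuned model by editing configuration only.
swap_llm(config, "planner", "research_llm")
```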

Custom Middleware for Long-Horizon Reliability

Each agent and subagent interleaves LLM and tool calls across many steps (often 32+). At that scale, the system may fail in ways that short interactions never expose. Our agent harness provides custom middleware to handle and mitigate these:

Tool name sanitization: LLMs may hallucinate tool names mid-run. This middleware applies pattern-based cleaning, alias resolution, and fuzzy matching to recover the intended tool.

Reasoning-aware retry: LLMs with reasoning sometimes produce thinking tokens without a tool call or final response, which would silently terminate the agent loop. Middleware detects this, preserves the reasoning in context, and retries.

Budget enforcement: Each agent and subagent has its own tool-call cap. When the limit is reached, middleware nudges the LLM to synthesize first, then removes tools entirely to force a text-only response.

Report validation: Before returning output, middleware checks minimum length and section structure. Incomplete reports get retried with a continuation prompt.

Each middleware addresses failure patterns observed in agent traces. Together they keep long-horizon runs reliable.
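Three of the middleware behaviors above lend themselves to a compact sketch: tool name sanitization, budget enforcement, and report validation. The code below is a minimal illustration, not the AI-Q implementation. The tool names, alias table, fuzzy-match cutoff, action labels, and section names are all assumptions.

```python
import difflib
import re

KNOWN_TOOLS = ["web_search", "paper_search"]   # hypothetical tool names
ALIASES = {"search": "web_search"}             # hypothetical alias table

def sanitize_tool_name(name):
    """Recover the intended tool via cleaning, aliases, then fuzzy matching."""
    cleaned = re.sub(r"[^a-z0-9_]", "", name.strip().lower().replace("-", "_"))
    cleaned = ALIASES.get(cleaned, cleaned)
    if cleaned in KNOWN_TOOLS:
        return cleaned
    matches = difflib.get_close_matches(cleaned, KNOWN_TOOLS, n=1, cutoff=0.8)
    return matches[0] if matches else None

class ToolBudget:
    """Track tool calls against a cap and decide what the agent may do next."""

    def __init__(self, cap):
        self.cap = cap
        self.used = 0
        self.nudged = False

    def next_action(self):
        """Call once per requested tool call; returns an action label."""
        if self.used < self.cap:
            self.used += 1
            return "allow_tool_call"
        if not self.nudged:
            self.nudged = True
            return "nudge_to_synthesize"   # first ask the LLM to wrap up
        return "remove_tools"              # then force a text-only response

def validate_report(report, min_chars, required_sections):
    """Return a list of problems; an empty list means the report passes."""
    problems = []
    if len(report) < min_chars:
        problems.append("too_short")
    for section in required_sections:
        if section not in report:
            problems.append(f"missing_section:{section}")
    return problems

budget = ToolBudget(cap=2)
actions = [budget.next_action() for _ in range(5)]
```

A non-empty result from `validate_report` would be the trigger for the retry with a continuation prompt described above.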

Ensemble

When enabled, N independent deep-research pipelines run in parallel. An LLM reads all N outputs, selects one as the structural base, and integrates unique content from the others. The ensemble produces broader evidence coverage than any single pipeline, directly improving comprehensiveness and information recall. A proofread pass removes process artifacts so the output reads as a single-authored work.
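The ensemble flow (run N pipelines in parallel, pick a base, fold in unique content) can be sketched as follows. This is a simplified illustration: `run_pipeline` is a hypothetical stub, and choosing the longest report as the base with exact-match deduplication replaces the LLM-driven selection and merging that the article describes.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(seed, question):
    """Hypothetical pipeline: returns a report as a list of paragraph strings."""
    shared = [f"Overview of {question}"]
    unique = [f"Finding {seed} about {question}"]
    return shared + unique

def merge_reports(reports):
    """Pick the longest report as the base, then append unseen paragraphs.

    A real merge step would use an LLM; exact-match deduplication is a
    simplification for illustration.
    """
    base = max(reports, key=len)
    merged = list(base)
    seen = set(base)
    for report in reports:
        for paragraph in report:
            if paragraph not in seen:
                merged.append(paragraph)
                seen.add(paragraph)
    return merged

def ensemble(question, n):
    """Run n independent pipelines in parallel and merge their reports."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        reports = list(pool.map(lambda seed: run_pipeline(seed, question), range(n)))
    return merge_reports(reports)

final_report = ensemble("grid storage", n=3)
```

The merged report keeps the shared overview once and gains one unique finding from each pipeline, which is the coverage benefit the ensemble is meant to provide.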

Post-hoc Refiner

An optional final report refiner step can run over the report with structured instructions to quantify vague claims, deepen entity coverage, cut scaffolding, ground risks, build comparison tables, and strengthen causal reasoning. The rewriting prompt is derived via self-supervised meta-learning against reference reports generated from our pipeline with frontier LLMs only.

NVIDIA AI-Q reached first place on both DeepResearch Bench and DeepResearch Bench II with a single stack: a multi-agent deep researcher built on NVIDIA NeMo Agent Toolkit, fine-tuned NVIDIA Nemotron 3 models, and custom middleware, with an optional ensemble and refiner when maximum report quality is needed. The stack is open, reproducible, and configurable to your needs. It delivers state-of-the-art results without compromising on transparency or control.

Join us at NVIDIA GTC in San Jose the week of March 16, 2026 to learn more.

S81706 - Evaluation-Driven Development: Best Practices for Building Reliable Agents

DLIT81725 - Develop Production Agents with Eval-Driven Design Dhruv Nandakumar US

S81570 - From Data to Decisions: Enabling AI Agents with Business Knowledge

S81569 - Self-Coding Agents: Architectures, Data Flywheels, and Autonomous Code Repair

S81789 - Open Source AI Shaping the Next Era of Intelligent Digital Workers
