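Hugging Face Blog · March 7, 2026 03:56 · approx. 1 min read

Conversational LLM Evaluations in Minutes with NVIDIA NeMo Evaluator Agent Skills

#LLMEvaluation #DeveloperTools #WorkflowAutomation #NVIDIA #Agents #NaturalLanguageInterface

TL;DR

NVIDIA has announced nel-assistant, a new agent skill built on the NeMo Evaluator library. It lets developers configure, run, and monitor LLM evaluations in natural language, removing the bottleneck of hand-writing complex YAML files.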

AI Deep Analysis · March 7, 2026 04:41


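Key Points

Removing the evaluation configuration bottleneck

Until now, running an LLM evaluation required many interconnected decisions (execution environment, deployment method, model parameters, and so on), and hand-writing long, complex YAML files became a development bottleneck.

Evaluation configuration in natural language

With the new nel-assistant agent skill, developers can configure production-ready evaluations in natural language, without hand-writing YAML files or shell commands.

Integration with agentic development tools

The skill is built so that evaluations can be configured, run, and monitored directly inside Cursor or any other preferred agentic development tool, fitting into the developer's existing workflow.

Built on NVIDIA NeMo Evaluator

The feature is built on the NVIDIA NeMo Evaluator library and draws on NVIDIA's LLM evaluation infrastructure.

Simpler evaluation configuration through agent skills

With the agent skill, evaluation configs are generated through conversation instead of being hand-written and debugged as complex YAML files.

Automatic model card research and tuning

The agent skill researches the model card and automatically identifies the optimal temperature, top_p value, context length, and tensor parallelism for the available GPU setup.

A three-phase approach to configuration

In the configure phase, the skill establishes context through five questions: execution environment, deployment backend, export destination, model type, and benchmark category.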


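Impact Analysis

This announcement meaningfully streamlines LLM development and evaluation workflows, and in particular addresses the practical problem of complex evaluation configuration. Natural-language configuration lowers the technical barrier, which may let more developers run advanced LLM evaluations and could help democratize AI development.

Editorial Comment

A practical solution that simplifies LLM evaluation configuration through natural language and speaks directly to an everyday developer pain point. The announcement focuses on practicality and workflow improvement rather than technical novelty.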


Original article: Conversational LLM Evaluations in Minutes with NVIDIA NeMo Evaluator Agent Skills

By Seph Mard, Besmira Nushi, Grzegorz Chlebus, Piotr Januszewski, Pablo Ribalta, Sylendran Arunagiri, VivienneZhang, and Nik Spirin (NVIDIA)

Running LLM evaluations should not require manually drafting long and complex YAML files. For developers, configuration overhead often becomes the bottleneck. The new nel-assistant agent skill enables natural language configuration of production-ready evaluations.

Built on the NVIDIA NeMo Evaluator library, it allows developers to configure, run, and monitor evaluations directly within Cursor or any other preferred agentic development tool. Everything happens through interaction with the agent, with no manually written YAML files or shell commands.

The Problem: Configuration Overhead

Running a single LLM evaluation means making dozens of interconnected decisions:

Execution: Local Docker or SLURM cluster?

Deployment: vLLM, SGLang, NVIDIA NIM, NVIDIA TensorRT-LLM, or external endpoint? How many nodes?

Model: What temperature? What context length? Does it use reasoning tokens?

Benchmarks: Tau2-Bench, MTEB, GSM8K, AIME, GPQA, LiveCodeBench, RULER, more? All of the above?

Export: Local files, CSV, Weights & Biases, or MLflow?

Each choice spawns sub-choices. Using vLLM? Configure tensor parallelism. Running reasoning models? Parse thinking tokens. Multi-node SLURM? Set up HAProxy load balancing. The result is a long, complex YAML config that is easy to get wrong and hard to debug.

    # A typical evaluation config - 200+ lines, easy to misconfigure
    execution:
      backend: slurm
      params:
        account: ???        # What's my account name again?
        partition: ???
        time_limit: "24:00:00"
    deployment:
      backend: vllm
      params:
        tensor_parallel_size: ???   # 2? 4? 8?
        max_model_len: ???          # Model card says... 32K? 128K?
    model:
      sampling_params:
        temperature: ???    # Was it 0.6 or 0.8?
        top_p: ???
    # ... 150 more lines ...

This configuration burden turns a simple evaluation into unnecessary complexity.

The Solution: Agent Skills for LLM Evaluation

Agent skills have become the standard way to give agents deep domain expertise. A "skill" packages instructions, scripts, and resources into portable folders that any compatible agent can discover and use.

In the coding assistant space, the ecosystem has exploded, covering everything from Git workflows to API integrations. The nel-assistant is an agent skill that turns Cursor, Claude Code, Codex, and other agentic developer tools into an evaluation expert. Instead of authoring YAML by hand, you have a conversation. The skill researches model cards to find optimal temperatures, top_p values, and context lengths, and identifies the optimal tensor parallelism for your specific GPU setup.

One conversation. Zero manual YAML debugging. Production-ready config.

    You: "Evaluate NVIDIA Nemotron-3-Nano-30B-A3B on standard benchmarks using vLLM locally. Export to Weights & Biases."

    Agent: I'll help you set that up. First, let me check your environment...
    ✓ NeMo Evaluator 26.01 detected

    I'll need 5 quick answers to build your config:
    1. Execution: local (detected from your request)
    2. Deployment: vllm (confirmed)
    3. Export: wandb (confirmed)
    4. Model type: chat or base? [chat]
    5. Benchmark category: standard (confirmed)

    Now let me research the model card...
    ✓ Found: temperature=0.6, top_p=0.95, context=128K
    ✓ Optimal TP=8 for your 2x H100 setup

    Generated config: Nemotron-3-Nano-30B-A3B.yaml
    Ready to run!

How It Works: Three Phases

Phase 1: Configure

The skill starts by asking five targeted questions to establish context:

Execution environment: Local or SLURM?

Deployment backend: vLLM, SGLang, NVIDIA NIM, NVIDIA TensorRT-LLM, or external?

Export destination: None, MLflow, or Weights & Biases?

Model type: Base, chat, or reasoning?

Benchmark categories: Standard, code, math, safety, or multilingual?

From these answers, it calls:

    nel skills build-config \
      --execution local \
      --deployment vllm \
      --model-type chat \
      --benchmarks standard

This command deep-merges modular YAML templates, each a tested, schema-compliant fragment, into a structurally valid config. Because the agent composes templates instead of writing free-form YAML, the usual source of syntax errors is removed.
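To make "deep merge" concrete, here is a minimal sketch in Python. It is not the NeMo Evaluator implementation; the function name `deep_merge` and the fragment contents are assumptions chosen for illustration, with fragments shown as plain dictionaries rather than loaded YAML.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict where nested dicts are merged key by key.

    Values in `override` win on conflict; neither input is modified.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical fragments standing in for the execution and deployment templates.
execution = {"execution": {"backend": "local"}}
deployment = {
    "deployment": {"backend": "vllm", "params": {"tensor_parallel_size": "???"}}
}
# A later fragment fills in one nested value without touching its siblings.
tuned = {"deployment": {"params": {"tensor_parallel_size": 2}}}

config = {}
for fragment in (execution, deployment, tuned):
    config = deep_merge(config, fragment)

print(config)
```

The point of merging recursively is that a later fragment can set `tensor_parallel_size` without overwriting `backend` in the same section, which a shallow `dict.update` would do.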

Next, the agent automatically analyzes the model card and applies optimal configuration parameters.

Give the agent a Hugging Face handle (for example, NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) and it extracts:

Sampling params: Temperature, top_p

Hardware logic: Optimal TP/DP settings based on your GPU count

Reasoning config: System prompts, payload modifiers (e.g., enable_thinking)

Context length: Max model length for vLLM --max-model-len

Developers no longer need to search through model cards to find the right settings. The agent reads the model details and applies the correct parameters automatically.
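The article does not describe how the hardware logic chooses a tensor-parallel size, so the following is only a rough sketch of one plausible heuristic: pick the smallest power-of-two GPU count whose per-GPU share of the weights fits within a fraction of GPU memory. The function name, the `weight_fraction` parameter, and its default of 0.5 are all assumptions.

```python
def pick_tensor_parallel(params_billion, bytes_per_param, gpu_mem_gb,
                         gpu_count, weight_fraction=0.5):
    """Smallest power-of-two TP size whose weight shard fits on one GPU.

    `weight_fraction` is the assumed share of GPU memory available for
    weights; the remainder is left for KV cache and activations.
    """
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    weights_gb = params_billion * bytes_per_param
    tp = 1
    while tp <= gpu_count:
        if weights_gb / tp <= gpu_mem_gb * weight_fraction:
            return tp
        tp *= 2
    raise ValueError("model does not fit on the available GPUs")


# A 30B-parameter model in BF16 (2 bytes per parameter) on two 80 GB GPUs.
tp_size = pick_tensor_parallel(30, 2, gpu_mem_gb=80, gpu_count=2)
print(tp_size)
```

Under these assumptions the weights take about 60 GB, which exceeds the 40 GB budget on one GPU but fits when split across two. A real implementation would also need to account for attention-head divisibility and context length.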

Without the skill, this usually means jumping between Hugging Face, blog posts, and documentation. It takes time and breaks focus. With the skill, the setup happens in seconds.

Phase 2: Validate and Refine

The skill identifies the remaining ??? placeholders that still need your input:

SLURM details: Account names, partition names, time limits

Export URIs: WandB project names, MLflow tracking URIs

API keys: Environment variables for deployments
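Finding the values that are still unset amounts to walking the config and collecting every key whose value is the ??? placeholder. The sketch below shows that idea on a plain dictionary; the helper name `find_missing` and the sample values are hypothetical, and the real skill may detect placeholders differently.

```python
def find_missing(config, prefix=""):
    """Return dotted paths of every value still set to the '???' placeholder."""
    missing = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            missing.extend(find_missing(value, path))
        elif value == "???":
            missing.append(path)
    return missing


# Hypothetical partially filled config, shaped like the SLURM example earlier.
config = {
    "execution": {
        "backend": "slurm",
        "params": {"account": "???", "partition": "???", "time_limit": "24:00:00"},
    },
    "deployment": {"backend": "vllm", "params": {"tensor_parallel_size": 2}},
}

for path in find_missing(config):
    print(f"needs a value: {path}")
```

Reporting dotted paths such as `execution.params.account` gives the agent something specific to ask the user about.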

You can interactively:

Add/remove tasks: Browse nel ls tasks

Override per-task settings: "Use temperature=0 for HumanEval but 0.7 for MMLU"

Configure advanced scaling: For >120B models, set up data-parallel multi-node with HAProxy load balancing

Add reasoning interceptors: Strip <think>
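A per-task override such as "temperature=0 for HumanEval but 0.7 for MMLU" can be thought of as a small dictionary layered over the default sampling parameters. This is an illustrative sketch only; the task keys and the `sampling_for` helper are assumptions, not the config schema used by NeMo Evaluator.

```python
def sampling_for(task, defaults, overrides):
    """Defaults with any task-specific overrides applied on top."""
    return {**defaults, **overrides.get(task, {})}


defaults = {"temperature": 0.6, "top_p": 0.95}
# Hypothetical task names mirroring the request quoted above.
overrides = {
    "humaneval": {"temperature": 0.0},
    "mmlu": {"temperature": 0.7},
}

for task in ("humaneval", "mmlu", "gsm8k"):
    print(task, sampling_for(task, defaults, overrides))
```

Tasks without an entry, such as `gsm8k` here, fall back to the defaults unchanged.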

Phase 3: Run and Monitor

The agent proposes a three-tier staged rollout: Dry run, Smoke test, and Full run.

    # 1. Dry run - validate without execution
    nel run --config nemotron-3-nano.yaml --dry-run

    # 2. Smoke test - 10 samples per task
    nel run --config nemotron-3-nano.yaml \
      -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10

    # 3. Full run
    nel run --config nemotron-3-nano.yaml
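If you wanted to script the same three stages yourself, one way is to build each command as an argument list and only proceed when the previous stage succeeds. The sketch below just constructs and prints the commands; it uses only the flags shown above, and the helper name `staged_commands` is an assumption.

```python
import shlex


def staged_commands(config_path, smoke_samples=10):
    """Build the dry-run, smoke-test, and full-run commands in order."""
    base = ["nel", "run", "--config", config_path]
    limit_key = "++evaluation.nemo_evaluator_config.config.params.limit_samples"
    return [
        ("dry run", base + ["--dry-run"]),
        ("smoke test", base + ["-o", f"{limit_key}={smoke_samples}"]),
        ("full run", list(base)),
    ]


stages = staged_commands("nemotron-3-nano.yaml")
for name, argv in stages:
    # Printing only; a real script could pass argv to subprocess.run(argv, check=True)
    # and stop at the first stage that fails.
    print(f"{name}: {shlex.join(argv)}")
```

Keeping each command as a list rather than a single string avoids shell-quoting mistakes when the config path contains spaces.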

Once submitted, progress can be monitored directly in Cursor using commands for status, detailed metrics, and live logs. You never leave your coding environment!

    Please, check the evaluation progress.

    # Agent runs: nel status nemotron-3-nano-20260212-143022 && nel info ...

    Status: RUNNING
    Progress: 3/8 tasks completed
    - ✓ mmlu: 65.2% accuracy (5 hours)
    - ✓ hellaswag: 78.4% accuracy (2 hours)
    - ✓ arc_challenge: 53.8% accuracy (1 hour)
    - ⏳ truthfulqa_mc2: 45% complete...
    - ⏳ winogrande: In queue
    - ⏳ gsm8k: In queue
    - ⏳ humaneval: In queue
    - ⏳ mbpp: In queue

Technical Details

Template-Based Generation

Instead of generating YAML from scratch, nel-assistant merges modular templates for execution, deployment, benchmarks, and exports. This deep merge ensures structural validity.

Model Card Extraction Pipeline

Cursor or your agentic IDE fetches the HuggingFace model card via web search.

Extraction via regex identifies parameters and chat templates.

Hardware logic calculates optimal TP/DP based on model size and available GPU memory.

Reasoning detection checks for keywords like "reasoning" or "chain-of-thought."

Values are injected directly into the config YAML.
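The regex extraction and keyword-based reasoning detection described in these steps might look roughly like the following. The pattern, the helper names, and the sample model card sentence are all assumptions made for illustration; the skill's actual patterns are not shown in the article.

```python
import re

# Matches forms such as "temperature=0.6" or "top_p: 0.95" (assumed formats).
SAMPLING_PATTERN = re.compile(
    r"\b(temperature|top_p)\s*[=:]\s*([0-9]*\.?[0-9]+)", re.IGNORECASE
)
REASONING_KEYWORDS = ("reasoning", "chain-of-thought")


def extract_sampling(card_text):
    """Map each sampling parameter found in the text to its float value."""
    return {
        name.lower(): float(value)
        for name, value in SAMPLING_PATTERN.findall(card_text)
    }


def looks_like_reasoning_model(card_text):
    """True if the card mentions any of the reasoning keywords."""
    lowered = card_text.lower()
    return any(keyword in lowered for keyword in REASONING_KEYWORDS)


# Hypothetical model card excerpt.
card = "We recommend temperature=0.6 and top_p: 0.95 for reasoning tasks."
print(extract_sampling(card))
print(looks_like_reasoning_model(card))
```

Keyword matching of this kind is cheap but coarse: a card that merely mentions reasoning benchmarks would also match, so the result is best treated as a suggestion for the user to confirm.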

Generic LLMs hallucinate YAML syntax. They mix incompatible backends. They invent flags that don't exist.

Instead of generating YAML from scratch, nel skills build-config deep-merges modular templates:

    templates/
    ├── execution/
    │   ├── local.yaml          # Docker execution
    │   └── slurm.yaml          # SLURM execution
    ├── deployment/
    │   ├── vllm.yaml           # vLLM backend
    │   ├── sglang.yaml         # SGLang backend
    │   └── nim.yaml            # NVIDIA NIM
    ├── benchmarks/
    │   ├── reasoning.yaml      # GPQA-D, HellaSwag, SciCode, MATH, AIME
    │   ├── agentic.yaml        # TerminalBench, SWE-Bench
    │   ├── longcontext.yaml    # AA-LCR, RULER
    │   ├── instruction.yaml    # IFBench, ArenaHard
    │   └── multi-lingual.yaml  # MMLU-ProX, WMT24++
    └── export/
        ├── wandb.yaml          # W&B integration
        └── mlflow.yaml         # MLflow integration

Deep merge = structural validity. You can't produce invalid YAML when you're composing pre-validated fragments.
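One reason a fixed template set guards against invented backends or flags is that every choice can be checked against a known list before anything is merged. The sketch below illustrates that check, using the template names from the directory listing above; the `select_templates` helper and the idea of validating this way are assumptions, not a description of how nel skills build-config works internally.

```python
from pathlib import PurePosixPath

# Names copied from the directory listing above; the shipped set may differ.
KNOWN_TEMPLATES = {
    "execution": {"local", "slurm"},
    "deployment": {"vllm", "sglang", "nim"},
    "benchmarks": {"reasoning", "agentic", "longcontext", "instruction", "multi-lingual"},
    "export": {"wandb", "mlflow"},
}


def select_templates(choices, root="templates"):
    """Return template paths for the given choices, rejecting unknown names."""
    paths = []
    for category, names in choices.items():
        if category not in KNOWN_TEMPLATES:
            raise ValueError(f"unknown template category: {category!r}")
        for name in names:
            if name not in KNOWN_TEMPLATES[category]:
                raise ValueError(f"no {category} template named {name!r}")
            paths.append(str(PurePosixPath(root) / category / f"{name}.yaml"))
    return paths


choices = {
    "execution": ["local"],
    "deployment": ["vllm"],
    "benchmarks": ["reasoning", "longcontext"],
    "export": ["wandb"],
}
paths = select_templates(choices)
for path in paths:
    print(path)
```

A request for a backend that has no template fails immediately with a clear error, rather than producing a config that only breaks at run time.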

The nel-assistant uses build-config

Configuration Should Not Be a Bottleneck

LLM evaluation already involves important decisions — selecting benchmarks, interpreting results, and comparing models. Configuration should support that process, not slow it down.

The nel-assistant skill makes it invisible. You describe what you want in natural language, and the agent handles the rest: researching model cards, generating configs, validating setups, staging rollouts, and monitoring progress.

No more 200-line YAML files. No more hunting through documentation. No more syntax errors.

Just: "Evaluate this model on these benchmarks."

GitHub: NVIDIA NeMo Evaluator

Tutorial: nel-assistant

Agent Skills Spec: agentskills.io

The nel-assistant skill is open-source and ships with NVIDIA NeMo Evaluator 26.01+. Contributions welcome on GitHub!
