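Conversational LLM Evaluation in Minutes with NVIDIA NeMo Evaluator Agent Skills
NVIDIA has announced nel-assistant, a new agent skill built on the NeMo Evaluator library. It lets developers configure, run, and monitor LLM evaluations in natural language, removing the bottleneck of writing complex YAML files by hand.
Key points
Removing the evaluation configuration bottleneck
Until now, running an LLM evaluation required many interconnected decisions (execution environment, deployment method, model parameters, and so on), and hand-writing long, complex YAML files had become a development bottleneck.
Configuring evaluations in natural language
With the new nel-assistant agent skill, developers can configure production-ready evaluations in natural language, with no need to write YAML files or shell commands by hand.
Integration with agentic development tools
The skill is built so that evaluations can be configured, run, and monitored directly within Cursor or any other preferred agentic development tool, so it fits into the developer's existing workflow.
Built on NVIDIA NeMo Evaluator
The feature is built on the NVIDIA NeMo Evaluator library and draws on NVIDIA's LLM evaluation infrastructure.
Simpler evaluation setup through agent skills
With the agent skill, evaluation configs are generated through conversation instead of writing and debugging complex YAML files by hand.
Automatic model card research and tuning
The agent skill researches the model card and automatically identifies suitable temperature, top_p, and context length values, along with the tensor parallelism that fits the GPU setup.
A three-phase approach to configuration
In the configure phase, the skill establishes context from five questions: execution environment, deployment backend, export destination, model type, and benchmark category.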
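Impact analysis
This announcement meaningfully streamlines LLM development and evaluation workflows, and it addresses one practical problem in particular: the complexity of evaluation configuration. Natural-language configuration lowers the technical barrier, which may allow more developers to run sophisticated LLM evaluations and could help democratize AI development.
Editorial comment
A practical solution that simplifies LLM evaluation setup through natural language and speaks directly to an everyday pain point for developers. The announcement focuses on practicality and workflow improvement rather than technical novelty.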

Conversational LLM Evaluations in Minutes with NVIDIA NeMo Evaluator Agent Skills
By Seph Mard, Besmira Nushi, Grzegorz Chlebus, Piotr Januszewski, Pablo Ribalta, Sylendran Arunagiri, VivienneZhang, and Nik Spirin (NVIDIA)
Running LLM evaluations should not require manually drafting long and complex YAML files. For developers, configuration overhead often becomes the bottleneck. The new nel-assistant agent skill enables natural language configuration of production-ready evaluations.
Built on the NVIDIA NeMo Evaluator library, it allows developers to configure, run, and monitor evaluations directly within Cursor or any other preferred agentic development tool. All of this happens through interaction with the agent, without manually writing YAML files or shell commands.
The Problem: Configuration Overhead
Running a single LLM evaluation means making dozens of interconnected decisions:
Execution: Local Docker or SLURM cluster?
Deployment: vLLM, SGLang, NVIDIA NIM, NVIDIA TensorRT-LLM, or external endpoint? How many nodes?
Model: What temperature? What context length? Does it use reasoning tokens?
Benchmarks: Tau2-Bench, MTEB, GSM8K, AIME, GPQA, LiveCodeBench, RULER, more? All of the above?
Export: Local files, CSV, Weights & Biases, or MLflow?
Each choice spawns sub-choices. Using vLLM? Configure tensor parallelism. Running reasoning models? Parse thinking tokens. Multi-node SLURM? Set up HAProxy load balancing. The result is a long, complex YAML config that is easy to get wrong and hard to debug.
    # A typical evaluation config - 200+ lines, easy to misconfigure
    execution:
      backend: slurm
      params:
        account: ???            # What's my account name again?
        partition: ???
        time_limit: "24:00:00"
    deployment:
      backend: vllm
      params:
        tensor_parallel_size: ???   # 2? 4? 8?
        max_model_len: ???          # Model card says... 32K? 128K?
    model:
      sampling_params:
        temperature: ???        # Was it 0.6 or 0.8?
        top_p: ???
    # ... 150 more lines ...
This configuration burden turns what should be a simple evaluation into an unnecessarily complex task.
The Solution: Agent Skills for LLM Evaluation
Agent skills have become the standard way to give agents deep domain expertise. A "skill" packages instructions, scripts, and resources into portable folders that any compatible agent can discover and use.
In the coding-assistant space, the ecosystem has grown rapidly and now covers everything from Git workflows to API integrations. The nel-assistant is an agent skill that turns Cursor, Claude Code, Codex, and other agentic developer tools into an evaluation expert. Instead of authoring YAML by hand, you have a conversation. The skill researches model cards to find recommended temperatures, top_p values, and context lengths, and identifies a suitable tensor parallelism for your specific GPU setup.
One conversation. Zero manual YAML debugging. Production-ready config.
    You: "Evaluate NVIDIA Nemotron-3-Nano-30B-A3B on standard benchmarks using vLLM locally. Export to Weights & Biases."

    Agent: I'll help you set that up. First, let me check your environment...

    ✓ NeMo Evaluator 26.01 detected

    I'll need 5 quick answers to build your config:
    1. Execution: local (detected from your request)
    2. Deployment: vllm (confirmed)
    3. Export: wandb (confirmed)
    4. Model type: chat or base? [chat]
    5. Benchmark category: standard (confirmed)

    Now let me research the model card...
    ✓ Found: temperature=0.6, top_p=0.95, context=128K
    ✓ Optimal TP=8 for your 2x H100 setup

    Generated config: Nemotron-3-Nano-30B-A3B.yaml
    Ready to run!
How It Works: Three Phases
Phase 1: Configure
The skill starts by asking five targeted questions to establish context:
Execution environment: Local or SLURM?
Deployment backend: vLLM, SGLang, NVIDIA NIM, NVIDIA TensorRT-LLM, or external?
Export destination: None, MLflow, or Weights & Biases?
Model type: Base, chat, or reasoning?
Benchmark categories: Standard, code, math, safety, or multilingual?
From these answers, it calls:
    nel skills build-config \
      --execution local \
      --deployment vllm \
      --model-type chat \
      --benchmarks standard
This deep-merges modular YAML templates, each a tested, schema-compliant fragment, into a structurally valid config. Because the agent never writes free-form YAML when it uses the skill, the usual source of syntax errors is removed.
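To make the deep-merge idea concrete, here is a minimal sketch in Python. It is not the library's implementation; the `deep_merge` helper and the three fragment dictionaries are hypothetical stand-ins for parsed YAML templates, and the merge rule (nested mappings are combined, later scalar values win) is an assumption about how such a merge would typically behave.

```python
from copy import deepcopy


def deep_merge(base: dict, overlay: dict) -> dict:
    """Return a new dict with overlay merged into base, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys: the later fragment wins.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical fragments standing in for parsed template files.
execution_local = {"execution": {"backend": "local"}}
deployment_vllm = {
    "deployment": {"backend": "vllm", "params": {"tensor_parallel_size": "???"}}
}
model_overrides = {"deployment": {"params": {"tensor_parallel_size": 2}}}

config = {}
for fragment in (execution_local, deployment_vllm, model_overrides):
    config = deep_merge(config, fragment)

print(config)
```

Because each fragment is already a valid mapping, the merged result is a valid mapping too; only the values inside it still need checking.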
Next, the agent automatically analyzes the model card and applies optimal configuration parameters.
Give the agent a Hugging Face handle, such as NVIDIA-Nemotron-3-Nano-30B-A3B-BF16, and it extracts:
Sampling params: Temperature, top_p
Hardware logic: Optimal TP/DP settings based on your GPU count
Reasoning config: System prompts, payload modifiers (e.g., enable_thinking)
Context length: Max model length for vLLM --max-model-len
Developers no longer need to search through model cards to find the right settings. The agent reads the model details and applies the correct parameters automatically.
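The article does not spell out how the hardware logic chooses TP/DP values, so the following is only an illustrative heuristic, not the skill's actual rule. It assumes weights dominate memory use, that tensor-parallel sizes are powers of two, and that a fixed fraction of each GPU (the hypothetical `headroom` parameter) is available for weights. The function name and all parameters are invented for this sketch.

```python
def suggest_parallelism(params_billions: float, gpu_count: int, gpu_mem_gb: float,
                        bytes_per_param: int = 2, headroom: float = 0.6) -> dict:
    """Pick the smallest power-of-two tensor-parallel (TP) size whose combined
    memory budget holds the weights; leftover GPUs become data-parallel replicas."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    weights_gb = params_billions * bytes_per_param
    tp = 1
    while tp <= gpu_count:
        if weights_gb <= tp * gpu_mem_gb * headroom:
            return {"tensor_parallel_size": tp, "data_parallel_size": gpu_count // tp}
        tp *= 2
    raise ValueError("model weights do not fit on the available GPUs")


# A 70B-parameter model in 16-bit precision on four 80 GB GPUs.
print(suggest_parallelism(70, gpu_count=4, gpu_mem_gb=80))
# An 8B-parameter model on eight 80 GB GPUs: one GPU per replica is enough.
print(suggest_parallelism(8, gpu_count=8, gpu_mem_gb=80))
```

A real calculation would also account for KV cache size, context length, and activation memory, which is why the output of a heuristic like this should be treated as a starting point.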
Without the skill, this usually means jumping between Hugging Face, blog posts, and documentation. It takes time and breaks focus. With the skill, the setup happens in seconds.
Phase 2: Validate and Refine
The skill identifies the remaining ??? placeholders in the config, which typically include:
SLURM details: Account names, partition names, time limits
Export URIs: WandB project names, MLflow tracking URIs
API keys: Environment variables for deployments
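Finding the unfilled values is easy to picture as a walk over the parsed config. The sketch below is an assumption about how such a check could look, not the skill's code; `find_placeholders` and the sample dictionary are hypothetical, and the only detail taken from the article is that missing values are written as `???`.

```python
def find_placeholders(config, prefix=""):
    """Return the dotted path of every value still set to the '???' placeholder."""
    missing = []
    if isinstance(config, dict):
        for key, value in config.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            missing.extend(find_placeholders(value, path))
    elif isinstance(config, list):
        for index, value in enumerate(config):
            missing.extend(find_placeholders(value, f"{prefix}[{index}]"))
    elif config == "???":
        missing.append(prefix)
    return missing


# Hypothetical parsed config with two values left for the user to supply.
sample = {
    "execution": {
        "backend": "slurm",
        "params": {"account": "???", "partition": "???", "time_limit": "24:00:00"},
    },
    "deployment": {"backend": "vllm"},
}

for path in find_placeholders(sample):
    print(f"needs a value: {path}")
```

Reporting full paths such as `execution.params.account` lets the agent ask one targeted question per missing value.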
You can interactively:
Add/remove tasks: Browse nel ls tasks
Override per-task settings: "Use temperature=0 for HumanEval but 0.7 for MMLU"
Configure advanced scaling: For >120B models, set up data-parallel multi-node with HAProxy load balancing
Add reasoning interceptors: Strip <think>
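As a rough illustration of what stripping reasoning output involves, the sketch below removes thinking blocks from a response before it is scored. It is not the interceptor shipped with NeMo Evaluator; the `strip_reasoning` function is hypothetical, and the closing `</think>` tag is an assumption, since the text above names only the opening tag.

```python
import re

# Assumes reasoning is wrapped in <think>...</think>; the closing tag is an assumption.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove every <think>...</think> block and return the remaining answer text."""
    return THINK_BLOCK.sub("", text).strip()


response = "<think>\n2 + 2 equals 4.\n</think>\nThe answer is 4."
print(strip_reasoning(response))
```

The non-greedy `.*?` with `re.DOTALL` keeps each match confined to a single block, even when the reasoning spans several lines or a response contains more than one block.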
Phase 3: Run and Monitor
The agent proposes a three-tier staged rollout: Dry run, Smoke test, and Full run.
    # 1. Dry run - validate without execution
    nel run --config nemotron-3-nano.yaml --dry-run

    # 2. Smoke test - 10 samples per task
    nel run --config nemotron-3-nano.yaml \
      -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10

    # 3. Full run
    nel run --config nemotron-3-nano.yaml
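The value of staging is that each tier gates the next. A small wrapper like the following could enforce that ordering outside the agent; it is a sketch under the assumption that `nel run` signals failure with a non-zero exit code. The `run_staged` function and its `runner` parameter are hypothetical, and the commands are copied from the three steps above.

```python
import subprocess

CONFIG = "nemotron-3-nano.yaml"

# The three tiers, using the same commands as the rollout above.
STAGES = [
    ("dry run", ["nel", "run", "--config", CONFIG, "--dry-run"]),
    ("smoke test", ["nel", "run", "--config", CONFIG, "-o",
                    "++evaluation.nemo_evaluator_config.config.params.limit_samples=10"]),
    ("full run", ["nel", "run", "--config", CONFIG]),
]


def run_staged(stages=STAGES, runner=subprocess.run):
    """Run each stage in order and stop at the first non-zero exit code.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, command in stages:
        result = runner(command)
        if result.returncode != 0:
            print(f"{name} failed with exit code {result.returncode}; stopping")
            break
        completed.append(name)
    return completed
```

Calling `run_staged()` with the defaults requires the `nel` CLI on the PATH. Passing a different `runner` makes it possible to exercise the gating logic without launching a real evaluation.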
Once submitted, progress can be monitored directly in Cursor using commands for status, detailed metrics, and live logs. You never leave your coding environment!
    Please, check the evaluation progress.

    # Agent runs: nel status nemotron-3-nano-20260212-143022 && nel info ...

    Status: RUNNING
    Progress: 3/8 tasks completed
    - ✓ mmlu: 65.2% accuracy (5 hours)
    - ✓ hellaswag: 78.4% accuracy (2 hours)
    - ✓ arc_challenge: 53.8% accuracy (1 hour)
    - ⏳ truthfulqa_mc2: 45% complete...
    - ⏳ winogrande: In queue
    - ⏳ gsm8k: In queue
    - ⏳ humaneval: In queue
    - ⏳ mbpp: In queue
Technical Details
Template-Based Generation
Instead of generating YAML from scratch, nel-assistant merges modular templates for execution, deployment, benchmarks, and exports. This deep merge ensures structural validity.
Model Card Extraction Pipeline
Cursor or your agentic IDE fetches the HuggingFace model card via web search.
Extraction via regex identifies parameters and chat templates.
Hardware logic calculates optimal TP/DP based on model size and available GPU memory.
Reasoning detection checks for keywords like "reasoning" or "chain-of-thought."
Values are injected directly into the config YAML.
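The extraction and detection steps can be sketched in a few lines. This is an assumed shape for the pipeline rather than the skill's actual code: the regular expressions, the `extract_settings` function, and the model card excerpt are all hypothetical, and real model cards vary enough that a production extractor would need more patterns.

```python
import re

# Hypothetical excerpt written for this example; not quoted from a real model card.
MODEL_CARD = """
We recommend temperature=0.6 and top_p=0.95 for reasoning tasks.
The model supports a context length of up to 128K tokens.
"""

PATTERNS = {
    "temperature": re.compile(r"temperature\s*[=:]\s*([0-9]*\.?[0-9]+)", re.IGNORECASE),
    "top_p": re.compile(r"top[_-]?p\s*[=:]\s*([0-9]*\.?[0-9]+)", re.IGNORECASE),
}
REASONING_KEYWORDS = ("reasoning", "chain-of-thought")


def extract_settings(card_text: str) -> dict:
    """Pull sampling parameters out of model card text and flag reasoning models."""
    settings = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(card_text)
        if match:
            settings[name] = float(match.group(1))
    lowered = card_text.lower()
    settings["is_reasoning"] = any(keyword in lowered for keyword in REASONING_KEYWORDS)
    return settings


print(extract_settings(MODEL_CARD))
```

Parameters that the card does not mention are simply absent from the result, which leaves room for the config's defaults or a follow-up question to the user.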
Generic LLMs hallucinate YAML syntax. They mix incompatible backends. They invent flags that don't exist.
Instead of generating YAML from scratch, nel skills build-config merges modular templates:
    templates/
    ├── execution/
    │   ├── local.yaml           # Docker execution
    │   └── slurm.yaml           # SLURM execution
    ├── deployment/
    │   ├── vllm.yaml            # vLLM backend
    │   ├── sglang.yaml          # SGLang backend
    │   └── nim.yaml             # NVIDIA NIM
    ├── benchmarks/
    │   ├── reasoning.yaml       # GPQA-D, HellaSwag, SciCode, MATH, AIME
    │   ├── agentic.yaml         # TerminalBench, SWE-Bench
    │   ├── longcontext.yaml     # AA-LCR, RULER
    │   ├── instruction.yaml     # IFBench, ArenaHard
    │   └── multi-lingual.yaml   # MMLU-ProX, WMT24++
    └── export/
        ├── wandb.yaml           # W&B integration
        └── mlflow.yaml          # MLflow integration
Deep merge = structural validity. You can't produce invalid YAML when you're composing pre-validated fragments.
The nel-assistant uses build-config for this step, so the agent never writes free-form YAML.
Configuration Should Not Be a Bottleneck
LLM evaluation already involves important decisions — selecting benchmarks, interpreting results, and comparing models. Configuration should support that process, not slow it down.
The nel-assistant skill makes it invisible. You describe what you want in natural language, and the agent handles the rest: researching model cards, generating configs, validating setups, staging rollouts, and monitoring progress.
No more 200-line YAML files. No more hunting through documentation. No more syntax errors.
Just: "Evaluate this model on these benchmarks."
GitHub: NVIDIA NeMo Evaluator
Tutorial: nel-assistant
Agent Skills Spec: agentskills.io
The nel-assistant skill is open-source and ships with NVIDIA NeMo Evaluator 26.01+. Contributions welcome on GitHub!
