NeMo Evaluator
June 3, 2026 · View on GitHub
Documentation | GitHub | Issues
LLM evaluation framework with benchmark environments, pluggable solvers, composable interceptor proxy, and multi-format reporting.
Install
pip install -e . # core
pip install -e ".[scoring]" # + sympy for symbolic math
pip install -e ".[stats]" # + scipy (regression analysis)
pip install -e ".[scoring,stats]" # + sympy + scipy for confidence intervals
pip install -e ".[harbor]" # + Harbor agents (OpenHands, Terminus-2)
pip install -e ".[inspect]" # + Inspect AI log export
pip install -e ".[all]" # common runtime integrations
Quick Start
export NVIDIA_API_KEY="your-api-key-here"
# Run a benchmark from the CLI
nel eval run --bench mmlu \
--model-url https://integrate.api.nvidia.com/v1 \
--model-id nvidia/nemotron-3-super-120b-a12b \
--api-key $NVIDIA_API_KEY \
--repeats 3 --max-problems 100
# Run from a YAML config
nel eval run config.yaml
nel eval run config.yaml --resume
# Generate a report
nel eval report ./eval_results/ -f markdown -o report.md
Benchmarks
17 built-in benchmarks plus external harness integrations:
| Benchmark | Type | Scoring |
|---|---|---|
| mmlu, mmlu_pro, gpqa | Multichoice | multichoice_regex |
| gsm8k, math500, mgsm | Math | numeric_match / answer_line |
| drop, triviaqa | QA | fuzzy_match |
| humaneval | Code | code_sandbox (Docker) |
| simpleqa, healthbench | Judge | needs_judge |
| pinchbench | Agentic | code_sandbox / needs_judge |
| xstest | Safety | needs_judge |
| terminal-bench-hard, terminal-bench-v1 | Terminal tasks | Task test harness |
| nmp_harbor | Agentic NMP | Harbor task tests |
External environments via URI schemes: lm-eval://, skills://, vlmevalkit://, gym://, harbor://, container://.
Adapter Proxy
Built-in local interceptor proxy for LLM traffic. Intercepts all agent-to-model requests for caching, logging, payload modification, turn limiting, and custom transformations — no external dependencies required.
services:
nemotron:
type: api
url: https://integrate.api.nvidia.com/v1/chat/completions
protocol: chat_completions
model: nvidia/nemotron-3-super-120b-a12b
api_key: ${NVIDIA_API_KEY}
proxy:
request_timeout: 600
interceptors:
- name: turn_counter
config:
max_turns: 100
- name: drop_params
config:
params: [max_tokens]
verbose: true
Available interceptors:
| Interceptor | Stage | Description |
|---|---|---|
endpoint | request→response | Async HTTP forwarding with retry, backoff, connection pooling |
caching | request→response | Disk-backed SQLite cache with canonical keys |
turn_counter | request | Per-session turn counting with budget injection |
drop_params | request | Strip named parameters from requests |
modify_tools | request | Add/remove properties in tool schemas |
system_message | request | Inject/replace/prepend system messages |
payload_modifier | request | Recursive parameter add/remove/rename |
raise_client_errors | response | Convert 4xx to exceptions |
log_tokens | response | Log token usage per request |
response_stats | response | Aggregate timing and token statistics |
reasoning | response | Normalize <think> blocks to reasoning_content |
progress_tracking | response | Progress counter with optional webhook |
logging | request + response | Request/response logging with body preview |
Solvers
Configured via solver.type in each benchmark:
| Solver Type | Config type | Use Case |
|---|---|---|
| SimpleSolver | simple | Standard chat/completion/VLM (default) |
| HarborSolver | harbor | Harbor agents (OpenHands, Terminus-2, etc.) |
| ToolCallingSolver | tool_calling | Tool-use with Gym resource servers |
| GymDelegationSolver | gym_delegation | Delegate to nemo-gym server |
| OpenClawSolver | openclaw | OpenClaw CLI agent |
| ContainerSolver | container | Legacy container harness |
Export
Evaluation results can be exported to experiment trackers and compatible formats:
output:
export: [inspect, wandb, mlflow]
inspect— Producesinspect_ai-compatibleEvalLogJSON files. Install withpip install -e ".[inspect]".wandb/mlflow— Push scores and artifacts to experiment trackers. Install withpip install -e ".[export]".
BYOB (Bring Your Own Benchmark)
from nemo_evaluator import benchmark, scorer, ScorerInput, exact_match
@benchmark(name="my-bench", dataset="hf://my-org/data?split=test",
prompt="Q: {question}\nA:", target_field="answer")
@scorer
def my_scorer(sample: ScorerInput) -> dict:
return exact_match(sample)
Sandboxes
Per-problem Docker/SLURM sandboxes for code execution and agentic evaluation. Two modes: stateful (shared sandbox for solve + verify) and stateless (separate agent and verification containers with shared volume).
SLURM
Pyxis/Enroot-based execution with auto-selected container images per URI scheme. Uses node_pools topology for flexible resource allocation across model, agent, and sandbox nodes.
| Tag suffix | Contents |
|---|---|
:latest | Base + gym + vlmevalkit |
:latest-lm-eval | + lm-evaluation-harness |
:latest-skills | + NeMo Skills |
:latest-full | All harnesses |
CLI
| Command | Purpose |
|---|---|
nel eval run | Run evaluation (name or YAML) |
nel eval merge <dir> | Merge sharded results |
nel eval report <dir> | Generate reports |
nel list | List benchmarks |
nel serve -b <name> | Serve as HTTP endpoint |
nel validate -b <name> | Sanity check |
nel export <paths> --dest <exporter> | Export bundles |
nel cache-sqsh <image> | Build a SLURM .sqsh cache image |
nel report <dir> | Generate multi-benchmark reports |
nel compare | Paired run comparison |
nel gate | Multi-benchmark quality gate |
nel config | Persistent user config |
nel package | Containerize BYOB benchmark |
Compare Results Between Runs
Use nel compare when you want to compare two runs of the same benchmark and inspect score deltas, flips, and statistical evidence.
nel compare ./results/baseline ./results/candidate --strict
Full tutorial: docs/tutorials/compare.md
Implement Quality Gates
Use nel gate when you want one GO / NO-GO / INCONCLUSIVE decision across multiple benchmarks from an explicit policy file.
nel gate ./results/baseline ./results/candidate \
--policy gate_policy.yaml \
--strict \
--output gate_report.json
Full tutorial: docs/tutorials/quality-gate.md
Examples
See examples/configs/ for 25+ end-to-end configs covering all solver types, verification methods, and execution backends.