NeMo Evaluator

June 3, 2026

License: Apache 2.0 | Python 3.12-3.13 | Code style: ruff


LLM evaluation framework with benchmark environments, pluggable solvers, composable interceptor proxy, and multi-format reporting.


Install

pip install -e .                   # core
pip install -e ".[scoring]"        # + sympy for symbolic math
pip install -e ".[stats]"          # + scipy (regression analysis)
pip install -e ".[scoring,stats]"  # + sympy + scipy for confidence intervals
pip install -e ".[harbor]"         # + Harbor agents (OpenHands, Terminus-2)
pip install -e ".[inspect]"        # + Inspect AI log export
pip install -e ".[all]"            # common runtime integrations

Quick Start

export NVIDIA_API_KEY="your-api-key-here"

# Run a benchmark from the CLI
nel eval run --bench mmlu \
  --model-url https://integrate.api.nvidia.com/v1 \
  --model-id nvidia/nemotron-3-super-120b-a12b \
  --api-key "$NVIDIA_API_KEY" \
  --repeats 3 --max-problems 100

# Run from a YAML config
nel eval run config.yaml
nel eval run config.yaml --resume

# Generate a report
nel eval report ./eval_results/ -f markdown -o report.md

Benchmarks

17 built-in benchmarks plus external harness integrations:

| Benchmark | Type | Scoring |
| --- | --- | --- |
| mmlu, mmlu_pro, gpqa | Multichoice | multichoice_regex |
| gsm8k, math500, mgsm | Math | numeric_match / answer_line |
| drop, triviaqa | QA | fuzzy_match |
| humaneval | Code | code_sandbox (Docker) |
| simpleqa, healthbench | Judge | needs_judge |
| pinchbench | Agentic | code_sandbox / needs_judge |
| xstest | Safety | needs_judge |
| terminal-bench-hard, terminal-bench-v1 | Terminal tasks | Task test harness |
| nmp_harbor | Agentic NMP | Harbor task tests |

External environments via URI schemes: lm-eval://, skills://, vlmevalkit://, gym://, harbor://, container://.
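To make the URI-scheme idea concrete, the sketch below separates built-in benchmark names from scheme-prefixed specs using the standard library. The function `classify_benchmark`, the dispatch rule, and the example task names are hypothetical; only the scheme names come from this README.

```python
from urllib.parse import urlsplit

# Schemes named in this README; the dispatch logic itself is an assumption.
KNOWN_SCHEMES = {"lm-eval", "skills", "vlmevalkit", "gym", "harbor", "container"}


def classify_benchmark(spec: str) -> tuple[str, str]:
    """Return ("builtin", name) or (scheme, remainder) for a benchmark spec."""
    parts = urlsplit(spec)
    if parts.scheme in KNOWN_SCHEMES:
        # Everything after "scheme://" identifies the external task.
        return parts.scheme, parts.netloc + parts.path
    return "builtin", spec


print(classify_benchmark("mmlu"))
print(classify_benchmark("lm-eval://my_task"))  # "my_task" is a placeholder
```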

Adapter Proxy

Built-in local interceptor proxy for LLM traffic. It intercepts all agent-to-model requests for caching, logging, payload modification, turn limiting, and custom transformations, with no external dependencies required.

services:
  nemotron:
    type: api
    url: https://integrate.api.nvidia.com/v1/chat/completions
    protocol: chat_completions
    model: nvidia/nemotron-3-super-120b-a12b
    api_key: ${NVIDIA_API_KEY}
    proxy:
      request_timeout: 600
      interceptors:
        - name: turn_counter
          config:
            max_turns: 100
        - name: drop_params
          config:
            params: [max_tokens]
      verbose: true
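The config above chains two request-stage interceptors: a turn limit and a parameter filter. The following is a minimal sketch of how such a chain could be composed, not the proxy's actual code. The names `Interceptor`, `run_chain`, and the factory functions are hypothetical, the counter is global rather than per-session, and it raises on overflow instead of injecting a budget.

```python
from typing import Callable

Request = dict
Interceptor = Callable[[Request], Request]


def drop_params(params: list[str]) -> Interceptor:
    """Build an interceptor that strips the named top-level parameters."""
    def apply(request: Request) -> Request:
        return {k: v for k, v in request.items() if k not in params}
    return apply


def turn_counter(max_turns: int) -> Interceptor:
    """Build an interceptor that counts calls and rejects requests over budget."""
    turns = 0

    def apply(request: Request) -> Request:
        nonlocal turns
        turns += 1
        if turns > max_turns:
            raise RuntimeError(f"turn budget of {max_turns} exceeded")
        return request
    return apply


def run_chain(request: Request, chain: list[Interceptor]) -> Request:
    """Pass the request through each interceptor in order."""
    for interceptor in chain:
        request = interceptor(request)
    return request


chain = [turn_counter(max_turns=2), drop_params(["max_tokens"])]
outgoing = run_chain({"model": "m", "max_tokens": 64, "messages": []}, chain)
print(outgoing)  # max_tokens is removed; other keys pass through
```

Order matters in a chain like this: placing the turn counter first means a request that exceeds the budget is rejected before any later interceptor does work on it.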

Available interceptors:

| Interceptor | Stage | Description |
| --- | --- | --- |
| endpoint | request→response | Async HTTP forwarding with retry, backoff, connection pooling |
| caching | request→response | Disk-backed SQLite cache with canonical keys |
| turn_counter | request | Per-session turn counting with budget injection |
| drop_params | request | Strip named parameters from requests |
| modify_tools | request | Add/remove properties in tool schemas |
| system_message | request | Inject/replace/prepend system messages |
| payload_modifier | request | Recursive parameter add/remove/rename |
| raise_client_errors | response | Convert 4xx to exceptions |
| log_tokens | response | Log token usage per request |
| response_stats | response | Aggregate timing and token statistics |
| reasoning | response | Normalize <think> blocks to reasoning_content |
| progress_tracking | response | Progress counter with optional webhook |
| logging | request + response | Request/response logging with body preview |
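The caching row mentions canonical keys backed by SQLite. One plausible reading, sketched below under stated assumptions, is that a request payload is serialized deterministically and hashed so that equivalent requests share a cache entry. The functions `canonical_key`, `open_cache`, and `cached_call`, the table layout, and the choice of SHA-256 are all hypothetical and not taken from the project.

```python
import hashlib
import json
import sqlite3
from typing import Callable


def canonical_key(payload: dict) -> str:
    """Hash a payload so that dictionary key order does not affect the result."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a cache database; pass a file path for disk backing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT)"
    )
    return conn


def cached_call(conn: sqlite3.Connection, payload: dict,
                send: Callable[[dict], dict]) -> dict:
    """Return a stored response if one exists, otherwise call send and store it."""
    key = canonical_key(payload)
    row = conn.execute("SELECT response FROM cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return json.loads(row[0])
    response = send(payload)
    conn.execute("INSERT INTO cache (key, response) VALUES (?, ?)",
                 (key, json.dumps(response)))
    conn.commit()
    return response


a = canonical_key({"model": "m", "temperature": 0.0})
b = canonical_key({"temperature": 0.0, "model": "m"})
print(a == b)  # True: same content, different key order
```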

Solvers

Configured via solver.type in each benchmark:

| Solver Type | Config type | Use Case |
| --- | --- | --- |
| SimpleSolver | simple | Standard chat/completion/VLM (default) |
| HarborSolver | harbor | Harbor agents (OpenHands, Terminus-2, etc.) |
| ToolCallingSolver | tool_calling | Tool-use with Gym resource servers |
| GymDelegationSolver | gym_delegation | Delegate to nemo-gym server |
| OpenClawSolver | openclaw | OpenClaw CLI agent |
| ContainerSolver | container | Legacy container harness |

Export

Evaluation results can be exported to experiment trackers and compatible formats:

output:
  export: [inspect, wandb, mlflow]

• inspect: Produces inspect_ai-compatible EvalLog JSON files. Install with pip install -e ".[inspect]".
• wandb / mlflow: Push scores and artifacts to experiment trackers. Install with pip install -e ".[export]".

BYOB (Bring Your Own Benchmark)

from nemo_evaluator import benchmark, scorer, ScorerInput, exact_match

@benchmark(name="my-bench", dataset="hf://my-org/data?split=test",
           prompt="Q: {question}\nA:", target_field="answer")
@scorer
def my_scorer(sample: ScorerInput) -> dict:
    return exact_match(sample)
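To show how the `prompt` template and `target_field` in the decorator presumably relate to a dataset row, here is a standalone sketch that does not import `nemo_evaluator`. The row contents, the helper `exact_match_sketch`, and its return shape are assumptions for illustration; the library's `exact_match` may normalize text or return different fields.

```python
# Hypothetical dataset row with the fields the decorator refers to.
row = {"question": "What is 2 + 2?", "answer": "4"}

prompt_template = "Q: {question}\nA:"
target_field = "answer"

prompt = prompt_template.format(**row)  # fill placeholders from the row
target = row[target_field]              # reference answer used for scoring


def exact_match_sketch(response: str, target: str) -> dict:
    """Hypothetical stand-in: score 1.0 when the stripped strings are equal."""
    return {"correct": float(response.strip() == target.strip())}


print(prompt)
print(exact_match_sketch(" 4 ", target))
```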

Sandboxes

Per-problem Docker/SLURM sandboxes for code execution and agentic evaluation. Two modes: stateful (shared sandbox for solve + verify) and stateless (separate agent and verification containers with shared volume).

SLURM

Pyxis/Enroot-based execution with auto-selected container images per URI scheme. Uses node_pools topology for flexible resource allocation across model, agent, and sandbox nodes.

| Tag suffix | Contents |
| --- | --- |
| :latest | Base + gym + vlmevalkit |
| :latest-lm-eval | + lm-evaluation-harness |
| :latest-skills | + NeMo Skills |
| :latest-full | All harnesses |

CLI

| Command | Purpose |
| --- | --- |
| nel eval run | Run evaluation (name or YAML) |
| nel eval merge <dir> | Merge sharded results |
| nel eval report <dir> | Generate reports |
| nel list | List benchmarks |
| nel serve -b <name> | Serve as HTTP endpoint |
| nel validate -b <name> | Sanity check |
| nel export <paths> --dest <exporter> | Export bundles |
| nel cache-sqsh <image> | Build a SLURM .sqsh cache image |
| nel report <dir> | Generate multi-benchmark reports |
| nel compare | Paired run comparison |
| nel gate | Multi-benchmark quality gate |
| nel config | Persistent user config |
| nel package | Containerize BYOB benchmark |

Compare Results Between Runs

Use nel compare to compare two runs of the same benchmark and inspect score deltas, flips, and statistical evidence.

nel compare ./results/baseline ./results/candidate --strict

Full tutorial: docs/tutorials/compare.md
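This README does not say which statistics nel compare reports. As an illustration of what a paired comparison with flips can look like, the sketch below counts per-problem changes and applies an exact two-sided sign test to the discordant pairs (equivalent to McNemar's exact test). The function `paired_flips` and its output fields are hypothetical; the tutorial describes the real behavior.

```python
from math import comb


def paired_flips(baseline: list[bool], candidate: list[bool]) -> dict:
    """Summarize per-problem correctness changes between two aligned runs."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("runs must be non-empty and cover the same problems")
    pairs = list(zip(baseline, candidate))
    gained = sum(1 for b, c in pairs if not b and c)  # wrong -> right
    lost = sum(1 for b, c in pairs if b and not c)    # right -> wrong
    delta = (sum(candidate) - sum(baseline)) / len(baseline)

    # Exact two-sided sign test: under the null hypothesis, each flip is
    # equally likely to be a gain or a loss.
    flips = gained + lost
    if flips == 0:
        p_value = 1.0
    else:
        tail = sum(comb(flips, i) for i in range(min(gained, lost) + 1))
        p_value = min(1.0, 2 * tail / 2**flips)
    return {"delta": delta, "gained": gained, "lost": lost, "p_value": p_value}


summary = paired_flips([True, False, False, True], [True, True, True, True])
print(summary)
```

Pairing by problem matters because two runs with the same aggregate score can still disagree on many individual problems, and only the flips carry evidence about which run is better.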

Implement Quality Gates

Use nel gate when you want one GO / NO-GO / INCONCLUSIVE decision across multiple benchmarks from an explicit policy file.

nel gate ./results/baseline ./results/candidate \
  --policy gate_policy.yaml \
  --strict \
  --output gate_report.json

Full tutorial: docs/tutorials/quality-gate.md
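The policy file format is not shown in this README. The sketch below is one hypothetical way to turn per-benchmark confidence intervals on the score delta into a single GO / NO-GO / INCONCLUSIVE verdict; the functions, the `max_regression` threshold, and the aggregation rule are assumptions, not the policy schema used by nel gate.

```python
def gate_benchmark(ci_low: float, ci_high: float, max_regression: float) -> str:
    """Classify one benchmark from the confidence interval of its score delta."""
    if ci_high < -max_regression:
        return "NO-GO"         # even the optimistic bound is a regression
    if ci_low >= -max_regression:
        return "GO"            # even the pessimistic bound is within tolerance
    return "INCONCLUSIVE"      # the interval straddles the tolerance


def gate(results: dict[str, tuple[float, float]], max_regression: float) -> str:
    """Combine per-benchmark verdicts; the worst verdict wins."""
    verdicts = {
        gate_benchmark(low, high, max_regression)
        for low, high in results.values()
    }
    if "NO-GO" in verdicts:
        return "NO-GO"
    if "INCONCLUSIVE" in verdicts:
        return "INCONCLUSIVE"
    return "GO"


# Illustrative intervals only; these are not measured results.
print(gate({"mmlu": (-0.005, 0.02), "gsm8k": (0.0, 0.03)}, max_regression=0.01))
```

A three-way outcome lets the gate distinguish a confirmed regression from a run that simply lacks enough samples to decide.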

Examples

See examples/configs/ for 25+ end-to-end configs covering all solver types, verification methods, and execution backends.

License

Apache 2.0