Comparing Evaluation Runs

June 3, 2026 · View on GitHub

The Problem

You trained a new model, changed a prompt, or quantized a checkpoint. The headline score moved by a fraction of a percent. Is the change real, or is it noise?

A single average score hides critical information:

A candidate can swap improvements on some problems for regressions on others and keep the same headline.
A small benchmark (HumanEval: 164 items) can fluctuate by several percentage points between runs just from sampling noise.
A regression on an important subset (math word problems, long-context retrieval) can be invisible in the aggregate.

Eyeballing two numbers and declaring "it went down 0.3 points" is not a measurement. It is a guess.

The Approach

nel compare treats the comparison as a paired statistical test. It evaluates the same problems with both models and analyzes what changed at the individual problem level:

Pair results by (problem_idx, repeat) across baseline and candidate.
Classify each pair: did the problem flip from correct to incorrect (regression), incorrect to correct (improvement), or stay the same?
Test whether the regression count is significantly larger than would occur by chance, using McNemar's exact binomial test.
Report the effect size, confidence interval, and a human-readable verdict.

This is the same methodology described in "When LLMs Get Significantly Worse" (ICLR 2026). The key insight is that paired analysis dramatically reduces noise compared to comparing two independent averages, because the difficulty of each problem is "controlled for" -- you are measuring the change in behavior on the same inputs.

Why McNemar and Not a t-test

McNemar's test focuses on discordant pairs -- problems where the two models disagree. It ignores the (often vast majority of) problems both models get right or both get wrong. This makes it far more sensitive to targeted regressions than a t-test on overall accuracy, which dilutes the signal with concordant pairs.

What the Verdicts Mean

Verdict	Meaning	What to do
PASS	No evidence of a meaningful regression	Proceed
WARN	Statistically significant change, but below the practical threshold	Inspect the flipped problems; decide whether the affected areas matter for your use case
BLOCK	Significant regression that exceeds the configured tolerance	Investigate the regressed problems; fix the candidate or reject it
INCONCLUSIVE	Not enough paired data to detect regressions at the configured threshold	Re-run with more problems, or use a larger benchmark

Walkthrough

Step 1: Produce Baseline and Candidate Evaluations

Run the same benchmark with the same configuration against both models:

export NVIDIA_API_KEY="your-api-key-here"

# Baseline
nel eval run --bench mmlu_pro \
  --model-url https://integrate.api.nvidia.com/v1 \
  --model-id baseline-model \
  --api-key $NVIDIA_API_KEY \
  --repeats 1 \
  --max-problems 500 \
  -o ./results/baseline

# Candidate
nel eval run --bench mmlu_pro \
  --model-url https://integrate.api.nvidia.com/v1 \
  --model-id candidate-model \
  --api-key $NVIDIA_API_KEY \
  --repeats 1 \
  --max-problems 500 \
  -o ./results/candidate

For a valid comparison, match the runs on: benchmark, dataset slice, prompt template, repeat count, and max problems. Differences in any of these confound the comparison.

Step 2: Run the Comparison

nel compare ./results/baseline ./results/candidate

nel compare accepts directories (it finds the eval-*.json bundle inside) or direct bundle paths.

The command outputs:

Score deltas: how each metric changed (absolute and relative)
Flip summary: how many problems regressed, improved, or stayed the same
McNemar test: p-value, effect size, and confidence interval
Verdict: PASS, WARN, BLOCK, or INCONCLUSIVE
Markdown report: auto-generated investigation document next to the candidate bundle

Step 3: Tighten the Threshold for Sensitive Comparisons

The default practical threshold is 5% (0.05 on the 0-1 scale). For quantization or fine-tuning where 1 percentage point matters:

nel compare ./results/baseline ./results/candidate --max-drop 0.01

This tells the verdict logic: "a regression is only practically meaningful if the net effect exceeds 1 pp." Smaller effects get WARN instead of BLOCK.

Step 4: Inspect What Actually Changed

Add --show-flips to see the individual problems that flipped:

nel compare ./results/baseline ./results/candidate --show-flips --verbose

This prints:

Each regressed problem: index, category, expected answer, what baseline said, what candidate said
Each improved problem: same detail
Category breakdown: which subjects or topics were hit hardest
Statistical detail: p-value, effect size, discordant pair count, minimum detectable effect

The --verbose flag adds the statistical detail. Without it, you get the flip list and verdict but not the p-values.

Step 5: Use the Investigation Report

By default, nel compare writes regression_report.md in the candidate's result directory. This Markdown file contains:

Side-by-side model responses for each flipped problem
Category-level regression rates
Statistical summary
Suggested next steps

This report is designed for humans reviewing a merge request or a model release. Attach it to the MR or the model card review.

To write it to a specific path:

nel compare ./results/baseline ./results/candidate --report ./review/mmlu_comparison.md

To suppress it:

nel compare ./results/baseline ./results/candidate --no-report

Step 6: Use in CI Pipelines

With --strict, the command returns exit codes suitable for CI:

Exit code	Verdict
0	PASS
1	BLOCK
2	WARN or INCONCLUSIVE

nel compare ./results/baseline ./results/candidate \
  --max-drop 0.01 --strict

For machine-readable output, use --format json:

nel compare ./results/baseline ./results/candidate --format json > report.json

Step 7: Use the Python API

For programmatic access:

from nemo_evaluator.engine.comparison import compare_runs, write_regression

report = compare_runs("./results/baseline/eval-mmlu.json",
                       "./results/candidate/eval-mmlu.json")

print(report["verdict"])  # PASS / WARN / BLOCK / INCONCLUSIVE

# Per-metric deltas
for metric, d in report["score_deltas"].items():
    print(f"{metric}: {d['baseline']:.4f} -> {d['candidate']:.4f}  "
          f"(delta={d['delta']:+.4f}, {d['relative_pct']:+.1f}%)")

# Flip summary
flip = report["flip_report"]["summary"]
print(f"Regressions: {flip['n_regressions']}, "
      f"Improvements: {flip['n_improvements']}, "
      f"Paired: {flip['n_paired']}")

# McNemar test
m = report["mcnemar"]
if m.get("p_value") is not None:
    print(f"McNemar p={m['p_value']:.4f}, effect={m['effect_size']:.4f}")

write_regression(report, "comparison.json")

Reference: All Flags

Flag	Default	Purpose
`--max-drop` / `-t`	0.05	Practical effect threshold (0-1 scale)
`--strict`	off	Exit non-zero on BLOCK, WARN, or INCONCLUSIVE
`--correct-above`	0.0	Reward threshold for "correct" classification. Use `0.5` for judge-scored benchmarks where reward is a continuous score.
`--show-flips`	off	Print per-problem flip details
`--verbose`	off	Show statistical details (p-values, effect sizes, power)
`--compact`	off	Short output for Slack or CI logs
`--format`	text	`text` or `json`
`--output` / `-o`	none	Write JSON report to file
`--report`	auto	Write Markdown report (default: next to candidate bundle)
`--no-report`	off	Suppress Markdown report

Understanding Sample Size and Power

A common mistake is running a small benchmark and treating the result as definitive. The comparison's statistical power depends on the number of discordant pairs -- problems where the two models disagree.

Rule of thumb:

Discordant pairs	Minimum detectable effect (80% power)
10	~28%
50	~12.5%
100	~8.8%
500	~3.9%
1000	~2.8%

If your benchmark has 164 items (HumanEval) and 90% concordance, you might have only ~16 discordant pairs. That means you can reliably detect ~22% regression rates, not 1-2 point deltas. nel compare reports the minimum detectable effect and will return INCONCLUSIVE when the test is underpowered.

When to Use `nel compare` vs `nel gate`

Need	Tool
Diagnose what changed in one benchmark	`nel compare`
Investigate why a specific benchmark regressed	`nel compare --show-flips --verbose`
Make a release decision across a suite of benchmarks	`nel gate`
CI gate on a single benchmark	`nel compare --strict`
CI gate on multiple benchmarks with per-benchmark thresholds	`nel gate --strict`

nel compare is the diagnostic tool. nel gate is the policy enforcement tool. A typical workflow uses nel gate first, then nel compare on any failing benchmarks to understand what went wrong.