Ablations: Nemotron-Nano-9B-v2

May 28, 2026

Distillation

All experiments prune Nemotron-Nano-9B-v2 → 7B and distill with teacher = Nemotron-Nano-9B-v2 (official). The final chosen blend (30pre_70post_v1v3) is in README.md.

Note

AIME and SciCode numbers in the tables below were collected with single-shot evaluation. The main README.md reports the same model at the 80B checkpoint using avg-of-N (matching the current nemo_evaluator.yaml), so absolute values for those two columns may differ by 1–2pp from the corresponding entries below. The qualitative trends (best blend, plateau points) hold either way.
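To make the single-shot versus avg-of-N distinction concrete, here is a minimal sketch of how an averaged score and its standard error could be computed from per-run scores. The helper `avg_of_n` and the input values are hypothetical placeholders for illustration; they are not part of nemo_evaluator and are not measured results.

```python
from statistics import mean, stdev


def avg_of_n(run_scores):
    """Return (mean, standard error) over N independent evaluation runs."""
    n = len(run_scores)
    if n == 0:
        raise ValueError("need at least one run")
    avg = mean(run_scores)
    # With one run there is no spread estimate: this is the single-shot case.
    sem = stdev(run_scores) / n ** 0.5 if n > 1 else None
    return avg, sem


# Placeholder scores for illustration only (not measured results).
single_shot = avg_of_n([70.0])
averaged = avg_of_n([70.0, 66.7, 73.3, 70.0])
```

A single run reports whichever value that run happened to produce, while the average over several runs also exposes the run-to-run spread, which is why the two protocols can disagree by a point or two on small benchmarks.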

Baseline: Pre-SFT-v1 Only (no post-training data)

Pure Nemotron-Pretraining-SFT-v1 data only (no post-training reasoning traces).

Tokens	MMLU	MMLU Pro	GPQA Diamond	LCB v6	AIME 2025	Math 500	IFEval	SciCode
19B	72.7	70.5	53.9	58.8	63.4	94.4	57.9	19.2
56B	73.3	71.9	54.3	62.0	63.8	95.0	58.7	17.9

Notes: Highest MMLU of any blend at a comparable token count (72.7 at 19B, 73.3 at 56B), but AIME stagnates (63.4 → 63.8) and LCB lags. Pretraining data alone is insufficient for reasoning benchmarks.

Baseline: Pure Post-Training Data (pt-v1v2)

100% post-training data (no pretraining data), Nemotron-v1/v2 blend.

Tokens	MMLU	MMLU Pro	GPQA Diamond	LCB v6	AIME 2025	Math 500	IFEval	SciCode
2.5B	71.0	69.3	52.6	54.8	58.2	94.1	51.7	14.4
5B	70.8	70.7	53.6	57.2	63.8	94.1	50.5	14.2
20B	69.8	71.7	54.7	57.5	64.7	94.6	41.9	13.4
40B	70.0	71.7	53.2	57.4	67.6	95.2	43.3	16.2

Notes: IFEval degrades badly with longer training (51.7 at 2.5B, 41.9 at 20B, 43.3 at 40B). LCB lags behind the other blends.

30% Pretraining / 70% Post-Training: v1v2 Blend

30% Nemotron-Pretraining-SFT-v1 + 70% Nemotron-v1/v2 post-training data.

Tokens	MMLU	MMLU Pro	GPQA Diamond	LCB v6	AIME 2025	Math 500	IFEval	SciCode
2.5B	71.9	68.9	49.8	56.4	55.3	93.3	58.2	14.6
5B	—	—	—	—	—	—	—	—
20B	71.6	71.2	52.7	58.0	65.1	94.0	55.7	14.2
40B	72.7	71.1	54.0	59.7	65.5	95.2	53.8	19.2
60B	73.0	71.9	55.9	60.0	67.8	95.4	56.4	21.7
80B	73.4	72.7	54.7	61.8	70.7	95.3	57.8	19.9
100B	73.5	72.8	56.4	62.4	71.9	95.8	59.1	19.4

Notes: Best MMLU of the 30/70 blends (~1pp above v3 blends). IFEval stays in the 54–59 range (lower than v3 blends). GPQA is unstable at longer runs (55.9 at 60B, 54.7 at 80B, 56.4 at 100B).

30% Pretraining / 70% Post-Training: v3 Blend

Refined v3 blend: dropped exercism/text2sql, added Nemotron-Math-v2 part01, boosted Math to 30% total.

Tokens	MMLU	MMLU Pro	GPQA Diamond	LCB v6	AIME 2025	Math 500	IFEval	SciCode
2.5B	70.5	69.0	51.2	59.1	62.9	94.3	62.2	11.6
5B	71.0	69.8	53.0	59.4	65.0	94.4	66.8	20.3
20B	71.2	70.8	53.3	60.0	69.1	95.3	63.8	22.6
40B	71.0	71.7	54.0	62.3	71.3	95.3	66.8	17.9
60B	72.0	72.3	56.3	62.0	71.6	95.6	65.5	21.5
80B	72.3	73.0	53.9	63.0	72.4	96.2	65.5	21.3

Notes: Better AIME and LCB than blend 1 (the v1v2 blend) at 40B+. GPQA is still unstable (56.3 at 60B, 53.9 at 80B). MMLU is ~1pp below the v1v2 blend.

Blend Design Notes

Why MMLU is ~1pp lower with v3 blends: The heavy reasoning-trace format (chain-of-thought, TIR) in v3 data suppresses the general knowledge recall measured by MMLU. This is structural: v1v2 post-training data has a more knowledge-dense format. Upweighting Pretraining-SFT-v1 General (to 20%) partially mitigates this. Given that MMLU Pro is better with v3 blends, the lower MMLU is acceptable.
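The size of the MMLU gap can be checked directly against the two 30/70 tables above. The sketch below copies the MMLU and MMLU Pro columns and averages the difference over the checkpoints both blends share; the helper `mean_gap` and the choice of an unweighted mean are assumptions made for illustration.

```python
# MMLU and MMLU Pro columns copied from the v1v2 and v3 tables above.
mmlu = {
    "v1v2": {"2.5B": 71.9, "20B": 71.6, "40B": 72.7, "60B": 73.0,
             "80B": 73.4, "100B": 73.5},
    "v3": {"2.5B": 70.5, "5B": 71.0, "20B": 71.2, "40B": 71.0,
           "60B": 72.0, "80B": 72.3},
}
mmlu_pro = {
    "v1v2": {"2.5B": 68.9, "20B": 71.2, "40B": 71.1, "60B": 71.9,
             "80B": 72.7, "100B": 72.8},
    "v3": {"2.5B": 69.0, "5B": 69.8, "20B": 70.8, "40B": 71.7,
           "60B": 72.3, "80B": 73.0},
}


def mean_gap(a, b):
    """Mean of a[k] - b[k] over the checkpoints present in both runs."""
    shared = [k for k in a if k in b]
    return sum(a[k] - b[k] for k in shared) / len(shared)


mmlu_gap = mean_gap(mmlu["v1v2"], mmlu["v3"])              # v1v2 minus v3
mmlu_pro_gap = mean_gap(mmlu_pro["v3"], mmlu_pro["v1v2"])  # v3 minus v1v2
print(f"MMLU: v1v2 ahead by {mmlu_gap:.2f}pp")
print(f"MMLU Pro: v3 ahead by {mmlu_pro_gap:.2f}pp")
```

Over the five shared checkpoints (2.5B, 20B, 40B, 60B, 80B) this gives roughly 1.1pp in favor of v1v2 on MMLU and roughly 0.2pp in favor of v3 on MMLU Pro.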

Why GPQA is unstable in blend 1: Science-v1 MCQ (497M tokens) and RQA (278M tokens) are repeated ~14× over 100B training tokens, causing overfitting to the MCQ format. Fix in v1v3: add Nemotron-Post-Training-Dataset-v1 STEM (~60B tokens, ~0.13 epochs at 80B) as the primary science source; reduce Science-v1 to low weights (3+2) for format alignment only.
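The repetition figures above translate into token budgets with simple arithmetic. The sketch below is a back-of-the-envelope check only: it assumes both Science-v1 subsets are repeated the same ~14 times, and the helper `implied_tokens` is a hypothetical name, not part of any training tool.

```python
def implied_tokens(dataset_tokens: float, epochs: float) -> float:
    """Tokens drawn from a dataset that is seen `epochs` times."""
    return dataset_tokens * epochs


B = 1e9  # one billion tokens

# Blend 1: Science-v1 MCQ + RQA, sizes taken from the text above.
science_v1 = 0.497 * B + 0.278 * B
blend1_science = implied_tokens(science_v1, 14)   # ~14x over a 100B run
blend1_share = blend1_science / (100 * B)

# v1v3: Nemotron-Post-Training-Dataset-v1 STEM at ~0.13 epochs in an 80B run.
stem_v1v3 = implied_tokens(60 * B, 0.13)
stem_share = stem_v1v3 / (80 * B)

print(f"blend 1 Science-v1: {blend1_science / B:.2f}B tokens ({blend1_share:.1%} of run)")
print(f"v1v3 STEM:          {stem_v1v3 / B:.2f}B tokens ({stem_share:.1%} of run)")
```

Under these assumptions both configurations spend roughly a tenth of the run on science data (about 10.9B and 7.8B tokens respectively); the difference is that v1v3 draws that budget from a source large enough to stay well under one epoch.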

Why 80B is the recommended stopping point: SciCode degrades or crashes at 100B (blend 2: 1.6; AIME also degrades). The best overall profile is at 60–80B tokens.
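One rough way to see the 60–80B plateau is to average all eight benchmarks per checkpoint for the v3 blend. The unweighted mean used here is an assumption made for illustration, not the selection criterion used for the final blend; the scores are copied from the v3 table above.

```python
# Columns: MMLU, MMLU Pro, GPQA Diamond, LCB v6, AIME 2025, Math 500, IFEval, SciCode
v3_blend = {
    "2.5B": [70.5, 69.0, 51.2, 59.1, 62.9, 94.3, 62.2, 11.6],
    "5B":   [71.0, 69.8, 53.0, 59.4, 65.0, 94.4, 66.8, 20.3],
    "20B":  [71.2, 70.8, 53.3, 60.0, 69.1, 95.3, 63.8, 22.6],
    "40B":  [71.0, 71.7, 54.0, 62.3, 71.3, 95.3, 66.8, 17.9],
    "60B":  [72.0, 72.3, 56.3, 62.0, 71.6, 95.6, 65.5, 21.5],
    "80B":  [72.3, 73.0, 53.9, 63.0, 72.4, 96.2, 65.5, 21.3],
}

# Unweighted mean across benchmarks for each checkpoint.
profile = {ckpt: sum(scores) / len(scores) for ckpt, scores in v3_blend.items()}
best = max(profile, key=profile.get)

for ckpt, avg in profile.items():
    print(f"{ckpt:>5}: {avg:.2f}")
print("best checkpoint by unweighted mean:", best)
```

With this aggregation the 60B and 80B checkpoints land within about 0.1 points of each other (roughly 64.6 and 64.7) and ahead of every earlier checkpoint.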