Ablations: Nemotron-Nano-9B-v2

May 28, 2026

Distillation

All experiments prune Nemotron-Nano-9B-v2 → 7B and distill with teacher = Nemotron-Nano-9B-v2 (official). The final chosen blend (30pre_70post_v1v3) is in README.md.

Note

AIME and SciCode numbers in the tables below were collected with single-shot evaluation. The main README.md reports the same model at the 80B checkpoint using avg-of-N (matching the current nemo_evaluator.yaml), so absolute values for those two columns may differ by 1–2pp from the corresponding entries below. The qualitative trends (best blend, plateau points) hold either way.


Baseline: Pre-SFT-v1 Only (no post-training data)

Pure Nemotron-Pretraining-SFT-v1 data only (no post-training reasoning traces).

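| Tokens | MMLU | MMLU Pro | GPQA Diamond | LCB v6 | AIME 2025 | Math 500 | IFEval | SciCode |
|---|---|---|---|---|---|---|---|---|
| 19B | 72.7 | 70.5 | 53.9 | 58.8 | 63.4 | 94.4 | 57.9 | 19.2 |
| 56B | 73.3 | 71.9 | 54.3 | 62.0 | 63.8 | 95.0 | 58.7 | 17.9 |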

Notes: Highest MMLU of any blend at a comparable token count (73.3 at 56B), but AIME stagnates (63.4 to 63.8) and LCB lags. Pretraining data alone is insufficient for reasoning benchmarks.


Baseline: Pure Post-Training Data (pt-v1v2)

100% post-training data (no pretraining data), Nemotron-v1/v2 blend.

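| Tokens | MMLU | MMLU Pro | GPQA Diamond | LCB v6 | AIME 2025 | Math 500 | IFEval | SciCode |
|---|---|---|---|---|---|---|---|---|
| 2.5B | 71.0 | 69.3 | 52.6 | 54.8 | 58.2 | 94.1 | 51.7 | 14.4 |
| 5B | 70.8 | 70.7 | 53.6 | 57.2 | 63.8 | 94.1 | 50.5 | 14.2 |
| 20B | 69.8 | 71.7 | 54.7 | 57.5 | 64.7 | 94.6 | 41.9 | 13.4 |
| 40B | 70.0 | 71.7 | 53.2 | 57.4 | 67.6 | 95.2 | 43.3 | 16.2 |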

Notes: IFEval degrades badly with longer training (51.7 at 2.5B, 41.9 at 20B). LCB lags behind the other blends.


30% Pretraining / 70% Post-Training: v1v2 Blend

30% Nemotron-Pretraining-SFT-v1 + 70% Nemotron-v1/v2 post-training data.

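| Tokens | MMLU | MMLU Pro | GPQA Diamond | LCB v6 | AIME 2025 | Math 500 | IFEval | SciCode |
|---|---|---|---|---|---|---|---|---|
| 2.5B | 71.9 | 68.9 | 49.8 | 56.4 | 55.3 | 93.3 | 58.2 | 14.6 |
| 5B | | | | | | | | |
| 20B | 71.6 | 71.2 | 52.7 | 58.0 | 65.1 | 94.0 | 55.7 | 14.2 |
| 40B | 72.7 | 71.1 | 54.0 | 59.7 | 65.5 | 95.2 | 53.8 | 19.2 |
| 60B | 73.0 | 71.9 | 55.9 | 60.0 | 67.8 | 95.4 | 56.4 | 21.7 |
| 80B | 73.4 | 72.7 | 54.7 | 61.8 | 70.7 | 95.3 | 57.8 | 19.9 |
| 100B | 73.5 | 72.8 | 56.4 | 62.4 | 71.9 | 95.8 | 59.1 | 19.4 |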

Notes: Best MMLU of the 30/70 blends (~1pp above the v3 blend). IFEval stays at roughly 54–59 (lower than the v3 blend). GPQA is unstable at longer runs (55.9 at 60B, 54.7 at 80B, 56.4 at 100B).


30% Pretraining / 70% Post-Training: v3 Blend

Refined v3 blend: dropped exercism/text2sql, added Nemotron-Math-v2 part01, boosted Math to 30% total.

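| Tokens | MMLU | MMLU Pro | GPQA Diamond | LCB v6 | AIME 2025 | Math 500 | IFEval | SciCode |
|---|---|---|---|---|---|---|---|---|
| 2.5B | 70.5 | 69.0 | 51.2 | 59.1 | 62.9 | 94.3 | 62.2 | 11.6 |
| 5B | 71.0 | 69.8 | 53.0 | 59.4 | 65.0 | 94.4 | 66.8 | 20.3 |
| 20B | 71.2 | 70.8 | 53.3 | 60.0 | 69.1 | 95.3 | 63.8 | 22.6 |
| 40B | 71.0 | 71.7 | 54.0 | 62.3 | 71.3 | 95.3 | 66.8 | 17.9 |
| 60B | 72.0 | 72.3 | 56.3 | 62.0 | 71.6 | 95.6 | 65.5 | 21.5 |
| 80B | 72.3 | 73.0 | 53.9 | 63.0 | 72.4 | 96.2 | 65.5 | 21.3 |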

Notes: Better AIME and LCB than the v1v2 blend (blend 1) at 40B+. GPQA is still unstable (56.3 at 60B, 53.9 at 80B). MMLU is ~1pp below the v1v2 blend.


Blend Design Notes

Why MMLU is ~1pp lower with v3 blends: The heavy reasoning-trace format (chain-of-thought, TIR) in v3 data suppresses the general knowledge recall measured by MMLU. This is structural: v1v2 post-training data has a more knowledge-dense format. Upweighting Pretraining-SFT-v1 General (to 20%) partially mitigates this. Given that MMLU Pro is better with v3 blends, the lower MMLU is acceptable.
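The size of the gap can be checked against the two 30/70 tables. The sketch below averages the v3 minus v1v2 difference over token counts where both blends have a reported row (the v1v2 5B row is empty, so it is skipped). The dictionary names and the choice of an unweighted mean are illustrative assumptions, not part of the original analysis.

```python
from statistics import mean

# (MMLU, MMLU Pro) per checkpoint, copied from the two 30/70 tables.
V1V2 = {
    "2.5B": (71.9, 68.9),
    "20B": (71.6, 71.2),
    "40B": (72.7, 71.1),
    "60B": (73.0, 71.9),
    "80B": (73.4, 72.7),
}
V3 = {
    "2.5B": (70.5, 69.0),
    "20B": (71.2, 70.8),
    "40B": (71.0, 71.7),
    "60B": (72.0, 72.3),
    "80B": (72.3, 73.0),
}


def mean_gap(column: int) -> float:
    """Mean of (v3 - v1v2) over checkpoints present in both tables."""
    shared = sorted(set(V1V2) & set(V3))
    return mean(V3[k][column] - V1V2[k][column] for k in shared)


mmlu_gap = mean_gap(0)      # column 0 is MMLU
mmlu_pro_gap = mean_gap(1)  # column 1 is MMLU Pro
print(f"MMLU gap: {mmlu_gap:+.2f}pp, MMLU Pro gap: {mmlu_pro_gap:+.2f}pp")
```

On these rows the mean MMLU gap comes out at roughly -1.1pp and the mean MMLU Pro gap at roughly +0.2pp, which matches the direction of both claims.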

Why GPQA is unstable in blend 1: Science-v1 MCQ (497M tokens) and RQA (278M tokens) are repeated ~14× over 100B training tokens, causing overfitting to the MCQ format. Fix in v1v3: add Nemotron-Post-Training-Dataset-v1 STEM (~60B tokens, ~0.13 epochs at 80B) as the primary science source, and reduce Science-v1 to low weights (3+2) for format alignment only.
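The repetition figures follow from simple arithmetic: epochs = blend weight × total training tokens / dataset size. The sketch below inverts that relation to recover the blend weights implied by the quoted numbers. It assumes the ~14× figure refers to the two Science-v1 subsets combined; the function and constant names are hypothetical, and the actual blend weights are not stated here.

```python
def epochs_seen(dataset_tokens: float, blend_weight: float, total_tokens: float) -> float:
    """Passes over a dataset that receives `blend_weight` of the token budget."""
    return blend_weight * total_tokens / dataset_tokens


def implied_weight(dataset_tokens: float, epochs: float, total_tokens: float) -> float:
    """Blend weight needed to see a dataset `epochs` times within the budget."""
    return epochs * dataset_tokens / total_tokens


# Sizes quoted in the paragraph above.
SCIENCE_V1_TOKENS = 497e6 + 278e6  # MCQ + RQA, assumed to be counted together
STEM_TOKENS = 60e9                 # Nemotron-Post-Training-Dataset-v1 STEM (approximate)

# ~14 epochs over a 100B-token run implies roughly an 11% blend weight.
science_weight = implied_weight(SCIENCE_V1_TOKENS, epochs=14, total_tokens=100e9)

# ~0.13 epochs over an 80B-token run implies roughly a 10% blend weight.
stem_weight = implied_weight(STEM_TOKENS, epochs=0.13, total_tokens=80e9)

print(f"Science-v1 implied weight: {science_weight:.2%}")
print(f"STEM implied weight:       {stem_weight:.2%}")
```

The two sources end up with a similar share of the budget, but the larger STEM set is seen well under once while the small Science-v1 set is cycled many times, which is the overfitting risk described above.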

Why 80B is the recommended stopping point: SciCode degrades or crashes at 100B (blend 2: 1.6; AIME also degrades). The best overall profile is at 60–80B tokens.
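One rough way to compare checkpoints is to rank them by an unweighted mean across the eight benchmarks. The sketch below does this for the v3 blend rows; the equal weighting and the names used are assumptions made for illustration, not the selection criterion used for the final model.

```python
from statistics import mean

# v3 blend rows: MMLU, MMLU Pro, GPQA Diamond, LCB v6,
# AIME 2025, Math 500, IFEval, SciCode.
V3_BLEND = {
    "2.5B": [70.5, 69.0, 51.2, 59.1, 62.9, 94.3, 62.2, 11.6],
    "5B": [71.0, 69.8, 53.0, 59.4, 65.0, 94.4, 66.8, 20.3],
    "20B": [71.2, 70.8, 53.3, 60.0, 69.1, 95.3, 63.8, 22.6],
    "40B": [71.0, 71.7, 54.0, 62.3, 71.3, 95.3, 66.8, 17.9],
    "60B": [72.0, 72.3, 56.3, 62.0, 71.6, 95.6, 65.5, 21.5],
    "80B": [72.3, 73.0, 53.9, 63.0, 72.4, 96.2, 65.5, 21.3],
}


def rank_checkpoints(table: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Return (checkpoint, mean score) pairs, best mean first."""
    scored = [(name, mean(scores)) for name, scores in table.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_checkpoints(V3_BLEND)
for name, score in ranking:
    print(f"{name:>5}: {score:.2f}")
```

Under this equal weighting, 80B and 60B take the top two places and sit within about 0.1 points of each other, consistent with a 60–80B sweet spot. A different weighting of the benchmarks could reorder them.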