Experiment Overview

May 28, 2025

  • Default: The default training configuration, with InitStd = 0.02, RouterDtype = bfloat16, AuxCoeff = 0.02.
  • InitStd-RouterDtype-AuxCoeff: InitStd = 6e-3, RouterDtype = float32, AuxCoeff = 0.0001.
  • Add-MTP: Builds on the InitStd-RouterDtype-AuxCoeff configuration above and adds an MTP module with an auxiliary loss weight of 0.3.

All three configurations were trained on 100B tokens for performance comparison.
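
As a compact summary, the three configurations can be written down as plain data. This is only a sketch: the class name, field names, and the use of `None` for "no MTP module" are hypothetical and are not taken from the training code; only the values come from the list above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ExperimentConfig:
    # Hypothetical field names that mirror the labels used in the list above.
    name: str
    init_std: float
    router_dtype: str
    aux_coeff: float
    mtp_loss_weight: Optional[float] = None  # None: no MTP module (assumed convention)
    train_tokens: int = 100_000_000_000      # 100B tokens for every run


DEFAULT = ExperimentConfig("Default", 0.02, "bfloat16", 0.02)
TUNED = ExperimentConfig("InitStd-RouterDtype-AuxCoeff", 6e-3, "float32", 1e-4)
# Add-MTP keeps the tuned settings and only adds the MTP auxiliary loss weight.
ADD_MTP = replace(TUNED, name="Add-MTP", mtp_loss_weight=0.3)

for cfg in (DEFAULT, TUNED, ADD_MTP):
    print(cfg)
```

Deriving `ADD_MTP` from `TUNED` with `replace` makes it explicit that the third run differs from the second only by the MTP module.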

Evaluation

| Category | Metric (shots) | Default (100B) | InitStd-RouterDtype-AuxCoeff (100B) | Add-MTP (100B) |
|---|---|---|---|---|
| English-Commonsense Reasoning | HellaSwag (5-shot) | 0.4414 | 0.4544 | 0.4568 |
| | TruthfulQA (0-shot) | 0.3735 | 0.3707 | 0.3438 |
| | Winogrande (5-shot) | 0.5927 | 0.5777 | 0.6062 |
| | CommonsenseQA (5-shot) | 0.2056 | 0.1966 | 0.2531 |
| | PIQA (5-shot) | 0.7274 | 0.7476 | 0.7454 |
| | OpenBookQA (5-shot) | 0.2760 | 0.3040 | 0.3180 |
| | BoolQ (5-shot) | 0.6294 | 0.6465 | 0.6471 |
| English-Problem-Solving | ARC Easy (5-shot) | 0.7029 | 0.7353 | 0.7264 |
| | ARC Challenge (5-shot) | 0.3703 | 0.3993 | 0.4053 |
| | MMLU (5-shot) | 0.2602 | 0.2671 | 0.3397 |
| English-Mathematics | GSM8K (5-shot) | 0.0182 | 0.0265 | 0.0136 |
| | Minerva Math (4-shot) | 0.0094 | 0.0098 | 0.0080 |
| Chinese | CEval (5-shot) | 0.2645 | 0.2600 | 0.3076 |
| | CMMLU (5-shot) | 0.2455 | 0.2475 | 0.2856 |
| Average Metrics | Average-English (w/o Math) | 0.4579 | 0.4699 | 0.4842 |
| | Average-English | 0.3839 | 0.3946 | 0.4053 |
| | Average-Chinese | 0.2550 | 0.2538 | 0.2966 |
| | Average | 0.3655 | 0.3745 | 0.3897 |
| | Average (w/o Math) | 0.4241 | 0.4339 | 0.4529 |
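
The averaging scheme is not stated, but the "Average Metrics" rows appear consistent with unweighted means over the individual benchmarks. The sketch below recomputes them from the per-benchmark scores in the table; the grouping lists and helper name are illustrative assumptions. Under this assumption the recomputed values agree with the reported rows to within about 1e-4.

```python
# Per-benchmark scores copied from the table, in column order:
# (Default, InitStd-RouterDtype-AuxCoeff, Add-MTP).
SCORES = {
    "HellaSwag": (0.4414, 0.4544, 0.4568),
    "TruthfulQA": (0.3735, 0.3707, 0.3438),
    "Winogrande": (0.5927, 0.5777, 0.6062),
    "CommonsenseQA": (0.2056, 0.1966, 0.2531),
    "PIQA": (0.7274, 0.7476, 0.7454),
    "OpenBookQA": (0.2760, 0.3040, 0.3180),
    "BoolQ": (0.6294, 0.6465, 0.6471),
    "ARC Easy": (0.7029, 0.7353, 0.7264),
    "ARC Challenge": (0.3703, 0.3993, 0.4053),
    "MMLU": (0.2602, 0.2671, 0.3397),
    "GSM8K": (0.0182, 0.0265, 0.0136),
    "Minerva Math": (0.0094, 0.0098, 0.0080),
    "CEval": (0.2645, 0.2600, 0.3076),
    "CMMLU": (0.2455, 0.2475, 0.2856),
}

# Assumed grouping, following the Category column of the table.
MATH = ["GSM8K", "Minerva Math"]
CHINESE = ["CEval", "CMMLU"]
ENGLISH_NO_MATH = [b for b in SCORES if b not in MATH and b not in CHINESE]


def mean_scores(benchmarks):
    """Unweighted mean over the given benchmarks, one value per configuration."""
    columns = zip(*(SCORES[name] for name in benchmarks))
    return tuple(sum(col) / len(benchmarks) for col in columns)


AVERAGES = {
    "Average-English (w/o Math)": mean_scores(ENGLISH_NO_MATH),
    "Average-English": mean_scores(ENGLISH_NO_MATH + MATH),
    "Average-Chinese": mean_scores(CHINESE),
    "Average": mean_scores(list(SCORES)),
    "Average (w/o Math)": mean_scores(ENGLISH_NO_MATH + CHINESE),
}

for label, values in AVERAGES.items():
    print(label, " ".join(f"{v:.4f}" for v in values))
```

Because each benchmark carries equal weight, the two mathematics tasks, whose scores are all below 0.03, pull the overall average down noticeably, which is why the "w/o Math" rows are reported separately.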