This document provides a comprehensive reference for all training and evaluation parameters available in ERNIEKit. It covers:
- Basic model configuration and training setup
- Evaluation metrics and strategies
- Performance optimization techniques
- Distributed training configurations
- Memory optimization options
- Checkpoint saving strategies
- Acceleration methods
- Mixed precision training settings
- Specialized configurations for SFT, LoRA, DPO and FP8 training
Each parameter is documented with its type, default value and detailed description to help developers properly configure their training jobs.
| Parameter | Type | Default | Description |
|---|
model_name_or_path | str | Required | Model name or local model path for the model and tokenizer |
hidden_dropout_prob | float | 0.0 | Dropout probability for hidden layers |
attention_probs_dropout_prob | float | 0.0 | Dropout probability for attention layers |
dropout_warmup_steps | int | 0 | Warmup steps for dropout. Dropout probability increases linearly during warmup and disables afterward. Set to 0 to disable dropout. |
weight_quantize_algo | str | Required | Model quantization algorithm. Options: weight_only_mix (expert weights as int4, other linear layers as int8) or weight_only_int8 (all linear layers as int8) or fp8_linear |
output_dir | str | Required | Directory to save model files, checkpoints, tokenizers, and evaluation results |
logging_steps | int | Required | Logging interval. Decrease for more frequent updates. |
logging_dir | str | Required | Log directory (defaults to output_dir if unspecified) |
do_eval | bool | False | Enable model evaluation |
do_train | bool | False | Enable training |
disable_tqdm | bool | False | Disable tqdm progress bar for estimating total training time |
continue_training | bool | True | Load pretrained weights to continue training |
from_hf_hub | bool | False | Downloading model from HuggingFace Hub |
from_aistudio | bool | False | Downloading model from Aistudio |
from_modelscope | bool | False | Downloading model from ModelScope |
| Parameter | Type | Default | Description |
|---|
per_device_eval_batch_size | int | Required | Evaluation batch size (micro batch size) |
eval_dataset_path | str | Required | Path to evaluation dataset (see sft-eval.jsonl |
eval_dataset_prob | str | 1.0 | Evaluation dataset sampling probability. |
eval_dataset_type | str | erniekit | Evaluation dataset type. |
eval_steps | int | Required | Evaluation interval steps |
evaluation_strategy | str | "steps" | Evaluation strategy. "steps" enables periodic evaluation |
max_evaluate_steps | int | 1 | Maximum steps per evaluation (if positive) |
| Parameter | Type | Default | Description |
|---|
train_dataset_path | str | Required | Training dataset path (see sft-train.jsonl) |
train_dataset_prob | str | 1.0 | Training dataset sampling probability. |
train_dataset_type | str | erniekit | Training dataset type. |
max_steps | int | Required | Maximum training steps (overrides num_train_epochs if set) |
num_train_epochs | int | Required | Training epochs |
per_device_train_batch_size | int | Required | Training batch size (micro batch size). Global batch size = DP sharding micro_batch_size * gradient_accumulation_steps |
gradient_accumulation_steps | int | Required | Gradient accumulation steps |
weight_decay | float | 0.0 | AdamW optimizer weight decay |
seed | int | 42 | Random seed |
max_seq_len | int | Required | Maximum token length. Reduce if OOM occurs when increasing GBS. |
learning_rate | float | Required | Learning rate (SFT: 3e-5, DPO: 1e-6, SFT-LoRA: 3e-4, DPO-LoRA: 1e-5) |
warmup_steps | int | Required | Warmup steps (typically 1%-10% of max_steps) |
lr_scheduler_type | str | linear | Learning rate scheduler (linear/cosine/polynomial/constant/constant_with_warmup) |
min_lr | float | 0.0 | Minimum learning rate (cosine scheduler only) |
layerwise_lr_decay_bound | float | 1.0 | Layerwise LR decay factor (0,1]. 1 means no decay. |
random_shuffle | bool | True | Enable dataset shuffling |
num_cycles | float | 0.5 | Cosine scheduler: number of waves |
lr_end | float | 1e-7 | Polynomial scheduler: final LR |
power | float | 1.0 | Polynomial scheduler: power |
adam_beta1 | float | 0.9 | AdamW beta1 |
adam_beta2 | float | 0.999 | AdamW beta2 |
adam_epsilon | float | 1e-8 | AdamW epsilon |
| Parameter | Type | Default | Description |
|---|
tensor_parallel_degree | int | Required | Tensor parallelism degree |
tensor_parallel_config | str | Required | Recommended: "sync_param sync_grad sync_moment" |
tensor_parallel_output | bool | True | Enable parallel output for last Transformer layer to save memory |
pipeline_parallel_degree | int | Required | Pipeline parallelism degree |
pipeline_parallel_config | str | Required | Recommended: "disable_partial_send_recv enable_clear_every_step_cache enable_delay_scale_loss enable_overlap_p2p_comm best_unbalanced_scheduler" |
pp_seg_method | str | Required | Pipeline layer segmentation method |
virtual_pp_degree | int | 1 | Virtual pipeline degree (effective when pipeline_parallel_degree > 1) |
add_tail_layers | int | 0 | Add EmptyLayers after DecodeLayer for virtual pipeline requirements |
sharding_parallel_degree | int | Required | Sharding parallelism degree |
sharding_parallel_config | str | Required | Recommended: "enable_stage1_overlap enable_release_grads" |
sharding | str | Required | Sharding stage (stage1: optimizer, stage2: gradients, stage3: parameters) |
sequence_parallel | bool | True | Enable sequence parallelism |
moe_group | str | "dummy" | MoE communication group ("mp" for training, "dummy" for inference) |
| Parameter | Type | Default | Description |
|---|
release_grads | bool | False | Release gradients after each iteration to reduce peak memory |
use_sparse_head_and_loss_fn | bool | False | Use sparse LM Head and loss function |
use_fused_head_and_loss_fn | bool | False | Fuse LM head and CrossEntropyLoss to save memory |
use_attn_mask_startend_row_indices | bool | True | Use sparse mask representation with start row indices |
recompute_use_reentrant | bool | False | Recompute implementation (PyLayer if True, hooks if False) |
recompute | bool | False | Enable gradient checkpointing |
recompute_granularity | str | "full" | Recompute granularity ("full"/"full_attn"/"core_attn") |
offload_optim | bool | False | Offload optimizer to CPU |
| Parameter | Type | Default | Description |
|---|
save_steps | int | Required | Checkpoint save interval (when save_strategy=="steps") |
save_strategy | str | "no" | Checkpoint save strategy |
unified_checkpoint | bool | True | Use unified checkpoint format |
unified_checkpoint_config | str | "" | See Unified Checkpoint |
disable_ckpt_quant | bool | False | See Unified Checkpoint |
ignore_save_lr_and_optim | bool | False | Skip saving optimizer states |
ignore_load_lr_and_optim | bool | False | Skip loading optimizer states |
save_total_limit | int | None | Maximum number of checkpoints to keep |
| Parameter | Type | Default | Description |
|---|
use_flash_attention | bool | True | Enable FlashAttention |
use_sparse_flash_attn | bool | True | Enable FlashMask (requires use_attn_mask_startend_row_indices) |
fuse_rope | bool | False | Fuse rotary position embedding |
fuse_linear | bool | False | F fuse linear operations |
greedy_intokens | bool | True | Enable greedy token-based packing. Instead of sequential sampling, a global buffer of samples is maintained and greedily packed into sequences to maximize token utilization and minimize padding. |
dataloader_num_workers | int | 1 | Dataloader subprocess count (0 to disable) |
distributed_dataloader | int | 0 | Use distributed dataloader for large datasets |
moe_multimodal_dispatch_use_allgather | str | v2-alltoall-unpad | Optimize MoE layer with allgather+unpad |
| Parameter | Type | Default | Description |
|---|
bf16 | bool | False | Enable BF16 training |
fp16_opt_level | str | O1 | AMP level (O2 converts params to float16/bfloat16) |
scale_loss | int | 2 ** 15 | Loss scaling factor for float16 |
amp_custom_white_list | str | Required | AMP O2 whitelist (e.g., "lookup_table flash_attn matmul") |
amp_custom_black_list | str | Required | AMP O2 blacklist (e.g., "reduce_sum elementwise_div") |
amp_master_grad | bool | False | Maintain float32 gradients for AMP O2 |
| Parameter | Type | Default | Description |
|---|
num_samples_each_epoch | int | 6000000 | Virtual epoch size (recommend keeping default) |
| Parameter | Type | Default | Description |
|---|
lora_rank | int | 8 | LoRA rank (typical: 8/16/32. Higher improves quality but increases memory) |
lora_alpha | float | -1 | LoRA scaling factor (scaling = alpha/rank or alpha/sqrt(rank) for rslora) |
rslora | bool | False | Enable rslora scaling (recommended for rank ≥64) |
lora_plus_scale | float | 1 | LoRA+ learning rate multiplier (recommended: 4-16) |
rslora_plus | bool | False | Enhanced LoRA (improves performance but may cause forgetting) |
lora | bool | False | Enable LoRA training |
| Parameter | Type | Default | Description |
|---|
beta | float | 0.1 | DPO loss temperature |
simpo_gamma | float | 0.5 | SimPO loss gamma |
offset_alpha | float | 0.0 | Score-based DPO loss offset |
max_prompt_len | int | 2048 | Maximum prompt length (truncated beyond max_seq_len-10) |
loss_type | str | sigmoid | Preference loss type (sigmoid/ipo/kto_pair) |
pref_loss_ratio | float | 1.0 | Preference loss weight |
sft_loss_ratio | float | 0.0 | Chosen data SFT loss weight |
label_smoothing | float | 0.0 | Label smoothing for sigmoid loss |
reference_free | bool | False | Disable reference model |
ref_model_update_steps | int | -1 | Reference model update interval (-1 to disable) |
| Parameter | Type | Default | Description |
|---|
apply_hadamard | bool | True | Use Hadamard transform for FP8 precision |
use_lowprecision_moment | bool | False | Use BF16 optimizer momentum (recommended for FP8) |
tensorwise_offload_optimizer | bool | False | Offload optimizer to reduce memory |
apply_online_actscale_step | bool | 200 | Dynamic quantization scale steps |
optim_shard_num | int | 1 | Split optimizer state files during saving to avoid memory OOM. Works only when unified_checkpoint_config: ignore_merge_optimizer. |