May 28, 2026
Train
To run training, prepare a YAML config. The following up-to-date example can be used as a template:
trainer_type: fsdp2_trainer
# Dataset configuration - now includes the actual dataset definitions
dataset_config:
  dataset_type: vision
  dataset_format: yaml # Uses 'yaml' format for both external files and inline definitions
  # Inline dataset definitions (no dataset_path needed)
  datasets:
    - path: data/open_thoughts_debug
      data_folder: ""
      data_type: arrow
  # Processor configuration
  processor_config:
    processor_name: "Qwen/Qwen2.5-VL-7B-Instruct"
    processor_type: "qwen2_5_vl"
  # Packing configuration
  packing: true
  packing_strategy: first_fit
  packing_length: 16384
# Model configuration
model_config:
  load_from_pretrained_path: "Qwen/Qwen2.5-VL-7B-Instruct"
  attn_implementation: "flash_attention_2"
# Training arguments, mostly compatible with HuggingFace Trainer
trainer_args:
  per_device_train_batch_size: 1
  learning_rate: 1.0e-06 # write 1.0e-06 (with the leading 1.0) so YAML recognizes it as a float
  weight_decay: 0.0
  gradient_accumulation_steps: 1
  gradient_checkpointing: true
  num_train_epochs: 1
  save_steps: 100
  save_total_limit: 1
  report_to: "wandb"
  output_dir: "./output/debug"
  warmup_ratio: 0.0
  run_name: "qwen2_5_vl_config"
  eval_strategy: "no"
  logging_steps: 1
  group_by_length: true
  dataloader_num_workers: 8
  bf16: true
  lr_scheduler_type: "cosine"
  freeze_modules: ["visual"]
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen2_5_VLDecoderLayer"]
    reshard_after_forward: false
To see which parameters are configurable, check the config.py file in each subfolder.
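The float note on `learning_rate` can be checked without a YAML parser. The sketch below uses a simplified form of the YAML 1.1 float pattern, in which the mantissa needs a dot and the exponent needs an explicit sign. The regex and the helper name `looks_like_yaml11_float` are illustrative assumptions, not part of this project; the pattern is meant to mirror PyYAML's default resolver, and other loaders may be more permissive.

```python
import re

# Simplified YAML 1.1 float pattern (decimal forms only): the mantissa must
# contain a dot and the exponent must carry an explicit sign.
# Assumption: this mirrors PyYAML's default resolver; other loaders may differ.
YAML11_FLOAT = re.compile(
    r"^[-+]?(?:[0-9][0-9_]*\.[0-9_]*|\.[0-9][0-9_]*)(?:[eE][-+][0-9]+)?$"
)


def looks_like_yaml11_float(scalar: str) -> bool:
    """Return True if a plain scalar matches the simplified YAML 1.1 float pattern."""
    return YAML11_FLOAT.match(scalar) is not None


for text in ("1.0e-06", "1e-06", "0.0"):
    print(text, looks_like_yaml11_float(text))
```

Under these rules `1.0e-06` is read as a float while `1e-06` would be kept as a string, which is why the example config spells out the `1.0`.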
Key fields
- trainer_type: Use `hf_trainer` for standard HF Trainer or `fsdp2_trainer` for PyTorch FSDP2.
- dataset_config.dataset_format: `yaml`. You can either set `dataset_path` to an external YAML, or embed datasets inline via `datasets`.
- datasets: Each entry defines `path`, optional `data_folder`, and `data_type` (e.g., `arrow`, `parquet`).
- processor_config: Set `processor_name` (e.g., a Hugging Face model id) and `processor_type` (e.g., `qwen2_5_vl`).
- packing: Enable sequence packing with `packing: true`, and adjust `packing_strategy` and `packing_length`. Use `filter_overlong` to drop samples exceeding limits.
- video options: `video_backend`, `video_sampling_strategy`, `video_max_pixels`, `video_max_frames` control video preprocessing.
- model_config: Prefer `load_from_pretrained_path` and set `attn_implementation` (e.g., `flash_attention_2`).
- freeze_modules: List of submodules (e.g., `visual`) to freeze during training.
- use_liger_kernel/use_rmpad: Performance optimizations. Keep enabled if supported on your stack.
- fsdp2/fsdp_config: Enable FSDP2 sharding and wrap transformer layer classes via `transformer_layer_cls_to_wrap`. Tune `reshard_after_forward` for memory/perf trade-offs.
- enable_cuda_event_profiler: Enable a low-overhead CUDA event profiler for FSDP2 training. It writes per-rank JSONL files under `output_dir/cuda_event_profiler` for phases such as `host_to_device`, `training_step`, and `training_metrics`.
- EMA (Exponential Moving Average): Enable EMA with `ema_enabled: true`. Configure `ema_decay` (default 0.9999), `ema_update_every`, `ema_start_step`, and optionally filter parameters via `ema_param_filter`. EMA checkpoints are saved alongside regular checkpoints and can be merged using `merge_fsdp.py` with `--state_dict_dirname pytorch_ema_model_fsdp_0`.
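To give a feel for what `packing_strategy: first_fit` and `packing_length` control, here is a minimal sketch of the generic first-fit heuristic: each sample goes into the first packed sequence that still has room, and a new one is opened otherwise. The function name `first_fit_pack` is hypothetical, and the engine's actual packing code may differ (for example in ordering or in how over-long samples are handled).

```python
from typing import List


def first_fit_pack(lengths: List[int], packing_length: int) -> List[List[int]]:
    """Group sample indices into bins whose total length stays within packing_length."""
    bins: List[List[int]] = []  # sample indices per packed sequence
    remaining: List[int] = []   # free capacity per packed sequence
    for idx, length in enumerate(lengths):
        if length > packing_length:
            # Over-long sample: skipped here, similar in spirit to filter_overlong.
            continue
        for b, free in enumerate(remaining):
            if length <= free:
                # First bin with enough room wins.
                bins[b].append(idx)
                remaining[b] -= length
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([idx])
            remaining.append(packing_length - length)
    return bins


packed = first_fit_pack([9000, 8000, 7000, 20000, 500], packing_length=16384)
print(packed)  # [[0, 2], [1, 4]]
```

In this toy run the 20000-token sample is dropped because it exceeds `packing_length`, and the remaining four samples fit into two packed sequences.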
Run
Example launch command:
export NCCL_BLOCKING_WAIT=0
export TOKENIZERS_PARALLELISM=false
# Hugging Face setup (optional)
export HF_TOKEN="<YOUR HF_TOKEN>"
export HF_HOME="$HOME/.cache/huggingface"
export HF_HUB_ENABLE_HF_TRANSFER="1"
export NCCL_DEBUG=INFO
CONFIG=$1 # path to your YAML config
torchrun --nproc_per_node="8" \
--nnodes="1" \
--node_rank="0" \
--master_addr="127.0.0.1" \
--master_port="8000" \
-m lmms_engine.launch.cli config_yaml=${CONFIG}
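When sizing a run, it helps to work out how many samples one optimizer step consumes. The arithmetic below is a sketch that assumes every launched process is a data-parallel worker (no tensor or sequence parallelism); the helper name `global_batch_size` is illustrative. The numbers come from the launch command above and the example config.

```python
def global_batch_size(
    nproc_per_node: int,
    nnodes: int,
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
) -> int:
    """Samples consumed per optimizer step, assuming every rank is a data-parallel worker."""
    world_size = nproc_per_node * nnodes
    return world_size * per_device_train_batch_size * gradient_accumulation_steps


samples = global_batch_size(
    nproc_per_node=8,
    nnodes=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
)
print(samples)  # 8

# With packing enabled, each sample is a packed sequence of at most
# packing_length tokens, so this is an upper bound on tokens per step.
print(samples * 16384)  # 131072
```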
Run directly from the CLI with Hydra overrides
Instead of using a YAML config file, you can pass configuration directly via Hydra overrides on the command line. This is useful for quick experiments and parameter tuning.
Basic Usage
Use the format key=value to override any configuration parameter. Hydra automatically creates the nested structure:
torchrun --nproc_per_node="8" \
--nnodes="1" \
--node_rank="0" \
--master_addr="127.0.0.1" \
--master_port="8000" \
-m lmms_engine.launch.cli \
trainer_type=fsdp2_trainer \
dataset_config.dataset_path=/path/to/video_dataset.yaml \
dataset_config.dataset_format=yaml \
dataset_config.dataset_type=qwen3_vl_iterable \
dataset_config.processor_config.processor_name="Qwen/Qwen3-VL-8B-Instruct" \
dataset_config.processor_config.processor_type=qwen3_vl \
model_config.load_from_pretrained_path="Qwen/Qwen3-VL-8B-Instruct" \
model_config.attn_implementation=flash_attention_2 \
trainer_args.per_device_train_batch_size=1 \
trainer_args.learning_rate=2.0e-04 \
trainer_args.num_train_epochs=1 \
trainer_args.output_dir=./output/debug \
trainer_args.bf16=true
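The mapping from dot-separated keys to nested config can be pictured with a few lines of plain Python. This is only an illustration of the idea, not Hydra's implementation: the function name `overrides_to_nested` is hypothetical, and values are kept as strings here, whereas Hydra also infers types.

```python
from typing import Any, Dict, List


def overrides_to_nested(overrides: List[str]) -> Dict[str, Any]:
    """Turn ["a.b=1", "a.c=x"] into {"a": {"b": "1", "c": "x"}} (values kept as strings)."""
    config: Dict[str, Any] = {}
    for item in overrides:
        # Split only on the first "=" so values may contain "=" themselves.
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            # Create intermediate mappings on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


nested = overrides_to_nested(
    [
        "trainer_type=fsdp2_trainer",
        "trainer_args.bf16=true",
        "dataset_config.processor_config.processor_type=qwen3_vl",
    ]
)
print(nested)
```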
Common Overrides
Here are frequently used parameters you can override:
Dataset Configuration:
- `dataset_config.dataset_path`: Path to your YAML dataset config
- `dataset_config.dataset_format`: Format type (e.g., `yaml`, `json`)
- `dataset_config.dataset_type`: Dataset type (e.g., `vision`, `qwen3_vl_iterable`)
- `dataset_config.processor_config.processor_name`: Model name for the processor
- `dataset_config.processor_config.processor_type`: Processor type to use
- `dataset_config.packing`: Enable/disable sequence packing (e.g., `packing=true`)
- `dataset_config.packing_length`: Max sequence length for packing
- `dataset_config.video_backend`: Video processing backend (e.g., `qwen_vl_utils`)
- `dataset_config.video_sampling_strategy`: Video sampling method (e.g., `fps`)
- `dataset_config.video_max_frames`: Maximum frames per video
Model Configuration:
- `model_config.load_from_pretrained_path`: Path or HF model ID to load from
- `model_config.attn_implementation`: Attention implementation (e.g., `flash_attention_2`)
Training Arguments:
- `trainer_args.per_device_train_batch_size`: Batch size per device
- `trainer_args.learning_rate`: Learning rate (use float notation like `2.0e-04`)
- `trainer_args.num_train_epochs`: Number of training epochs
- `trainer_args.max_steps`: Maximum training steps
- `trainer_args.gradient_accumulation_steps`: Gradient accumulation steps
- `trainer_args.gradient_checkpointing`: Enable gradient checkpointing
- `trainer_args.output_dir`: Output directory for checkpoints
- `trainer_args.run_name`: Name for this training run
- `trainer_args.bf16`: Use bfloat16 precision
- `trainer_args.fsdp2`: Enable FSDP2 distributed training
- `trainer_args.use_liger_kernel`: Enable Liger kernel optimizations
- `trainer_args.use_rmpad`: Enable padding removal optimization
- `trainer_args.ema_enabled`: Enable EMA (default: `false`)
- `trainer_args.ema_decay`: EMA decay rate (default: `0.9999`)
- `trainer_args.ema_update_every`: Update EMA every N steps (default: `1`)
- `trainer_args.ema_start_step`: Start EMA from step N (default: `0`)
- `trainer_args.ema_requires_grad_only`: Only apply EMA to trainable parameters (default: `true`)
- `trainer_args.ema_param_filter`: Filter parameters by name (supports `mode`, `include`, `exclude`)
- `trainer_args.ema_resume_from_ema`: Resume training from EMA weights (default: `false`)
- `trainer_args.enable_cuda_event_profiler`: Enable lightweight CUDA event timing (default: `false`)
- `trainer_args.cuda_event_profiler_config`: Optional profiler window, rank filter, and sampling config, e.g. `{start_step: 100, end_step: 1000, record_every_n_steps: 10, flush_every_n_steps: 50, ranks: [0, 1, 7]}`
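The EMA options interact in a simple way, which the following sketch illustrates with plain floats. It uses the textbook update `ema = decay * ema + (1 - decay) * param`; how the trainer actually aligns `ema_update_every` with `ema_start_step` is an assumption here, and the function name `ema_update` is hypothetical.

```python
from typing import Dict


def ema_update(
    ema: Dict[str, float],
    params: Dict[str, float],
    step: int,
    decay: float = 0.9999,
    update_every: int = 1,
    start_step: int = 0,
) -> bool:
    """Blend params into ema in place when this step qualifies; return True if updated."""
    # Assumption: updates begin at start_step and then recur every update_every steps.
    if step < start_step or (step - start_step) % update_every != 0:
        return False
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return True


ema_weights = {"w": 1.0}
current = {"w": 0.0}
skipped = ema_update(ema_weights, current, step=3, decay=0.9, start_step=5)
applied = ema_update(ema_weights, current, step=5, decay=0.9, start_step=5)
print(skipped, applied, ema_weights)
```

A decay close to 1 (such as the default `0.9999`) makes the averaged weights move slowly, so a later `ema_start_step` keeps early, noisy weights out of the average.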
Lightweight CUDA Event Profiling
For long-running distributed jobs, torch.profiler traces can be too heavy to keep enabled. The CUDA event profiler records only named phase durations and writes one JSON object per completed event:
trainer_args:
  enable_cuda_event_profiler: true
  cuda_event_profiler_config:
    start_step: 100
    end_step: 1000
    record_every_n_steps: 10
    flush_every_n_steps: 50
    ranks: [0, 1, 7]
Selected ranks write to output_dir/cuda_event_profiler/cuda_events_rank_<rank>.jsonl. These files can be aggregated into rank heatmaps or timeline views to diagnose stragglers without the synchronization overhead of full profiler traces.
The profiler is intended for diagnosis and remains disabled by default. For large jobs, prefer bounded windows, sampled steps, and rank filters instead of recording every rank on every step. If record_every_n_steps is omitted, it defaults to 10.
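As a starting point for the aggregation mentioned above, the sketch below averages event durations per rank file and phase. The JSON field names `name` and `duration_ms` are hypothetical, since the event schema is not described here; inspect one of your `cuda_events_rank_<rank>.jsonl` files and adjust the keys accordingly. The function name `mean_phase_durations` is also illustrative.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, Tuple


def mean_phase_durations(profile_dir: str) -> Dict[Tuple[str, str], float]:
    """Average a duration field per (rank file, phase) across cuda_events_rank_*.jsonl files."""
    totals = defaultdict(lambda: [0.0, 0])  # (file stem, phase) -> [sum, count]
    for path in sorted(Path(profile_dir).glob("cuda_events_rank_*.jsonl")):
        with path.open() as handle:
            for line in handle:
                if not line.strip():
                    continue  # tolerate blank lines
                event = json.loads(line)
                # Hypothetical field names; check a real file for the actual keys.
                key = (path.stem, event["name"])
                totals[key][0] += float(event["duration_ms"])
                totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

Calling `mean_phase_durations("output/debug/cuda_event_profiler")` would then return one mean per rank file and phase, which is enough to spot a rank whose `training_step` is consistently slower than the others.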
Advanced Example
See examples/qwen3_vl/qwen3_vl_8b_train.sh for a complete training script using Hydra overrides with comprehensive parameter configuration for multi-GPU training.
Overriding Existing YAML Config
You can use a YAML config file as a base and override specific parameters via the CLI using Hydra's `--config-path` and `--config-name` options:
torchrun --nproc_per_node="8" \
--nnodes="1" \
--node_rank="0" \
--master_addr="127.0.0.1" \
--master_port="8000" \
-m lmms_engine.launch.cli \
--config-path /path/to/config_yaml/directory \
--config-name qwen2_5_vl_dp \
trainer_args.max_steps=100
This loads all settings from qwen2_5_vl_dp.yaml in the specified directory and only overrides the specified parameters (CLI overrides take precedence).
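The precedence rule amounts to a recursive merge in which override values win and untouched base values survive. The sketch below shows that behavior on plain dictionaries; it is not how Hydra is implemented, the function name `merge_overrides` is hypothetical, and the base values are made-up stand-ins for the contents of a real config file.

```python
import copy
from typing import Any, Dict


def merge_overrides(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with overrides applied; nested mappings are merged key by key."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = value  # the override takes precedence
    return merged


# Made-up base values standing in for a YAML file's contents.
base_config = {"trainer_args": {"max_steps": -1, "bf16": True}}
cli_overrides = {"trainer_args": {"max_steps": 100}}
final_config = merge_overrides(base_config, cli_overrides)
print(final_config)  # {'trainer_args': {'max_steps': 100, 'bf16': True}}
```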
Tips
- Use quotes for string values: `processor_name="Qwen/Qwen2.5-VL-7B-Instruct"`
- Use dot notation for nested configs: `trainer_args.learning_rate=1.0e-06`
- Boolean values: `packing=true` or `packing=false`
- For complex values (lists/arrays), use Hydra's syntax: `trainer_args.fsdp_config.transformer_layer_cls_to_wrap=["Qwen2_5_VLDecoderLayer"]`
- Add new parameters with `+`: `+dataset_config.extra_kwargs.image_max_pixels=4194304`