Multi-Node Training with Nanotron
March 20, 2025
This guide explains how to train models with Nanotron across multiple compute nodes using Slurm, a popular workload manager for high-performance computing (HPC) clusters.
Using the Slurm Launcher
Nanotron provides a convenient script (slurm_launcher.py) to simplify launching multi-node training jobs on Slurm clusters. This script handles configuration generation, resource allocation, and job submission in one step.
Basic Usage
python slurm_launcher.py --run_name my_experiment --nodes 4 --model_size base
This will:
- Generate a Nanotron configuration file based on your parameters
- Create a Slurm job script with appropriate settings
- Submit the job to the Slurm scheduler
- Save everything needed for reproducibility
Important Parameters
The launcher supports many parameters, organized into logical groups:
Required Parameters
--run_name: Name for your experiment (will be used in logs and checkpoints)
Slurm Configuration
- --nodes: Number of nodes to use (default: 2)
- --gpus_per_node: Number of GPUs per node (default: 8)
- --partition: Slurm partition to use (default: "hopper-prod")
- --qos: Slurm QOS to use (default: "normal")
- --time_limit: Time limit for the job in HH:MM:SS format (default: "1:00:00")
- --email: Email address for job notifications
- --tmp_dir: Temporary directory on compute nodes (default: "/tmp")
- --pre_launch_commands: Commands to run before job launch
- --extra_env: Additional environment variables to set
Model Configuration
- --model_size: Predefined size (tiny, small, base, large)
- --hidden_size, --intermediate_size, --num_layers, etc.: Custom model dimensions
Training Configuration
- --seed: Random seed for reproducibility (default: 42)
- --train_steps: Number of training steps (default: 10000)
- --micro_batch_size: Size of micro batches (default: 2)
- --grad_accum_steps: Gradient accumulation steps (default: 8)
- --learning_rate: Peak learning rate (default: 3e-4)
- --min_lr: Minimum learning rate for decay (default: 3e-5)
- --weight_decay: Weight decay (default: 0.01)
- --grad_clip: Gradient clipping (default: 1.0)
- --warmup_steps: Learning rate warmup steps (default: 1000)
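To see how the batch-related options combine, here is a minimal sketch of the number of sequences consumed per optimizer step. It assumes the usual convention that the global batch is the product of the micro batch size, the gradient accumulation steps, and the data parallelism degree (--dp, listed under Parallelism Strategy). The function name `global_batch_size` is hypothetical and is not part of the launcher.

```python
def global_batch_size(micro_batch_size: int, grad_accum_steps: int, dp: int) -> int:
    """Sequences consumed per optimizer step across all data-parallel replicas.

    Assumes global batch = micro batch x accumulation steps x DP degree.
    """
    return micro_batch_size * grad_accum_steps * dp


# Launcher defaults: --micro_batch_size 2, --grad_accum_steps 8, --dp 8
sequences_per_step = global_batch_size(micro_batch_size=2, grad_accum_steps=8, dp=8)
print(sequences_per_step)  # 128
```

Under this assumption, changing --dp without adjusting --grad_accum_steps changes the effective batch size, which may call for retuning --learning_rate.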
Parallelism Strategy
- --dp: Data parallelism (DP) degree (default: 8)
- --pp: Pipeline parallelism (PP) degree (default: 1)
- --tp: Tensor parallelism (TP) degree (default: 2)
Note: Make sure that DP × PP × TP does not exceed your total number of GPUs (nodes × gpus_per_node).
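The constraint in the note can be checked before submitting a job. The following is a small sketch, not the launcher's own validation; the helper name `check_parallelism` is hypothetical.

```python
def check_parallelism(nodes: int, gpus_per_node: int, dp: int, pp: int, tp: int) -> tuple[int, int]:
    """Return (gpus_used, gpus_available); raise if the layout does not fit."""
    gpus_available = nodes * gpus_per_node
    gpus_used = dp * pp * tp
    if gpus_used > gpus_available:
        raise ValueError(
            f"DP x PP x TP = {gpus_used} exceeds the {gpus_available} GPUs available "
            f"({nodes} nodes x {gpus_per_node} GPUs per node)"
        )
    return gpus_used, gpus_available


# Launcher defaults: 2 nodes x 8 GPUs with dp=8, pp=1, tp=2
print(check_parallelism(nodes=2, gpus_per_node=8, dp=8, pp=1, tp=2))  # (16, 16)
```

With the defaults the layout uses every allocated GPU. Requesting a single node with the same parallelism settings would raise a `ValueError`.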
Dataset Configuration
- --dataset: Hugging Face dataset name or path (default: "stas/openwebtext-10k")
- --text_column: Column name for text in the dataset (default: "text")
- --tokenizer: Tokenizer name or path (default: "robot-test/dummy-tokenizer-wordlevel")
Path Configuration
- --project: Project name for logging (default: "nanotron")
- --configs_path: Where to save configurations (default: "logs/configs")
- --slurm_logs_path: Where to save Slurm output logs (default: "logs/slurm_logs")
- --checkpoints_path: Where to save model checkpoints (default: "checkpoints")
- --slurm_scripts_dir: Directory to save generated Slurm scripts (default: "logs/slurm_scripts")
- --run_train_script: Custom training script path (default: "run_train.py")
- --save_interval: Interval for saving checkpoints in steps (default: 1000)
- --save_initial_state: Save initial model state before training
Logging Configuration
- --wandb_disabled: Disable logging to Weights & Biases
- --profiler_export_path: Path to export the profiler tensorboard data
Execution Control
- --dry_run: Generate configs but don't submit the job
- --show_logs: Show output of the job as it runs
Examples
Training a Small Model for Testing
python slurm_launcher.py \
--run_name quick_test \
--nodes 2 \
--model_size tiny \
--train_steps 100 \
--dataset stas/openwebtext-10k
Training a Large Model for Production
python slurm_launcher.py \
--run_name production_run \
--nodes 8 \
--model_size large \
--dp 4 \
--pp 2 \
--tp 2 \
--train_steps 50000 \
--learning_rate 2e-4 \
--warmup_steps 2000 \
--dataset my_dataset \
--tokenizer my_tokenizer \
--email researcher@example.com \
--time_limit 72:00:00
Note: In this example, we're using 16 GPUs for training (4×2×2 = 16) out of 64 available GPUs (8 nodes × 8 GPUs).
Custom Model Architecture
python slurm_launcher.py \
--run_name custom_arch \
--nodes 4 \
--hidden_size 1536 \
--num_layers 24 \
--num_heads 16 \
--num_kv_heads 4 \
--train_steps 20000
Dry Run (Generate Config Without Submitting)
python slurm_launcher.py \
--run_name test_config \
--model_size base \
--dry_run
Using a Custom Training Script
python slurm_launcher.py \
--run_name custom_script \
--nodes 2 \
--model_size base \
--run_train_script path/to/my_custom_train.py \
--slurm_scripts_dir slurm_scripts
Monitoring Training Output in Real-time
python slurm_launcher.py \
--run_name monitored_run \
--nodes 2 \
--model_size base \
--show_logs
Manual Multi-Node Configuration
If you prefer to set up multi-node training manually, follow these steps:
- Create a Nanotron configuration file (YAML or Python)
- Set appropriate parallelism parameters:

from nanotron.config import ParallelismArgs

parallelism = ParallelismArgs(
    dp=8,  # Adjust based on (total_gpus / (pp * tp))
    pp=2,  # Pipeline parallelism degree
    tp=2,  # Tensor parallelism degree
    pp_engine="1f1b",
    tp_mode="REDUCE_SCATTER",
    tp_linear_async_communication=True,
)

- Create a Slurm batch script:

#!/bin/bash
#SBATCH --job-name=nanotron-training
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --partition=your_partition
#SBATCH --output=logs/%x-%j.out

# Resolve the allocated nodes; the first one acts as the rendezvous host
export HOSTNAMES=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=6000
export COUNT_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l)

export TMPDIR=/tmp
export CUDA_DEVICE_MAX_CONNECTIONS=1

# One srun task per node; torchrun starts 8 processes (one per GPU) on each
srun bash -c "torchrun \
    --nproc_per_node 8 \
    --nnodes $COUNT_NODE \
    --rdzv_backend c10d \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --max_restarts 0 \
    run_train.py --config-file your_config.yaml"

- Submit the job:

sbatch your_job_script.sh
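The `dp` comment in the parallelism snippet above (total_gpus / (pp * tp)) can be worked through as a short sketch. The helper name `derive_dp` is hypothetical; it assumes you want every GPU in the allocation to be used.

```python
def derive_dp(nodes: int, gpus_per_node: int, pp: int, tp: int) -> int:
    """Data parallelism degree that uses every allocated GPU for a given PP and TP."""
    total_gpus = nodes * gpus_per_node
    model_parallel_size = pp * tp
    if total_gpus % model_parallel_size != 0:
        raise ValueError(
            f"{total_gpus} GPUs cannot be split evenly into groups of "
            f"pp * tp = {model_parallel_size}"
        )
    return total_gpus // model_parallel_size


# The batch script above requests 4 nodes x 8 GPUs, with pp=2 and tp=2
print(derive_dp(nodes=4, gpus_per_node=8, pp=2, tp=2))  # 8
```

This matches the `dp=8` value in the manual configuration: 32 GPUs divided by a model-parallel group of 4.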
Tips for Multi-Node Training
- Node Communication: Ensure your cluster has a high-speed interconnect (like InfiniBand) for efficient multi-node communication.
- Balanced Parallelism:
  - For small models (< 1B parameters): Focus on data parallelism
  - For medium models (1-10B): Use TP=2, PP=1 or PP=2
  - For large models (> 10B): Increase both TP and PP
- Fault Tolerance: Configure --save_interval to save checkpoints regularly in case of job failures.
- Monitoring: Use --show_logs to monitor training progress in real time.
- Resource Efficiency: Balance your parallelism settings (--dp, --pp, --tp) to maximize GPU utilization.
- Environment Variables: Use --extra_env to set additional environment variables, such as NCCL settings.
Troubleshooting
Common Issues
- GPU Communication Errors
  - Check network connectivity between nodes
  - Try setting CUDA_DEVICE_MAX_CONNECTIONS=1
  - Use environment variables like NCCL_DEBUG=WARN for debugging
- Out of Memory Errors
  - Reduce batch size or sequence length
  - Increase pipeline parallelism
  - Consider gradient checkpointing
- Job Timeouts
  - Increase the --time_limit parameter
  - Set appropriate checkpointing intervals with --save_interval
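For job timeouts, a rough estimate of how many steps fit in the time limit can help when choosing --time_limit and --save_interval. The sketch below assumes the HH:MM:SS format described earlier and a step time you have measured yourself. The 2.5 seconds per step used here is a made-up illustrative value, and the function names are hypothetical.

```python
def parse_time_limit(time_limit: str) -> int:
    """Convert an HH:MM:SS string into seconds."""
    hours, minutes, seconds = (int(part) for part in time_limit.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def steps_that_fit(time_limit: str, seconds_per_step: float) -> int:
    """Number of whole training steps that complete within the time limit."""
    return int(parse_time_limit(time_limit) // seconds_per_step)


def last_checkpoint_step(completed_steps: int, save_interval: int) -> int:
    """Most recent step at which a checkpoint would have been written."""
    return (completed_steps // save_interval) * save_interval


# Default --time_limit "1:00:00" and --save_interval 1000,
# with an assumed (not measured) 2.5 seconds per step
completed = steps_that_fit("1:00:00", seconds_per_step=2.5)
print(completed)                              # 1440
print(last_checkpoint_step(completed, 1000))  # 1000
```

Under these assumed numbers, the default 10000 training steps would not finish within the default one-hour limit, and the 440 steps after the last checkpoint would be lost when the job is killed. The estimate ignores startup, data loading, and checkpoint-writing time.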
For more detailed information, refer to the Nanotron documentation and your cluster's specific Slurm documentation.