Your First Training with Nanotron
March 24, 2025
This guide will walk you through the necessary steps to train your first model with Nanotron, a high-performance library for pretraining transformer models.
Prerequisites
Before you begin, make sure you have:
- Python 3.10 or later, but earlier than 3.12
- CUDA-enabled GPU(s)
- Nanotron installed (see Installation)
Single Node Training
Training a model on a single node involves two main steps:
- Creating a configuration
- Running the training script
Step 1: Creating a Configuration
Nanotron uses YAML configuration files to define training parameters. You can either:
- Use an existing YAML config directly
- Generate a YAML config from a Python script
Option A: Using a Python Script to Generate Config
Creating a config with Python offers more flexibility and allows for programmatic configuration generation. Here's how:
- Create a Python script similar to examples/config_tiny_llama.py.
- Run the Python script to generate the YAML config:
python my_config.py
Option B: Use an Existing Config File
You can also use one of the provided example configurations directly, such as examples/config_tiny_llama.yaml.
Step 2: Running the Training
Once you have your configuration file ready, you can start training using torchrun:
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file examples/config_tiny_llama.yaml
Where:
- CUDA_DEVICE_MAX_CONNECTIONS=1: Important environment variable for some distributed operations
- --nproc_per_node=8: Specifies the number of processes to launch, one per GPU. Make sure this equals dp * tp * pp, the product of the parallelism sizes in your configuration.
- run_train.py: Main training script
- --config-file examples/config_tiny_llama.yaml: Path to your configuration file
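The constraint on --nproc_per_node can be checked before launching. The following is a minimal sketch, not part of Nanotron: the helper names required_processes and check_launch are hypothetical, and the code assumes only that a single-node run needs exactly dp * tp * pp processes, as stated above.

```python
def required_processes(dp: int, tp: int, pp: int) -> int:
    """Number of processes (one per GPU) that a dp x tp x pp layout needs."""
    for name, value in (("dp", dp), ("tp", tp), ("pp", pp)):
        if value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value}")
    return dp * tp * pp


def check_launch(nproc_per_node: int, dp: int, tp: int, pp: int) -> None:
    """Raise if the single-node launch size disagrees with the parallelism sizes."""
    expected = required_processes(dp, tp, pp)
    if nproc_per_node != expected:
        raise ValueError(
            f"--nproc_per_node={nproc_per_node} but dp * tp * pp = {expected}"
        )


# Example: 8 GPUs split as 2-way data, 2-way tensor, and 2-way pipeline parallelism.
check_launch(nproc_per_node=8, dp=2, tp=2, pp=2)
```

Running a check like this first gives a readable error message when the launch size and the configuration disagree.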
Additional Configuration Notes
- Parallelism: Adjust dp, tp, and pp in the configuration based on your hardware:
  - dp: Data Parallelism - How many replicas of your model
  - tp: Tensor Parallelism - How to split individual tensors
  - pp: Pipeline Parallelism - How to split the model across stages
- Batch Size: The global batch size is calculated as: micro_batch_size * batch_accumulation_per_replica * dp
- Checkpointing: Set checkpoint_interval to control how often models are saved.
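The batch size formula above can be worked through in a few lines. This is an illustrative sketch with made-up values. The function names are hypothetical, and the sequence_length parameter used to convert samples to tokens is an assumption that does not appear in this guide.

```python
def global_batch_size(
    micro_batch_size: int, batch_accumulation_per_replica: int, dp: int
) -> int:
    """Samples consumed per optimizer step across all data-parallel replicas."""
    return micro_batch_size * batch_accumulation_per_replica * dp


def tokens_per_step(global_batch: int, sequence_length: int) -> int:
    """Tokens per optimizer step, assuming every sample has sequence_length tokens."""
    return global_batch * sequence_length


# Hypothetical values: 2 samples per micro-batch, 4 accumulation steps, 2 replicas.
samples = global_batch_size(
    micro_batch_size=2, batch_accumulation_per_replica=4, dp=2
)
print(samples)  # 16
print(tokens_per_step(samples, sequence_length=256))  # 4096
```

Note that tp and pp do not appear in the formula: only the number of data-parallel replicas multiplies the number of samples processed per step.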
Using a Custom Dataloader
If you want to use your own dataset instead of the built-in Hugging Face datasets support, you can create a custom dataloader:
- Set your dataset configuration to null:
  data:
    dataset: null # Custom dataloader will be used
    num_loading_workers: 1
    seed: 42
  name: Stable Training Stage
  start_training_step: 1
- Implement a custom dataloader similar to the example in examples/custom-dataloader/run_train.py.
For detailed instructions, refer to examples/custom-dataloader/README.md.
Multi-node Training
Check out the Multi-node Training guide for more information.
Troubleshooting
If you encounter errors related to token IDs in your dataloader, ensure that every token ID is smaller than the model's vocabulary size. Out-of-range token IDs are a common source of errors in training.
For more help, check the Troubleshooting section in the Multi-node Training guide.
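A check along these lines can catch the problem before training starts. It is a minimal sketch in plain Python. The function name find_out_of_range_ids is hypothetical, the sample values are made up, and the code assumes that valid token IDs are the integers from 0 up to, but not including, the vocabulary size.

```python
from typing import Iterable


def find_out_of_range_ids(token_ids: Iterable[int], vocab_size: int) -> list[int]:
    """Return the token IDs that fall outside the range [0, vocab_size)."""
    return [t for t in token_ids if not 0 <= t < vocab_size]


# Hypothetical sample: with a vocabulary of 256 entries, valid IDs are 0..255.
sample = [0, 17, 255, 256, 1024]
bad = find_out_of_range_ids(sample, vocab_size=256)
print(bad)  # [256, 1024]
```

Running such a check over a few batches from a custom dataloader is usually cheaper than debugging an indexing error raised from inside the model.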