PRETRAIN.md
September 19, 2023 ยท View on GitHub
Pretrain TinyLlama
Installation
We expect you have CUDA 11.8 installed.
Install Pytorch Nightly.
pip install --index-url https://download.pytorch.org/whl/nightly/cu118 --pre 'torch>=2.1.0dev'
Build XFormers from Source
Note: as of 2023/09/02, xformers does not provide pre-built binaries for torch 2.1. You have to build it from source.
pip uninstall ninja -y && pip install ninja -U
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
Install Flash-Attention 2 and other fused operators:
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install
cd csrc/rotary && pip install .
cd ../layer_norm && pip install .
cd ../xentropy && pip install .
cd ../.. && rm -rf flash-attention
Install Remaining Dependencies
pip install -r requirements.txt tokenizers sentencepiece
to install other dependencies. It may take >= 5 minutes to build xformers/flash-attention. Do not worry if the process seemly stagnant or the terminal print out many warnings.
Then you are ready to go ๐!
Data Preparation
Download Datasets
Download the Slimpajama and Starcoderdata datasets to your chosen directory.
cd /path/to/dataset
git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B
git clone https://huggingface.co/datasets/bigcode/starcoderdata
The SlimPajama dataset eats 893GB diskspace and the starcoderdata takes 290GB.
Tokenize data
Use the provided scripts to tokenize the datasets and divide them into chunks.
python scripts/prepare_starcoder.py --source_path /path/to/starcoderdata/ --tokenizer_path data/llama --destination_path data/slim_star_combined --split train --percentage 1.0
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama --destination_path data/slim_star_combined --split validation --percentage 1.0
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama --destination_path data/slim_star_combined --split train --percentage 1.0
The processed data will take 1.8T storage.
Pretraining
If your setup comprises two nodes, each with 8 GPUs, you can initiate pretraining with the following commands:
On node 1:
lightning run model \
--node-rank=0 \
--main-address=172.16.101.5 \
--accelerator=cuda \
--devices=8 \
--num-nodes=2 \
pretrain/tinyllama.py --devices 8 --train_data_dir data/slim_star --val_data_dir data/slim_star
On node 2:
lightning run model \
--node-rank=1 \
--main-address=172.16.101.5 \
--accelerator=cuda \
--devices=8 \
--num-nodes=2 \
pretrain/tinyllama.py --devices 8 --train_data_dir data/slim_star --val_data_dir data/slim_star
You can follow these instructions if you have a slurm cluster.