PRETRAIN.md

September 19, 2023 · View on GitHub

Pretrain TinyLlama

Installation

We expect you have CUDA 11.8 installed.

Install Pytorch Nightly.

pip install --index-url https://download.pytorch.org/whl/nightly/cu118 --pre 'torch>=2.1.0dev'

Build XFormers from Source

Note: as of 2023/09/02, xformers does not provide pre-built binaries for torch 2.1. You have to build it from source.

pip uninstall ninja -y && pip install ninja -U
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers

Install Flash-Attention 2 and other fused operators:

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install
cd csrc/rotary && pip install .
cd ../layer_norm && pip install .
cd ../xentropy && pip install .
cd ../.. && rm -rf flash-attention

Install Remaining Dependencies

pip install -r requirements.txt tokenizers sentencepiece

to install other dependencies. It may take >= 5 minutes to build xformers/flash-attention. Do not worry if the process seemly stagnant or the terminal print out many warnings.

Then you are ready to go 🎉!

Data Preparation

Download Datasets

Download the Slimpajama and Starcoderdata datasets to your chosen directory.

cd /path/to/dataset
git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B
git clone https://huggingface.co/datasets/bigcode/starcoderdata

The SlimPajama dataset eats 893GB diskspace and the starcoderdata takes 290GB.

Tokenize data

Use the provided scripts to tokenize the datasets and divide them into chunks.

python scripts/prepare_starcoder.py --source_path /path/to/starcoderdata/ --tokenizer_path data/llama --destination_path data/slim_star_combined --split train --percentage 1.0
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama  --destination_path data/slim_star_combined --split validation --percentage 1.0
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama  --destination_path data/slim_star_combined --split train --percentage 1.0

The processed data will take 1.8T storage.

Pretraining

If your setup comprises two nodes, each with 8 GPUs, you can initiate pretraining with the following commands:

On node 1:

lightning run model \
    --node-rank=0  \
    --main-address=172.16.101.5 \
    --accelerator=cuda \
    --devices=8 \
    --num-nodes=2 \
    pretrain/tinyllama.py --devices 8 --train_data_dir data/slim_star  --val_data_dir data/slim_star

On node 2:

lightning run model \
    --node-rank=1  \
    --main-address=172.16.101.5 \
    --accelerator=cuda \
    --devices=8 \
    --num-nodes=2 \
    pretrain/tinyllama.py --devices 8 --train_data_dir data/slim_star   --val_data_dir data/slim_star

You can follow these instructions if you have a slurm cluster.