Resource Tables

June 4, 2024 · View on GitHub

Last updated: 10/20/2023
LitGPT version: commit 8641822
Hardware: NVIDIA A100-SXM4-40GB
OS: Ubuntu 22.04.3 LTS (x86_64)
Nvidia driver version: 525.125.06
Relevant libraries
- PyTorch 2.1.0+cu121
- Bitsandbytes 0.41.1

This document provides an overview and examples of hardware requirements when running models in LitGPT.

For additional tips on lowering the GPU memory footprint, please also see the Dealing with out-of-memory (OOM) errors document.

All experiments were run using 16-bit brain floating point precision (--precision bf16-true). If your GPU does not support brain floating point precision, you can use regular 16-bit floating point precision (--precision 16-true).

All experiments were conducted using the Alpaca dataset with its default length. Note that due to different tokenizers being used by the different models, the number of tokens in the longest training example differs based on the model:

phi1.5: 1044 tokens
StableLM Alpha: 1034 tokens
Llama 2: 1304 tokens
Falcon 1079 tokens

Note that the number of tokens in the training set does not affect the supported context width (block size) of the models, which is as follows:

phi1.5: 2048 tokens
StableLM 3B Alpha: 4096 tokens
Llama 2: 4048 tokens
Falcon: 2048 tokens
CodeLlama 13B: 16384 tokens

Finetuning with LoRA on 1 GPU

The following experiments were conducted on 1xA100 with a minibatch size of 128 using the litgpt finetune_lora command.

Size	Model	Quantization	Microbatch size	Trainable parameters	Max GPU RAM	Time 1k iterations
1.3 B	phi-1.5	None	1	1,572,864	4.82 GB	1.62 min
1.3 B	phi-1.5	bnb.nf4	1	1,572,864	3.78 GB	1.77 min
1.3 B	phi-1.5	bnb.nf4-dq	1	1,572,864	3.72 GB	1.87 min
1.3 B	phi-1.5	None	2	1,572,864	6.76 GB	1.65 min
1.3 B	phi-1.5	None	4	1,572,864	10.68 GB	1.70 min

3 B	StableLM Alpha	None	1	2,097,152	9.69 GB	1.24 min
3 B	StableLM Alpha	bnb.nf4	1	2,097,152	6.35 GB	1.82 min
3 B	StableLM Alpha	bnb.nf4-dq	1	2,097,152	6.19 GB	1.87 min
3 B	StableLM Alpha	None	2	2,097,152	12.10 GB	1.33 min
3 B	StableLM Alpha	None	4	2,097,152	16.92 GB	1.50 min

7 B	Llama 2	None	1	4,194,304	21.30 GB	2.36 min
7 B	Llama 2	bnb.nf4	1	4,194,304	14.14 GB	3.68 min
7 B	Llama 2	bnb.nf4-dq	1	4,194,304	13.84 GB	3.83 min
7 B	Llama 2	None	2	4,194,304	29.07 GB	2.52 min
7 B	Llama 2	None	4	4,194,304	OOM	-

13 B	Llama 2	None	1	6,553,600	38.12 GB	3.19 min
13 B	Llama 2	bnb.nf4	1	6,553,600	23.14 GB	6.38 min
13 B	Llama 2	bnb.nf4-dq	1	6,553,600	22.55 GB	6.55 min
13 B	Llama 2	None	2	6,553,600	OOM	-
13 B	Llama 2	None	4	6,553,600	OOM	-

40 B	Falcon	None	1	12,042,240	OOM	-
40 B	Falcon	bnb.nf4	1	12,042,240	OOM	-
40 B	Falcon	bnb.nf4-dq	1	12,042,240	OOM	-

Finetuning with Adapter on 1 GPU

The following experiments were conducted on 1xA100 with a minibatch size of 128 using the litgpt finetune_adapter command.

Size	Model	Quantization	Microbatch size	Trainable parameters	Max GPU RAM	Time 1k iterations
3 B	StableLM Alpha	None	1	573,888	9.10 GB	0.74 min
3 B	StableLM Alpha	bnb.nf4	1	573,888	5.65 GB	1.38 min
3 B	StableLM Alpha	bnb.nf4-dq	1	573,888	5.48 GB	1.46 min

7 B	Llama 2	None	1	1,229,760	19.98 GB	1.50 min
7 B	Llama 2	bnb.nf4	1	1,229,760	12.68 GB	2.93 min
7 B	Llama 2	bnb.nf4-dq	1	1,229,760	12.38 GB	3.00 min

The same config, but using the litgpt finetune_adapter_v2 command.

Size	Model	Quantization	Microbatch size	Trainable parameters	Max GPU RAM	Time 1k iterations
3 B	StableLM Alpha	None	1	2,125,248	10.71 GB	0.87 min
3 B	StableLM Alpha	bnb.nf4	1	2,125,248	7.41 GB	1.59 min
3 B	StableLM Alpha	bnb.nf4-dq	1	2,125,248	7.25 GB	1.62 min

7 B	Llama 2	None	1	4,279,744	25.51 GB	1.81 min
7 B	Llama 2	bnb.nf4	1	4,279,744	18.30 GB	3.23 min
7 B	Llama 2	bnb.nf4-dq	1	4,279,744	17.98 GB	3.32 min

Finetuning with LoRA on Multiple GPUs

The following experiments were conducted on multiple A100 GPUs with a minibatch size of 128 using the litgpt finetune_lora command.

Size	Model	Quantization	Microbatch size	Trainable parameters	GPU	Max GPU RAM	Time 1k iterations
1.3 B	phi-1.5	None	1	1,572,864	2 x A100	4.86 GB	3.81 min
1.3 B	phi-1.5	bnb.nf4	1	1,572,864	2 x A100	N/A	-
1.3 B	phi-1.5	bnb.nf4-dq	1	1,572,864	2 x A100	N/A	-
1.3 B	phi-1.5	None	2	1,572,864	2 x A100	5.05 GB	3.63 min
1.3 B	phi-1.5	None	4	1,572,864	2 x A100	5.88 GB	3.64 min

3 B	StableLM Alpha	None	1	2,097,152	2 x A100	12.75 GB	2.92 min
3 B	StableLM Alpha	None	2	2,097,152	2 x A100	12.94 GB	3.06 min
3 B	StableLM Alpha	None	4	2,097,152	2 x A100	13.45 GB	3.86 min
							-
7 B	Llama 2	None	1	4,194,304	2 x A100	22.18 GB	5.93 min
7 B	Llama 2	None	2	4,194,304	2 x A100	22.47 GB	6.48 min
7 B	Llama 2	None	4	4,194,304	2 x A100	23.39 GB	8.66 min

13 B	Llama 2	None	1	6,553,600	2 x A100	OOM	-
13 B	Llama 2	bnb.nf4	1	6,553,600	2 x A100	N/A	-
13 B	Llama 2	bnb.nf4-dq	1	6,553,600	2 x A100	N/A	-

13 B	Llama 2	None	1	6,553,600	4 x A100	35.57 GB	10.25 min
40 B	Falcon	None	1	12,042,240	4 x A100	OOM	-

Single-GPU Inference

Size	Model	Quantization	GPU	Max GPU RAM	Token/sec
1.3 B	phi-1.5	None	1 x A100	2.86 GB	42.56
1.3 B	phi-1.5	bnb.nf4	1 x A100	1.39 GB	22.89
1.3 B	phi-1.5	bnb.nf4-dq	1 x A100	1.33 GB	22.75

3 B	StableLM Alpha	None	1 x A100	7.30 GB	49.01
3 B	StableLM Alpha	bnb.nf4	1 x A100	3.20 GB	29.04
3 B	StableLM Alpha	bnb.nf4-dq	1 x A100	3.04 GB	27.15

7 B	Llama 2	None	1 x A100	13.52 GB	30.97
7 B	Llama 2	bnb.nf4	1 x A100	4.57 GB	19.98
7 B	Llama 2	bnb.nf4-dq	1 x A100	4.26 GB	17.3

13 B	Llama 2	None	1 x A100	26.21 GB	24.82
13 B	Llama 2	bnb.nf4	1 x A100	8.32 GB	16.73
13 B	Llama 2	bnb.nf4-dq	1 x A100	7.72 GB	14.43

34 B	CodeLlama	None	1 x A100	OOM	-
34 B	CodeLlama	bnb.nf4	1 x A100	20.52 GB	14.32
34 B	CodeLlama	bnb.nf4-dq	1 x A100	18.95 GB	12.37

40 B	Falcon	None	1 x A100	OOM	-
40 B	Falcon	bnb.nf4	1 x A100	26.55 GB	13.25
40 B	Falcon	bnb.nf4-dq	1 x A100	24.63 GB	11.64

70 B	Llama 2	None	1 x A100	OOM	-
70 B	Llama 2	bnb.nf4	1 x A100	CUDA error: CUBLAS_STATUS_NOT_INITIALIZED	-
70 B	Llama 2	bnb.nf4-dq	1 x A100	37.21 GB	7.97