Resource Tables

June 4, 2024

  • Last updated: 10/20/2023
  • LitGPT version: commit 8641822
  • Hardware: NVIDIA A100-SXM4-40GB
  • OS: Ubuntu 22.04.3 LTS (x86_64)
  • NVIDIA driver version: 525.125.06
  • Relevant libraries
    • PyTorch 2.1.0+cu121
    • Bitsandbytes 0.41.1

This document provides an overview and examples of hardware requirements when running models in LitGPT.

For additional tips on lowering the GPU memory footprint, please also see the Dealing with out-of-memory (OOM) errors document.

All experiments were run using 16-bit brain floating point precision (--precision bf16-true). If your GPU does not support brain floating point precision, you can use regular 16-bit floating point precision (--precision 16-true).
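The fallback described above can be sketched as a small helper. This is only an illustration: `pick_precision` and `detect_bf16_support` are hypothetical names, not part of LitGPT, and the detection step assumes PyTorch is installed and simply reports no support when PyTorch or a CUDA device is unavailable.

```python
def pick_precision(bf16_supported: bool) -> str:
    """Return the --precision value suggested above for the given hardware."""
    return "bf16-true" if bf16_supported else "16-true"


def detect_bf16_support() -> bool:
    """Best-effort check; returns False when PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return False
    # is_bf16_supported() is only meaningful when a CUDA device is present.
    return bool(torch.cuda.is_available() and torch.cuda.is_bf16_supported())


if __name__ == "__main__":
    # Prints the flag one might pass on the command line.
    print("--precision", pick_precision(detect_bf16_support()))
```

On a machine without a GPU the helper falls back to 16-true, which may not be the right choice there; it is meant only for choosing between the two 16-bit options mentioned above.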

All experiments were conducted using the Alpaca dataset with its default length. Because the models use different tokenizers, the number of tokens in the longest training example differs by model:

  • phi-1.5: 1044 tokens
  • StableLM Alpha: 1034 tokens
  • Llama 2: 1304 tokens
  • Falcon: 1079 tokens

Note that the number of tokens in the training set does not affect the supported context width (block size) of the models, which is as follows:

  • phi-1.5: 2048 tokens
  • StableLM 3B Alpha: 4096 tokens
  • Llama 2: 4096 tokens
  • Falcon: 2048 tokens
  • CodeLlama 13B: 16384 tokens

Finetuning with LoRA on 1 GPU

The following experiments were conducted on 1xA100 with a minibatch size of 128 using the litgpt finetune_lora command.

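| Size | Model | Quantization | Microbatch size | Trainable parameters | Max GPU RAM | Time for 1k iterations |
|---|---|---|---|---|---|---|
| 1.3 B | phi-1.5 | None | 1 | 1,572,864 | 4.82 GB | 1.62 min |
| 1.3 B | phi-1.5 | bnb.nf4 | 1 | 1,572,864 | 3.78 GB | 1.77 min |
| 1.3 B | phi-1.5 | bnb.nf4-dq | 1 | 1,572,864 | 3.72 GB | 1.87 min |
| 1.3 B | phi-1.5 | None | 2 | 1,572,864 | 6.76 GB | 1.65 min |
| 1.3 B | phi-1.5 | None | 4 | 1,572,864 | 10.68 GB | 1.70 min |
| 3 B | StableLM Alpha | None | 1 | 2,097,152 | 9.69 GB | 1.24 min |
| 3 B | StableLM Alpha | bnb.nf4 | 1 | 2,097,152 | 6.35 GB | 1.82 min |
| 3 B | StableLM Alpha | bnb.nf4-dq | 1 | 2,097,152 | 6.19 GB | 1.87 min |
| 3 B | StableLM Alpha | None | 2 | 2,097,152 | 12.10 GB | 1.33 min |
| 3 B | StableLM Alpha | None | 4 | 2,097,152 | 16.92 GB | 1.50 min |
| 7 B | Llama 2 | None | 1 | 4,194,304 | 21.30 GB | 2.36 min |
| 7 B | Llama 2 | bnb.nf4 | 1 | 4,194,304 | 14.14 GB | 3.68 min |
| 7 B | Llama 2 | bnb.nf4-dq | 1 | 4,194,304 | 13.84 GB | 3.83 min |
| 7 B | Llama 2 | None | 2 | 4,194,304 | 29.07 GB | 2.52 min |
| 7 B | Llama 2 | None | 4 | 4,194,304 | OOM | - |
| 13 B | Llama 2 | None | 1 | 6,553,600 | 38.12 GB | 3.19 min |
| 13 B | Llama 2 | bnb.nf4 | 1 | 6,553,600 | 23.14 GB | 6.38 min |
| 13 B | Llama 2 | bnb.nf4-dq | 1 | 6,553,600 | 22.55 GB | 6.55 min |
| 13 B | Llama 2 | None | 2 | 6,553,600 | OOM | - |
| 13 B | Llama 2 | None | 4 | 6,553,600 | OOM | - |
| 40 B | Falcon | None | 1 | 12,042,240 | OOM | - |
| 40 B | Falcon | bnb.nf4 | 1 | 12,042,240 | OOM | - |
| 40 B | Falcon | bnb.nf4-dq | 1 | 12,042,240 | OOM | - |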
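A minibatch size of 128 combined with a microbatch size of 1, 2, or 4 implies that gradients are accumulated over several forward/backward passes before each optimizer step. The arithmetic can be sketched as follows; `accumulation_steps` is a hypothetical helper for illustration and is not claimed to match LitGPT's internal implementation.

```python
def accumulation_steps(minibatch_size: int, micro_batch_size: int) -> int:
    """Number of microbatches processed per optimizer step on a single GPU.

    Assumes the minibatch is split evenly into microbatches.
    """
    if micro_batch_size <= 0 or minibatch_size % micro_batch_size != 0:
        raise ValueError("minibatch size must be a positive multiple of the microbatch size")
    return minibatch_size // micro_batch_size


if __name__ == "__main__":
    # Microbatch sizes used in the table above, with a minibatch size of 128.
    for micro_batch_size in (1, 2, 4):
        steps = accumulation_steps(128, micro_batch_size)
        print(f"microbatch size {micro_batch_size}: {steps} accumulation steps")
```

A larger microbatch size keeps more examples in flight per pass, which is consistent with the higher Max GPU RAM values for microbatch sizes 2 and 4 in the table.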

Finetuning with Adapter on 1 GPU

The following experiments were conducted on 1xA100 with a minibatch size of 128 using the litgpt finetune_adapter command.

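| Size | Model | Quantization | Microbatch size | Trainable parameters | Max GPU RAM | Time for 1k iterations |
|---|---|---|---|---|---|---|
| 3 B | StableLM Alpha | None | 1 | 573,888 | 9.10 GB | 0.74 min |
| 3 B | StableLM Alpha | bnb.nf4 | 1 | 573,888 | 5.65 GB | 1.38 min |
| 3 B | StableLM Alpha | bnb.nf4-dq | 1 | 573,888 | 5.48 GB | 1.46 min |
| 7 B | Llama 2 | None | 1 | 1,229,760 | 19.98 GB | 1.50 min |
| 7 B | Llama 2 | bnb.nf4 | 1 | 1,229,760 | 12.68 GB | 2.93 min |
| 7 B | Llama 2 | bnb.nf4-dq | 1 | 1,229,760 | 12.38 GB | 3.00 min |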

The following experiments used the same configuration with the litgpt finetune_adapter_v2 command.

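| Size | Model | Quantization | Microbatch size | Trainable parameters | Max GPU RAM | Time for 1k iterations |
|---|---|---|---|---|---|---|
| 3 B | StableLM Alpha | None | 1 | 2,125,248 | 10.71 GB | 0.87 min |
| 3 B | StableLM Alpha | bnb.nf4 | 1 | 2,125,248 | 7.41 GB | 1.59 min |
| 3 B | StableLM Alpha | bnb.nf4-dq | 1 | 2,125,248 | 7.25 GB | 1.62 min |
| 7 B | Llama 2 | None | 1 | 4,279,744 | 25.51 GB | 1.81 min |
| 7 B | Llama 2 | bnb.nf4 | 1 | 4,279,744 | 18.30 GB | 3.23 min |
| 7 B | Llama 2 | bnb.nf4-dq | 1 | 4,279,744 | 17.98 GB | 3.32 min |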
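To put the trainable-parameter counts in perspective, the sketch below expresses them as a percentage of the full model for Llama 2 7B. The trainable counts are taken from the LoRA, Adapter, and Adapter v2 tables; the total of 7 billion is the nominal model size and an assumption here, since the exact parameter count is not listed in this document.

```python
# Assumption: nominal size of Llama 2 7B, not its exact parameter count.
NOMINAL_TOTAL_PARAMS = 7_000_000_000

# Trainable parameters for Llama 2 7B, as reported in the tables above.
TRAINABLE_PARAMS = {
    "LoRA": 4_194_304,
    "Adapter": 1_229_760,
    "Adapter v2": 4_279_744,
}


def trainable_percent(trainable: int, total: int) -> float:
    """Share of trainable parameters, in percent of the total."""
    return 100.0 * trainable / total


if __name__ == "__main__":
    for method, count in TRAINABLE_PARAMS.items():
        share = trainable_percent(count, NOMINAL_TOTAL_PARAMS)
        print(f"{method}: {share:.3f}% of parameters are trainable")
```

Under this assumption, every method trains well under 0.1% of the model's parameters.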

Finetuning with LoRA on Multiple GPUs

The following experiments were conducted on multiple A100 GPUs with a minibatch size of 128 using the litgpt finetune_lora command.

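| Size | Model | Quantization | Microbatch size | Trainable parameters | GPU | Max GPU RAM | Time for 1k iterations |
|---|---|---|---|---|---|---|---|
| 1.3 B | phi-1.5 | None | 1 | 1,572,864 | 2 x A100 | 4.86 GB | 3.81 min |
| 1.3 B | phi-1.5 | bnb.nf4 | 1 | 1,572,864 | 2 x A100 | N/A | - |
| 1.3 B | phi-1.5 | bnb.nf4-dq | 1 | 1,572,864 | 2 x A100 | N/A | - |
| 1.3 B | phi-1.5 | None | 2 | 1,572,864 | 2 x A100 | 5.05 GB | 3.63 min |
| 1.3 B | phi-1.5 | None | 4 | 1,572,864 | 2 x A100 | 5.88 GB | 3.64 min |
| 3 B | StableLM Alpha | None | 1 | 2,097,152 | 2 x A100 | 12.75 GB | 2.92 min |
| 3 B | StableLM Alpha | None | 2 | 2,097,152 | 2 x A100 | 12.94 GB | 3.06 min |
| 3 B | StableLM Alpha | None | 4 | 2,097,152 | 2 x A100 | 13.45 GB | 3.86 min |
| 7 B | Llama 2 | None | 1 | 4,194,304 | 2 x A100 | 22.18 GB | 5.93 min |
| 7 B | Llama 2 | None | 2 | 4,194,304 | 2 x A100 | 22.47 GB | 6.48 min |
| 7 B | Llama 2 | None | 4 | 4,194,304 | 2 x A100 | 23.39 GB | 8.66 min |
| 13 B | Llama 2 | None | 1 | 6,553,600 | 2 x A100 | OOM | - |
| 13 B | Llama 2 | bnb.nf4 | 1 | 6,553,600 | 2 x A100 | N/A | - |
| 13 B | Llama 2 | bnb.nf4-dq | 1 | 6,553,600 | 2 x A100 | N/A | - |
| 13 B | Llama 2 | None | 1 | 6,553,600 | 4 x A100 | 35.57 GB | 10.25 min |
| 40 B | Falcon | None | 1 | 12,042,240 | 4 x A100 | OOM | - |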

Single-GPU Inference

| Size | Model | Quantization | GPU | Max GPU RAM | Tokens/sec |
|---|---|---|---|---|---|
| 1.3 B | phi-1.5 | None | 1 x A100 | 2.86 GB | 42.56 |
| 1.3 B | phi-1.5 | bnb.nf4 | 1 x A100 | 1.39 GB | 22.89 |
| 1.3 B | phi-1.5 | bnb.nf4-dq | 1 x A100 | 1.33 GB | 22.75 |
| 3 B | StableLM Alpha | None | 1 x A100 | 7.30 GB | 49.01 |
| 3 B | StableLM Alpha | bnb.nf4 | 1 x A100 | 3.20 GB | 29.04 |
| 3 B | StableLM Alpha | bnb.nf4-dq | 1 x A100 | 3.04 GB | 27.15 |
| 7 B | Llama 2 | None | 1 x A100 | 13.52 GB | 30.97 |
| 7 B | Llama 2 | bnb.nf4 | 1 x A100 | 4.57 GB | 19.98 |
| 7 B | Llama 2 | bnb.nf4-dq | 1 x A100 | 4.26 GB | 17.3 |
| 13 B | Llama 2 | None | 1 x A100 | 26.21 GB | 24.82 |
| 13 B | Llama 2 | bnb.nf4 | 1 x A100 | 8.32 GB | 16.73 |
| 13 B | Llama 2 | bnb.nf4-dq | 1 x A100 | 7.72 GB | 14.43 |
| 34 B | CodeLlama | None | 1 x A100 | OOM | - |
| 34 B | CodeLlama | bnb.nf4 | 1 x A100 | 20.52 GB | 14.32 |
| 34 B | CodeLlama | bnb.nf4-dq | 1 x A100 | 18.95 GB | 12.37 |
| 40 B | Falcon | None | 1 x A100 | OOM | - |
| 40 B | Falcon | bnb.nf4 | 1 x A100 | 26.55 GB | 13.25 |
| 40 B | Falcon | bnb.nf4-dq | 1 x A100 | 24.63 GB | 11.64 |
| 70 B | Llama 2 | None | 1 x A100 | OOM | - |
| 70 B | Llama 2 | bnb.nf4 | 1 x A100 | CUDA error: CUBLAS_STATUS_NOT_INITIALIZED | - |
| 70 B | Llama 2 | bnb.nf4-dq | 1 x A100 | 37.21 GB | 7.97 |
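As a rough cross-check of the inference numbers above, the memory needed for the weights alone can be estimated from the parameter count and the bits stored per parameter. The sketch below is a back-of-the-envelope estimate under stated assumptions: it uses nominal model sizes (7 and 70 billion parameters), treats bnb.nf4 as exactly 4 bits per weight, and ignores activations, the KV cache, quantization constants, and any layers kept in higher precision. `weight_memory_gb` and `fits_on_gpu` are hypothetical helper names.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def fits_on_gpu(required_gb: float, capacity_gb: float = 40.0) -> bool:
    """Compare against the 40 GB A100 used in this document (weights only)."""
    return required_gb < capacity_gb


if __name__ == "__main__":
    # Nominal parameter counts; the exact counts differ slightly.
    for name, params in (("Llama 2 7B", 7e9), ("Llama 2 70B", 70e9)):
        for label, bits in (("bf16", 16), ("4-bit", 4)):
            est = weight_memory_gb(params, bits)
            print(f"{name} {label}: ~{est:.1f} GB, fits in 40 GB: {fits_on_gpu(est)}")
```

The estimates line up reasonably with the table: about 14 GB for Llama 2 7B in bf16 (13.52 GB measured), and about 140 GB for Llama 2 70B in bf16, which is consistent with the OOM entry, versus about 35 GB at 4 bits (37.21 GB measured with bnb.nf4-dq).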