Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward

April 23, 2024

This repository contains the open-source code and benchmark results for the paper "Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward".
The benchmark assesses the performance of various compression and inference methods.

Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Deepak Gupta, Merouane Debbah
Nyun AI, Transmute AI Lab, KU 6G Research Center

Getting Started

All experiments are performed in isolated Python 3.10 environments with method-specific requirements (package and library versions). The exact repositories and branches can be inferred from .gitmodules.

Repository Organization

The repository follows a structured format with a branch naming convention of "A100<method>", where <method> denotes the specific evaluation method. Each branch is organized as follows:

  • engine/: Implementation of inference-engine methods, along with setup, generation, and bench scripts.

  • prune/: Implementation of pruning methods, along with setup, generation, and bench scripts.

  • quant/: Implementation of quantization methods, along with setup, generation, and bench scripts.

  • exports/: Shared export folder, structured as exports/MODEL_NAME/METHOD_TYPE/METHOD_NAME/.

  • experiments/: Formatted benchmarking results in notebooks, including metrics such as RM (run memory) and WM (weight memory), plus GPU utilization graphs.

Please note that all setup, generation, and benchmarking scripts (.sh) are kept as up to date as possible with the latest runs and target Python 3.10 with CUDA 12.1 (or the version required by the method). For other configurations, the scripts may need adjustment, or a different script should be used.
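As an illustration of the shared export layout above, a result path can be assembled as follows (the model and method names here are illustrative, not taken from the repository):

```python
from pathlib import Path

def export_dir(model_name: str, method_type: str, method_name: str) -> Path:
    # Mirrors the shared layout: exports/MODEL_NAME/METHOD_TYPE/METHOD_NAME/
    return Path("exports") / model_name / method_type / method_name

path = export_dir("Llama-2-7b-hf", "quant", "gptq-4bit")
print(path.as_posix())  # exports/Llama-2-7b-hf/quant/gptq-4bit
```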

Branch Overview

The repository contains a dedicated branch for each evaluated method:

  • A100Exllama: Investigates the Exllama engine with GPTQ quantization.

  • A100Exllamav2: Explores the latest ExllamaV2 featuring EXL2 & GPTQ quantizations.

  • A100Llamacpp: Examines llama.cpp, a C/C++ implementation of the Llama architecture for enhanced speed.

  • A100MLCLLM: Explores MLC-LLM, offering extensive hardware and platform support.

  • A100TGI: Investigates the Text Generation Inference toolkit, employed for LLM inferences in production.

  • A100VLLM: Explores vLLM, a high-throughput inference and serving engine.

  • A100TensorRTLLM: Investigates NVIDIA's TensorRT-LLM inference engine.

  • A100GPTQ: Explores the GPTQ quantization method through AutoGPTQ.

  • A100HF: Investigates multiple quantization methods, alongside baseline generation speeds for each method.

  • A100Omniquant: Explores the OmniQuant quantization method.

Note: Each branch has its own set of updated scripts, which may or may not be synchronized with other branches. Some quantization methods lack dedicated branches; their scripts can be referenced directly in the respective branches or in the main branch. Models and scales obtained directly from the HF Hub were also used as needed.

Results Overview

Pruning

| Method | Sparsity | RM (GB) | WM (GB) | Tokens/s | Perplexity |
|---|---|---|---|---|---|
| Baseline | - | 26.16 | 12.55 | 30.90 | 12.62 |
| Wanda-SP | 20% | - | - | - | 22.12 |
| Wanda-SP | 50% | - | - | - | 366.43 |
| LLM-Pruner | 20% | 10.38 | 10.09 | 32.57 | 19.77 |
| LLM-Pruner | 50% | 6.54 | 6.23 | 40.95 | 112.44 |
| LLM-Pruner* | 20% | 10.38 | 10.09 | 32.57 | 17.37 |
| LLM-Pruner* | 50% | 6.54 | 6.23 | 40.95 | 38.12 |
| FLaP | 20% | 9.72 | 9.44 | 33.90 | 14.62 |
| FLaP | 50% | 6.26 | 6.07 | 42.88 | 31.80 |

* with fine-tuning
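As a rough sanity check on the WM column: structured pruning removes parameters outright, so FP16 weight memory shrinks in proportion to (1 - sparsity). A back-of-the-envelope sketch, assuming ~6.74B parameters for Llama-2-7B (a figure not stated in this README):

```python
def weight_memory_gib(n_params: float, sparsity: float = 0.0,
                      bytes_per_param: float = 2.0) -> float:
    # FP16 stores 2 bytes per surviving parameter; report in GiB.
    return n_params * (1.0 - sparsity) * bytes_per_param / 2**30

weight_memory_gib(6.74e9)                # ~12.55 GiB, matching the FP16 baseline WM
weight_memory_gib(6.74e9, sparsity=0.5)  # ~6.3 GiB, close to LLM-Pruner/FLaP at 50%
```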

Quantization

| Method | Inference Engine | WM (GB) | RM (GB) | Tokens/s | Perplexity |
|---|---|---|---|---|---|
| Baseline FP16 | PyTorch | 12.55 | 26.16 | 30.90 | 5.85 |
| GPTQ 2bit | PyTorch | 2.11 | 2.98 | 20.91 | NaN |
| GPTQ 3bit | PyTorch | 2.87 | 3.86 | 21.24 | 7.36 |
| GPTQ 4bit | PyTorch | 3.63 | 4.65 | 21.63 | 6.08 |
| GPTQ 8bit | PyTorch | 6.67 | 7.62 | 21.36 | 5.86 |
| AWQ 4bit GEMM | PyTorch | 3.68 | 4.64 | 28.51 | 6.02 |
| AWQ 4bit GEMV | PyTorch | 3.68 | 4.64 | 31.81 | 6.02 |
| QLoRA (NF4) | PyTorch | 3.56 | 4.84 | 19.70 | 6.02 |
| LLM.int8() | PyTorch | 6.58 | 7.71 | 5.24 | 5.89 |
| K-Quants 4bit | Llama.cpp | 3.80 | 7.38 | 104.45 | 5.96 |
| OmniQuant 3bit | MLC-LLM | 3.20 | 5.10 | 83.4 | 6.65 |
| OmniQuant 4bit | MLC-LLM | 3.80 | 5.70 | 134.2 | 5.97 |
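The WM numbers follow roughly from bits per weight. An idealized estimate (again assuming ~6.74B parameters, an assumption of this sketch) differs somewhat from the measured values, since real checkpoints add per-group scales and zero-points and keep some tensors in FP16:

```python
def quantized_wm_gb(n_params: float, bits: float) -> float:
    # Idealized weight memory for uniform b-bit quantization (no metadata).
    return n_params * bits / 8 / 1e9

quantized_wm_gb(6.74e9, 4)  # ~3.37 GB vs. 3.63 GB measured for GPTQ 4-bit
```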

Engine Results

| Method | Hardware Support | Quantization Type | WM (GB) | RM (GB) | Tokens/s | Perplexity |
|---|---|---|---|---|---|---|
| Llama.cpp | NVIDIA GPU, AMD GPU, Apple Silicon, CPU | GGUF K-Quant 2bit | 2.36 | 3.69 | 102.15 | 6.96 |
| | | GGUF 4bit | 3.56 | 4.88 | 128.97 | 5.96 |
| | | GGUF AWQ 4bit | 3.56 | 4.88 | 129.25 | 5.91 |
| | | GGUF K-Quant 4bit | 3.59 | 4.90 | 109.72 | 5.87 |
| | | GGUF 8bit | 6.67 | 7.78 | 93.39 | 5.79 |
| | | GGUF FP16 | 12.55 | 13.22 | 66.81 | 5.79 |
| ExLlama | NVIDIA GPU, AMD GPU | GPTQ 4bit | 3.63 | 5.35 | 77.10 | 6.08 |
| ExLlamav2 | NVIDIA GPU, AMD GPU | EXL2 2bit | 2.01 | 5.21 | 153.75 | 20.21 |
| | | EXL2 4bit | 3.36 | 6.61 | 131.68 | 6.12 |
| | | GPTQ 4bit | 3.63 | 6.93 | 151.30 | 6.03 |
| | | EXL2 8bit | 6.37 | 9.47 | 115.81 | 5.76 |
| | | FP16 | 12.55 | 15.09 | 67.70 | 5.73 |
| vLLM | NVIDIA GPU, AMD GPU | AWQ GEMM 4bit | 3.62 | 34.55 | 114.43 | 6.02 |
| | | GPTQ 4bit | 3.63 | 36.51 | 172.88 | 6.08 |
| | | FP16 | 12.55 | 35.92 | 79.74 | 5.85 |
| TensorRT-LLM | NVIDIA GPU | AWQ GEMM 4bit | 3.42 | 5.69 | 194.86 | 6.02 |
| | | GPTQ 4bit | 3.60 | 5.88 | 202.16 | 6.08 |
| | | INT8 | 6.53 | 8.55 | 143.57 | 5.89 |
| | | FP16 | 12.55 | 14.61 | 83.43 | 5.85 |
| TGI | NVIDIA GPU, AMD GPU, Intel GPU, AWS Inferentia2 | AWQ GEMM 4bit | 3.62 | 36.67 | 106.84 | 6.02 |
| | | GPTQ 4bit | 3.69 | 37.85 | 163.22 | 6.08 |
| | | FP4 | 12.55 | 37.21 | 36.91 | 6.15 |
| | | NF4 | 12.55 | 37.21 | 36.32 | 6.02 |
| | | BF16 | 12.55 | 38.03 | 73.59 | 5.89 |
| | | FP16 | 12.55 | 38.03 | 74.19 | 5.85 |
| MLC-LLM | NVIDIA GPU, AMD GPU, CPU, WebGPU, Apple Silicon, Intel GPU, WASM, Adreno, Mali | OmniQuant 3bit | 3.2 | 5.1 | 83.4 | 6.65 |
| | | OmniQuant 4bit | 3.8 | 5.7 | 134.2 | 5.97 |
| | | FP16 | 12.55 | 15.38 | 87.37 | 5.85 |
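The Tokens/s columns report decode throughput. A generic way to measure it is sketched below, with a stand-in `generate` callable rather than any specific engine's API (the lambda here only simulates generation latency):

```python
import time

def tokens_per_second(generate, n_new_tokens: int) -> float:
    # Wall-clock decode throughput: generated tokens / elapsed seconds.
    start = time.perf_counter()
    generate(n_new_tokens)  # engine-specific generation call goes here
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

rate = tokens_per_second(lambda n: time.sleep(0.01), 50)
```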

Acknowledgements

We extend our gratitude to the following repositories and sources for providing essential methods, engines, and datasets utilized in our benchmarking project:

  1. Llama-2-7b-hf - Hugging Face model repository for Llama-2-7b.
  2. llama.cpp - Source for llama.cpp, a key engine method used in our benchmarks.
  3. exllama - Repository for the ExLlama engine method.
  4. exllamav2 - Source for ExLlamaV2 engine method.
  5. alpaca-cleaned - Alpaca dataset on Hugging Face, utilized in our benchmarks.
  6. squeezellm - Repository for SqueezeLLM quantization method.
  7. squeezellmgradients - Repository for SqueezeLLM-gradients.
  8. omniquant - Source for OmniQuant quantization method.
  9. mlcllm - Repository for the MLC-LLM engine method.
  10. llmpruner - Source for LLM-Pruner pruning method.
  11. tensorrtllm - Source for TensorRT-LLM engine method (branch: release/0.5.0).
  13. autogptq - Repository for AutoGPTQ, offering a quantization package based on the GPTQ algorithm.
  13. autoawq - Repository for AutoAWQ, implementing the AWQ algorithm for 4-bit quantization.
  14. vllm - Source for the vLLM package, offering the inference and serving engine.

These resources have been instrumental in conducting the benchmarks and evaluations. We appreciate the creators and maintainers of these repositories for their valuable contributions.