Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward

April 23, 2024

This repository contains the open-source code and benchmark results for the paper "Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward".
The benchmark assesses the performance of various compression and inference methods.

Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Deepak Gupta, Merouane Debbah
Nyun AI, Transmute AI Lab, KU 6G Research Center

Getting Started

All experiments are performed in isolated Python 3.10 environments with method-specific requirements (package and library versions). The exact repositories and branches can be inferred from .gitmodules.

Repository Organization

The repository follows a structured format with a branch naming convention of "A100<method>", where <method> denotes the specific evaluation method. Each branch is organized as follows:

  • engine/: Implementation of inference-engine methods, along with setup, generation, and bench scripts.

  • prune/: Implementation of pruning methods, along with setup, generation, and bench scripts.

  • quant/: Implementation of quantization methods, along with setup, generation, and bench scripts.

  • exports/: Shared export folder, structured as exports/MODEL_NAME/METHOD_TYPE/METHOD_NAME/.

  • experiments/: Formatted benchmarking results in notebooks, including metrics such as RM (run memory) and WM (weight memory), plus GPU utilization graphs.

Please note that all setup, generation, and benchmarking scripts (.sh) are kept as up to date as possible with the latest runs and target Python 3.10 with CUDA 12.1 (or the version required by the method). For other configurations, the scripts may need adjustment, or a different script should be used.
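As an illustration of the shared export layout above, a result path can be assembled as follows (the model and method names here are illustrative, not taken from the repository):

```python
from pathlib import Path

def export_dir(model_name: str, method_type: str, method_name: str) -> Path:
    # Mirrors the shared layout: exports/MODEL_NAME/METHOD_TYPE/METHOD_NAME/
    return Path("exports") / model_name / method_type / method_name

path = export_dir("Llama-2-7b-hf", "quant", "gptq-4bit")
print(path.as_posix())  # exports/Llama-2-7b-hf/quant/gptq-4bit
```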

Branch Overview

The repository contains a dedicated branch for each evaluated method:

  • A100Exllama: Investigates the Exllama engine with GPTQ quantization.

  • A100Exllamav2: Explores the latest ExllamaV2 featuring EXL2 & GPTQ quantizations.

  • A100Llamacpp: Examines llama.cpp, a C/C++ implementation of the Llama architecture for enhanced speed.

  • A100MLCLLM: Explores MLC-LLM, offering extensive hardware and platform support.

  • A100TGI: Investigates the Text Generation Inference toolkit, employed for LLM inferences in production.

  • A100VLLM: Explores vLLM, a high-throughput inference and serving engine.

  • A100TensorRTLLM: Investigates NVIDIA's TensorRT-LLM inference engine.

  • A100GPTQ: Explores the GPTQ quantization method through AutoGPTQ.

  • A100HF: Investigates multiple quantization methods, alongside baseline generation speeds for each method.

  • A100Omniquant: Explores the OmniQuant quantization method.

Note: Each branch has its own set of updated scripts, which may or may not be synchronized with other branches. Some quantization methods lack dedicated branches; their scripts can be referenced directly in the respective branches or in the main branch. Models and scales obtained directly from the HF Hub were also used as needed.

Results Overview

Pruning

| Method | Sparsity | RM (GB) | WM (GB) | Tokens/s | Perplexity |
|---|---|---|---|---|---|
| Baseline | - | 26.16 | 12.55 | 30.90 | 12.62 |
| Wanda-SP | 20% | - | - | - | 22.12 |
| Wanda-SP | 50% | - | - | - | 366.43 |
| LLM-Pruner | 20% | 10.38 | 10.09 | 32.57 | 19.77 |
| LLM-Pruner | 50% | 6.54 | 6.23 | 40.95 | 112.44 |
| LLM-Pruner* | 20% | 10.38 | 10.09 | 32.57 | 17.37 |
| LLM-Pruner* | 50% | 6.54 | 6.23 | 40.95 | 38.12 |
| FLaP | 20% | 9.72 | 9.44 | 33.90 | 14.62 |
| FLaP | 50% | 6.26 | 6.07 | 42.88 | 31.80 |

* with fine-tuning
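As a rough sanity check on the WM column: structured pruning removes parameters outright, so FP16 weight memory shrinks in proportion to (1 - sparsity). A back-of-the-envelope sketch, assuming ~6.74B parameters for Llama-2-7B (a figure not stated in this README):

```python
def weight_memory_gib(n_params: float, sparsity: float = 0.0,
                      bytes_per_param: float = 2.0) -> float:
    # FP16 stores 2 bytes per surviving parameter; report in GiB.
    return n_params * (1.0 - sparsity) * bytes_per_param / 2**30

weight_memory_gib(6.74e9)                # ~12.55 GiB, matching the FP16 baseline WM
weight_memory_gib(6.74e9, sparsity=0.5)  # ~6.3 GiB, close to LLM-Pruner/FLaP at 50%
```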

Quantization

| Method | Inference Engine | WM (GB) | RM (GB) | Tokens/s | Perplexity |
|---|---|---|---|---|---|
| Baseline FP16 | PyTorch | 12.55 | 26.16 | 30.90 | 5.85 |
| GPTQ 2bit | PyTorch | 2.11 | 2.98 | 20.91 | NaN |
| GPTQ 3bit | PyTorch | 2.87 | 3.86 | 21.24 | 7.36 |
| GPTQ 4bit | PyTorch | 3.63 | 4.65 | 21.63 | 6.08 |
| GPTQ 8bit | PyTorch | 6.67 | 7.62 | 21.36 | 5.86 |
| AWQ 4bit GEMM | PyTorch | 3.68 | 4.64 | 28.51 | 6.02 |
| AWQ 4bit GEMV | PyTorch | 3.68 | 4.64 | 31.81 | 6.02 |
| QLoRA (NF4) | PyTorch | 3.56 | 4.84 | 19.70 | 6.02 |
| LLM.int8() | PyTorch | 6.58 | 7.71 | 5.24 | 5.89 |
| K-Quants 4bit | Llama.cpp | 3.80 | 7.38 | 104.45 | 5.96 |
| OmniQuant 3bit | MLC-LLM | 3.20 | 5.10 | 83.4 | 6.65 |
| OmniQuant 4bit | MLC-LLM | 3.80 | 5.70 | 134.2 | 5.97 |
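The WM numbers follow roughly from bits per weight. An idealized estimate (again assuming ~6.74B parameters, an assumption of this sketch) differs somewhat from the measured values, since real checkpoints add per-group scales and zero-points and keep some tensors in FP16:

```python
def quantized_wm_gb(n_params: float, bits: float) -> float:
    # Idealized weight memory for uniform b-bit quantization (no metadata).
    return n_params * bits / 8 / 1e9

quantized_wm_gb(6.74e9, 4)  # ~3.37 GB vs. 3.63 GB measured for GPTQ 4-bit
```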

Engine Results

| Method | Hardware Support | Quantization Type | WM (GB) | RM (GB) | Tokens/s | Perplexity |
|---|---|---|---|---|---|---|
| Llama.cpp | NVIDIA GPU, AMD GPU, Apple Silicon, CPU | GGUF K-Quant 2bit | 2.36 | 3.69 | 102.15 | 6.96 |
| | | GGUF 4bit | 3.56 | 4.88 | 128.97 | 5.96 |
| | | GGUF AWQ 4bit | 3.56 | 4.88 | 129.25 | 5.91 |
| | | GGUF K-Quant 4bit | 3.59 | 4.90 | 109.72 | 5.87 |
| | | GGUF 8bit | 6.67 | 7.78 | 93.39 | 5.79 |
| | | GGUF FP16 | 12.55 | 13.22 | 66.81 | 5.79 |
| ExLlama | NVIDIA GPU, AMD GPU | GPTQ 4bit | 3.63 | 5.35 | 77.10 | 6.08 |
| ExLlamav2 | NVIDIA GPU, AMD GPU | EXL2 2bit | 2.01 | 5.21 | 153.75 | 20.21 |
| | | EXL2 4bit | 3.36 | 6.61 | 131.68 | 6.12 |
| | | GPTQ 4bit | 3.63 | 6.93 | 151.30 | 6.03 |
| | | EXL2 8bit | 6.37 | 9.47 | 115.81 | 5.76 |
| | | FP16 | 12.55 | 15.09 | 67.70 | 5.73 |
| vLLM | NVIDIA GPU, AMD GPU | AWQ GEMM 4bit | 3.62 | 34.55 | 114.43 | 6.02 |
| | | GPTQ 4bit | 3.63 | 36.51 | 172.88 | 6.08 |
| | | FP16 | 12.55 | 35.92 | 79.74 | 5.85 |
| TensorRT-LLM | NVIDIA GPU | AWQ GEMM 4bit | 3.42 | 5.69 | 194.86 | 6.02 |
| | | GPTQ 4bit | 3.60 | 5.88 | 202.16 | 6.08 |
| | | INT8 | 6.53 | 8.55 | 143.57 | 5.89 |
| | | FP16 | 12.55 | 14.61 | 83.43 | 5.85 |
| TGI | NVIDIA GPU, AMD GPU, Intel GPU, AWS Inferentia2 | AWQ GEMM 4bit | 3.62 | 36.67 | 106.84 | 6.02 |
| | | GPTQ 4bit | 3.69 | 37.85 | 163.22 | 6.08 |
| | | FP4 | 12.55 | 37.21 | 36.91 | 6.15 |
| | | NF4 | 12.55 | 37.21 | 36.32 | 6.02 |
| | | BF16 | 12.55 | 38.03 | 73.59 | 5.89 |
| | | FP16 | 12.55 | 38.03 | 74.19 | 5.85 |
| MLC-LLM | NVIDIA GPU, AMD GPU, CPU, WebGPU, Apple Silicon, Intel GPU, WASM, Adreno, Mali | OmniQuant 3bit | 3.2 | 5.1 | 83.4 | 6.65 |
| | | OmniQuant 4bit | 3.8 | 5.7 | 134.2 | 5.97 |
| | | FP16 | 12.55 | 15.38 | 87.37 | 5.85 |
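The Tokens/s columns report decode throughput. A generic way to measure it is sketched below, with a stand-in `generate` callable rather than any specific engine's API (the lambda here only simulates generation latency):

```python
import time

def tokens_per_second(generate, n_new_tokens: int) -> float:
    # Wall-clock decode throughput: generated tokens / elapsed seconds.
    start = time.perf_counter()
    generate(n_new_tokens)  # engine-specific generation call goes here
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

rate = tokens_per_second(lambda n: time.sleep(0.01), 50)
```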

Acknowledgements

We extend our gratitude to the following repositories and sources for providing essential methods, engines, and datasets utilized in our benchmarking project:

  1. Llama-2-7b-hf - Hugging Face model repository for Llama-2-7b.
  2. llama.cpp - Source for llama.cpp, a key engine method used in our benchmarks.
  3. exllama - Repository for the ExLlama engine method.
  4. exllamav2 - Source for ExLlamaV2 engine method.
  5. alpaca-cleaned - Alpaca dataset on Hugging Face, utilized in our benchmarks.
  6. squeezellm - Repository for SqueezeLLM quantization method.
  7. squeezellmgradients - Repository for SqueezeLLM-gradients.
  8. omniquant - Source for OmniQuant quantization method.
  9. mlcllm - Repository for the MLC-LLM engine method.
  10. llmpruner - Source for LLM-Pruner pruning method.
  11. tensorrtllm - Source for TensorRT-LLM engine method (branch: release/0.5.0).
  13. autogptq - Repository for AutoGPTQ, offering a quantization package based on the GPTQ algorithm.
  13. autoawq - Repository for AutoAWQ, implementing the AWQ algorithm for 4-bit quantization.
  14. vllm - Source for the vLLM package, offering the inference and serving engine.

These resources have been instrumental in conducting the benchmarks and evaluations. We appreciate the creators and maintainers of these repositories for their valuable contributions.