llm-inference-benchmark

December 16, 2023 ยท View on GitHub

LLM Inference benchmark

Inference frameworks

FrameworkProducibility****Docker ImageAPI ServerOpenAI API ServerWebUIMulti Models**Multi-nodeBackendsEmbedding Model
text-generation-webuiLowYesYesYesYesNoNoTransformers/llama.cpp/ExLlama/ExLlamaV2/AutoGPTQ/AutoAWQ/GPTQ-for-LLaMa/CTransformersNo
OpenLLMHighYesYesYesNoWith BentoMLWith BentoMLTransformers(int8,int4,gptq), vLLM(awq/squeezellm), TensorRTNo
vLLM*HighYesYesYesNoNoYes(With Ray)vLLMNo
XinferenceHighYesYesYesYesYesYesTransformers/vLLM/TensorRT/GGMLYes
TGI***MediumYesYesNoNoNoNoTransformers/AutoGPTQ/AWQ/EETP/vLLM/ExLlama/ExLlamaV2No
ScaleLLMMediumYesYesYesYesNoNoTransformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2No
FastChatHighYesYesYesYesYesYesTransformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2Yes
  • *vLLM/TGI can also serve as a backend.
  • **Multi Models: Capable of loading multiple models simultaneously.
  • ***TGI does not support chat mode; manual parsing of the prompt is required.

Inference backends

BackendDeviceCompatibility**PEFT Adapters*QuatisationBatchingDistributed InferenceStreaming
TransformersGPUHighYesbitsandbytes(int8/int4), AutoGPTQ(gptq), AutoAWQ(awq)YesaccelerateYes
vLLMGPUHighNoawq/squeezellmYesYesYes
ExLlamaV2GPU/CPULowNoGPTQYesYesYes
TensorRTGPUMediumNosome modelsYesYesYes
CandleGPU/CPULowNoNoYesYesYes
CTranslate2GPULowNoYesYesYesYes
TGIGPUMediumYesawq/eetq/gptq/bitsandbytesYesYesYes
llama-cpp***GPU/CPUHighNoGGUF/GPTQYesNoYes
lmdeployGPUMediumNoAWQYesYesYes
Deepspeed-FastGenGPULowNoNoYesYesYes
  • *PEFT Adapters: support to load seperate PEFT adapters(mostly lora).
  • **Compatibility: High: Compatible with most models; Medium: Compatible with some models; Low: Compatible with few models.
  • ***llama.cpp's Python binding: llama-cpp-python.

Benchmark

Hardware:

  • GPU: 1x NVIDIA RTX4090 24GB
  • CPU: Intel Core i9-13900K
  • Memory: 96GB

Software:

  • VM: WSL2 on Windows 11
  • Guest OS: Ubuntu 22.04
  • NVIDIA Driver Version: 536.67
  • CUDA Version: 12.2
  • PyTorch: 2.1.1

Model:

Data:

  • Prompt Length: 512 (with some random characters to avoid cache).
  • Max Tokens: 200.

Backend Benchmark

No Quantisation

BackendTPS@4QPS@4TPS@1QPS@1FTL@1
text-generation-webui Transformer40.390.1541.470.21344.61
text-generation-webui Transformer with flash-attention-258.300.2143.520.21341.39
text-generation-webui ExllamaV269.090.2650.710.27564.80
OpenLLM PyTorch60.790.2244.730.21514.55
TGI192.580.9059.680.2882.72
vLLM222.631.0862.690.3095.43
TensorRT-----
CTranslate2*-----
lmdeploy236.031.1567.860.3376.81
  • bs: Batch Size. bs=4 indicates the batch size is 4.

  • TPS: Tokens Per Second.

  • QPS: Queries Per Second.

  • FTL: First Token Latency, measured in milliseconds. Applicable only in stream mode.

  • Encountered an error using CTranslate2 to convert Yi-6B-Chat. See details in the issue.

8Bit Quantisation

BackendTPS@4QPS@4TPS@1QPS@1FTL@1
TGI eetq 8bit293.081.4188.080.4263.69
TGI GPTQ 8bit-----
OpenLLM PyTorch AutoGPTQ 8bit49.80.1729.540.14930.16
  • bitsandbytes is very slow (int8 6.8 tokens/s), so we don't benchmark it.
  • eetq-8bit doesn't require specific model.
  • TGI GPTQ 8bit load failed: Server error: module 'triton.compiler' has no attribute 'OutOfResources'
    • TGI GPTQ bit use exllama or triton backend.

4Bit Quantisation

BackendTPS@4QPS@4TPS@1QPS@1FTL@1
TGI AWQ 4bit336.471.61102.000.4894.84
vLLM AWQ 4bit29.030.1437.480.193711.0
text-generation-webui llama-cpp GGUF 4bit67.630.3756.650.34331.57