DashInfer-VLM Benchmark

December 16, 2024 · View on GitHub

This document presents performance data and reproduction steps for dashinfer-vlm, along with methods for comparing it to other open-source multimodal frameworks.

Evaluation Method

Download dataset

You need prepare conversations and images to run the model benchmarks. Here are public datasets from OpenGVLab/InternVL-Chat-V1-2-SFT-Data. The api benchmark will combine prompts and images into requests, then send them to the API endpoint using the OpenAI client.

Launch Server

vLLM:

vllm serve qwen/Qwen2-VL-2B-Instruct/ --allowed-local-media-path <image_folder> --limit-mm-per-prompt image=10

Note: to make sure the same output token length, we set ignore_eos=True in vllm/sampling_params.py

dashinfer-vlm (set VLM_BENCHMARK=1 to specify generated token lengths, i.e. ignore_eos=True):

VLM_BENCHMARK=1 dashinfer_vlm_serve --model qwen/Qwen2-VL-2B-Instruct --host 127.0.0.1

You could enable prefix and fp8 to accelerate inference. See full options using dashinfer_vlm_serve -h.

Run Benchmark

Using OpenAI client:

python benchmark_openai_api.py --prompt-file <json_file> \
--image-folder <image_folder> \
--req-nums 100 \
--batch-size 32 \
--image-nums-mean 3 \
--image-nums-range 1 \
--response-mean 120 \
--response-len-range 64

The above command generates 100 requests from prompt_file and images_folder. Each request contains 3 ±\pm 1 images and outputs 120 ±\pm 64 tokens. The requests are sent to the server in parallel using 8 threads.

To benchmark multi-turn conversations, add --multi-turn <num> in above command.

Model Performance

Qwen2-VL

The steps to reproduce the performance of Qwen2-VL 2B/7B:

  1. Launch api serve as the Evaluation Method section.
  2. Benchmark command:
python benchmark_openai_api.py --prompt-file docvqa_train_10k.jsonl \
--image-folder `pwd`/share_textvqa \
--req-nums 100 \
--image-nums-mean 3 \
--image-nums-range 1 \
--response-len-range 64 \
--response-mean 120 \
--batch-size <1 | 32> \
--multi-turn <0 | 2>

Qwen/Qwen2-VL-2B-Instruct on A100

Optimizations#DeviceDtypePrompt Tokens (avg)Output Tokens (avg)# Previous Conversations#ConcurrentvLLM (0.6.4) QPSDashInfer-VLM QPSSpeedup (dashinfer-vlm/vllm)
-1FP162807120010.800.780.98
-1FP1628071200322.293.851.68
-1FP1629531202322.216.052.73
enable_prefix_cache1FP162953120232x7.98x

Qwen/Qwen2-VL-7B-Instruct on A100

Optimizations#DeviceDtypePrompt Tokens (avg)Output Tokens (avg)# Previous Conversations#ConcurrentvLLM (0.6.4) QPSDashInfer-VLM QPSSpeedup (dashinfer-vlm/vllm)
-1FP162807120010.470.440.93
-1FP1628071200321.632.201.35
-1FP1629531202321.502.841.89
enable_prefix_cache1FP162953120232x4.68x

Qwen/Qwen2-VL-72B-Instruct on 4xA100

Optimizations#DeviceDtypePrompt Tokens (avg)Output Tokens (avg)# Previous Conversations#ConcurrentvLLM (0.6.4) QPSDashInfer-VLM QPSSpeedup (dashinfer-vlm/vllm)
-1FP162807120010.20.150.75
-1FP1628071200320.790.991.26
-1FP1629531202320.821.121.36
enable_prefix_cache1FP162953120232x1.99x