Experiments on Inferencing Speed

February 9, 2021

Experiment 1: All runs get almost exactly the same outputs

Machine 1: 2 GTX 1080 TI, i7, nvme

Machine 2: 3 GTX 1080 TI + 1 TITAN X, E5, nvme

Machine 3: 4 RTX 2080 TI , i9-9900X, SSD

Each run was conducted with no other programs running, except for the runs marked with *.

Machine 1

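| RunType | # Images | Image Size | # GPU Used | Runtime (s) | GPU Average Utilization | Per GPU FPS |
|---|---|---|---|---|---|---|
| 1 | 206268 | 1920x1080 | 2 | 41190.9 | 65.3% | 2.50 |
| 2 | 206268 | 1920x1080 | 2 | 37920.9 | 62.0% | 2.72 |
| 3 | 206268 | 1920x1080 | 2 | 31494.2 | 54.8% | 3.27 |
| 2 | 206268 | 1920x1080 | 1 | 75652.2 | 70.1% | 2.73 |
| 3 | 206268 | 1920x1080 | 1 | 57484.9 | 68.8% | 3.59 |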

Machine 2

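| RunType | # Images | Image Size | # GPU Used | Runtime (s) | GPU Average Utilization | Per GPU FPS |
|---|---|---|---|---|---|---|
| 2 | 206268 | 1920x1080 | 4 | 29354.4 | 33.8% | 1.76 |
| 3 | 206268 | 1920x1080 | 4 | 26454.0 | 23.5% | 1.95 |
| 3 | 206268 | 1920x1080 | 2 | 35975.7 | 38.0% | 2.87 |
| 2 | 206268 | 1920x1080 | 1 | 74065.0 | 41.4% | 2.78 |
| 3 | 206268 | 1920x1080 | 1 | 57880.8 | 54.8% | 3.56 |
| 2 | 206268 | 1920x1080 | 4 / 1* | 17556.3 | 46.2% | 2.94 |
| 3 | 206268 | 1920x1080 | 4 / 1* | 14552.5 | 52.3% | 3.54 |
| 5 | 206268 | 1920x1080 | 4 / 1* | 17027.5 | 53.2% | 3.03 |
| 6 | 206268 | 1920x1080 | 4 / 1* | 13433.3 | 61.7% | 3.84 |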

Machine 3

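| RunType | # Images | Image Size | # GPU Used | Runtime (s) | GPU Average Utilization | Per GPU FPS |
|---|---|---|---|---|---|---|
| 2 | 206268 | 1920x1080 | 1 | 57834.6 | 61.2% | 3.57 |
| 3 | 206268 | 1920x1080 | 1 | 43510.9 | 61.2% | 4.74 |
| 2 | 206268 | 1920x1080 | 4 / 1* | 14143.3 | 62.6% | 3.65 |
| 3 | 206268 | 1920x1080 | 4 / 1* | 10676.3 | 65.2% | 4.83 |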

TODO: Add input queue mechanism to improve GPU utilization.

| RunType | Configuration |
|---|---|
| 1 | tf 1.10 (CUDA 9, cudnn 7.1), Variable Model |
| 2 | tf 1.13 (CUDA 10.0 cudnn 7.4), Variable Model |
| 3 | tf 1.13 (CUDA 10.0 cudnn 7.4), Frozen Graph (.pb) |
| 4 | tf 1.13 (CUDA 10.0 cudnn 7.4), Frozen Graph (.pb) -> TensorRT Optimized |
| 5 | tf 1.14.0 (CUDA 10.0 cudnn 7.4), Variable Model |
| 6 | tf 1.14.0 (CUDA 10.0 cudnn 7.4), Frozen Graph (.pb) |

4 / 1* means that I ran 4 single-GPU jobs in parallel, the same way you would run them in the full system.
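The Per GPU FPS column is not defined explicitly, but the table values are consistent with dividing the number of images by the wall-clock runtime and by the number of GPUs in use. For the 4 / 1* rows this assumes the images were split across the four parallel jobs. A minimal sketch of that arithmetic, where `per_gpu_fps` is a hypothetical helper name:

```python
def per_gpu_fps(num_images, runtime_s, num_gpus):
    """Images processed per second, per GPU (assumed definition of the column)."""
    return num_images / runtime_s / num_gpus


# Machine 1, RunType 1: 206268 images on 2 GPUs in 41190.9 s
machine1_run1 = round(per_gpu_fps(206268, 41190.9, 2), 2)

# Machine 2, RunType 2, 4 / 1*: four single-GPU jobs, 17556.3 s wall-clock
machine2_run2_parallel = round(per_gpu_fps(206268, 17556.3, 4), 2)

print(machine1_run1, machine2_run2_parallel)
```

Both values match the corresponding table entries (2.50 and 2.94).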

Conclusions:

  • TF v1.10 -> v1.13 (CUDA 9 & cuDNN v7.1 -> CUDA 10 & cuDNN v7.4) ~ +9% faster
  • Use frozen graph ~ +30% faster
  • GTX 1080 TI -> RTX 2080 TI ~ +30% faster

Note that I didn't have time to run these experiments repeatedly, so I expect the numbers to have large variance.
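The percentages above can be roughly reproduced from the Per GPU FPS column. Which rows the conclusions were derived from is not stated, so the pairs below are an assumption chosen for illustration, and `speedup_pct` is a hypothetical helper:

```python
def speedup_pct(fps_before, fps_after):
    """Relative throughput gain in percent."""
    return (fps_after / fps_before - 1.0) * 100.0


# TF 1.10 -> TF 1.13, variable model, Machine 1 with 2 GPUs
tf_upgrade = round(speedup_pct(2.50, 2.72), 1)

# Variable model -> frozen graph, TF 1.13, Machine 1 with 1 GPU
frozen_graph = round(speedup_pct(2.73, 3.59), 1)

# GTX 1080 TI (Machine 1) -> RTX 2080 TI (Machine 3), frozen graph, 1 GPU
gpu_upgrade = round(speedup_pct(3.59, 4.74), 1)

print(tf_upgrade, frozen_graph, gpu_upgrade)
```

Other row pairs give somewhat different numbers (for example, the frozen-graph gain on Machine 1 with 2 GPUs is closer to 20%), which fits the caveat about variance.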

Freezing the model

To freeze the model into a .pb file:

$ python main.py nothing nothing --mode pack --pack_model_path obj_v3.pb \
--load_from obj_v3_model/ --num_class 15 \
--diva_class3 --rpn_batch_size 256 --frcnn_batch_size 512 --rpn_test_post_nms_topk 1000 --is_fpn \
--use_dilation --max_size 1920 --short_edge_size 1080

Run testing on the v1-val set:

$ python main.py nothing v1-validate_frames.lst --mode forward --outbasepath \
obj_v3_val_output --num_class 15 --diva_class3 --max_size 1920 --short_edge_size \
1080 --gpu 2 --im_batch_size 2 --load_from obj_v3.pb  --is_load_from_pb --log

This assumes v1-validate_frames.lst contains the absolute paths of all images. The command writes one JSON file in COCO detection format for each image to obj_v3_val_output/. The --log flag runs nvidia-smi every couple of seconds in a separate thread to record GPU utilization.
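How --log is implemented inside main.py is not shown here. As a rough sketch of the idea only, a background thread can poll nvidia-smi at a fixed interval and keep the samples for later averaging. `GpuUtilLogger` and `query_gpu_utilization` are hypothetical names, and the 2-second interval is an assumption:

```python
import statistics
import subprocess
import threading


def query_gpu_utilization():
    """Return the current utilization (%) of each visible GPU via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [float(line) for line in out.splitlines() if line.strip()]


class GpuUtilLogger:
    """Sample GPU utilization on a background thread until stop() is called."""

    def __init__(self, interval_s=2.0, sampler=query_gpu_utilization):
        self.interval_s = interval_s
        self.sampler = sampler
        self.samples = []  # one list of per-GPU values per poll
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append(self.sampler())
            # Sleep for the interval, but wake up early if stop() is called.
            self._stop.wait(self.interval_s)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def summary(self):
        """Average and median utilization over all samples and GPUs."""
        flat = [value for poll in self.samples for value in poll]
        return {"average": statistics.mean(flat),
                "median": statistics.median(flat)}
```

Typical use would be to call `start()` before the inference loop, `stop()` after it, and then read `summary()`. The `sampler` argument exists only so the class can be exercised on a machine without a GPU.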

Experiment 2: TensorRT Optimization (TF v1.14.0)

Use TensorRT to optimize the frozen graph (I tried FP32 and FP16):

$ python tensorrt_optimize.py obj_v5_tfv1.14.0.pb obj_v5_tfv1.14.0.tensorRT.fp32.pb --precision_mode FP32

Inferencing (run 4 programs in parallel):

$ python main.py nothing v1-validate_frames.1.lst --mode forward --outbasepath \
obj_v5_val_output_TRT_FP32 --num_class 15 --diva_class3 --max_size 1920 --short_edge_size \
1080 --gpu 1 --im_batch_size 1 --gpuid_start 0 --load_from obj_v5_tfv1.14.0.tensorRT.fp32.pb  --is_load_from_pb --log
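Only the command for the first of the four parallel jobs is shown above. The sketch below builds one command per job under two assumptions that are not confirmed by the text: the image list is split into files named v1-validate_frames.1.lst through v1-validate_frames.4.lst, and job i uses --gpuid_start i-1. The helper names are hypothetical; all flags are copied from the command above.

```python
import shlex
import subprocess

# Flags shared by every job, copied from the command above.
BASE_ARGS = (
    "--mode forward --num_class 15 --diva_class3 --max_size 1920 "
    "--short_edge_size 1080 --gpu 1 --im_batch_size 1 "
    "--load_from obj_v5_tfv1.14.0.tensorRT.fp32.pb --is_load_from_pb --log"
)


def split_round_robin(paths, num_shards):
    """Split a list of image paths into num_shards roughly equal parts."""
    return [paths[i::num_shards] for i in range(num_shards)]


def build_commands(num_jobs, list_prefix="v1-validate_frames",
                   out_dir="obj_v5_val_output_TRT_FP32"):
    """Build one single-GPU command per job (assumed naming scheme)."""
    commands = []
    for job in range(num_jobs):
        list_file = f"{list_prefix}.{job + 1}.lst"  # assumed shard file name
        cmd = ["python", "main.py", "nothing", list_file,
               "--outbasepath", out_dir,
               "--gpuid_start", str(job)]  # assumed: one GPU id per job
        commands.append(cmd + shlex.split(BASE_ARGS))
    return commands


def launch_all(commands):
    """Start all jobs at once and wait for every one to finish."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


for cmd in build_commands(4):
    print(" ".join(cmd))
```

Running the block only prints the commands; `launch_all(build_commands(4))` would actually start them.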

I haven't explored --maximum_cached_engines or INT8 mode yet. Ideally, these experiments should be repeated a couple of times.

Experiments:

| RunType | Model: obj_v5 |
|---|---|
| 1 | tf 1.14.0 (CUDA 10.0 cudnn 7.4), Frozen Graph (.pb) |
| 2 | tf 1.14.0 (CUDA 10.0 cudnn 7.4), Frozen Graph (.pb) -> TRT FP32 |
| 3 | tf 1.14.0 (CUDA 10.0 cudnn 7.4), Frozen Graph (.pb) -> TRT FP16 |

Machine 2

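| RunType | # Images | Image Size | # GPU Used | Runtime (s) | GPU Average Utilization | Per GPU FPS |
|---|---|---|---|---|---|---|
| 1 | 206268 | 1920x1080 | 4 / 1* | 13195.3 | 62.97% | 3.91 |
| 2 | 206268 | 1920x1080 | 4 / 1* | 13125.5 | 61.02% | 3.93 |
| 3 | 206268 | 1920x1080 | 4 / 1* | 13261.9 | 52.63% | 3.89 |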

12/2020, multiple-image batch processing

We test multiple-image batch processing using these commands for the FPN-ResNet50 model. The machine has a GTX 1070 TI, an i7-8700K, and an SSD. The test input is a single 5-minute video, and we test detection and tracking at 1280x720 resolution with frame_gap=8.
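To give a sense of the workload, the sketch below counts the frames and batches involved. It assumes that frame_gap=8 means every 8th frame is processed and that the video runs at 30 fps; neither is stated here, and the helper names are hypothetical.

```python
def sampled_frame_indices(num_frames, frame_gap):
    """Indices of the frames that are processed when skipping by frame_gap."""
    return list(range(0, num_frames, frame_gap))


def make_batches(items, batch_size):
    """Group items into consecutive batches; the last one may be smaller."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


num_frames = 5 * 60 * 30  # 5-minute video at an assumed 30 fps
frames = sampled_frame_indices(num_frames, frame_gap=8)

for b in (1, 4, 8):
    batches = make_batches(frames, b)
    print(f"b={b}: {len(frames)} frames in {len(batches)} batches")
```

Under these assumptions, larger b means fewer, larger forward passes over the same set of frames.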

| RunType | Time | GPU Median Utilization | GPU Average Utilization |
|---|---|---|---|
| b=1 var | 06:21 | 53.00% | 54.24% |
| b=1 frozen | 05:06 | 34.50% | 36.30% |
| b=1 frozen,partial | 03:43 | 57.00% | 49.55% |
| b=4 var | 04:35 | 50.00% | 52.48% |
| b=4 frozen | 03:18 | 64.00% | 63.32% |
| b=4 frozen,partial | 03:15 | 42.00% | 48.05% |
| b=8 var | 04:27 | 00.00% | 2.35% |
| b=8 frozen | 03:12 | 62.00% | 53.37% |
| b=8 frozen,partial | 03:07 | 75.50% | 62.11% |

Now with multi-threading:

| RunType | Time | GPU Median Utilization | GPU Average Utilization |
|---|---|---|---|
| b=4 var,m | 03:41 | 100.00% | 83.24% |
| b=4 frozen | 02:22 | 70.50% | 71.50% |
| b=4 frozen,partial,m | 02:14 | 76.00% | 75.46% |
| b=8 var,m | 03:29 | 100.00% | 73.45% |
| b=8 frozen,m | 02:14 | 67.00% | 63.69% |
| b=8 frozen,partial,m | 02:08 | 99.00% | 83.83% |
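What the multi-threaded mode does internally is not described here. One common way to raise GPU utilization, shown only as an illustrative sketch, is to prepare batches on a producer thread and hand them to the inference loop through a bounded queue, so CPU preprocessing overlaps with GPU work. `run_pipelined`, `preprocess`, and `infer` are hypothetical names:

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input stream


def run_pipelined(batches, preprocess, infer, max_prefetch=4):
    """Overlap preprocessing (producer thread) with inference (this thread)."""
    ready = queue.Queue(maxsize=max_prefetch)

    def producer():
        try:
            for batch in batches:
                # Blocks when the queue is full, so memory use stays bounded.
                ready.put(preprocess(batch))
        finally:
            # Always signal completion so the consumer cannot hang.
            ready.put(_DONE)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        item = ready.get()
        if item is _DONE:
            break
        results.append(infer(item))
    worker.join()
    return results


# Toy stand-ins: "preprocessing" doubles each value, "inference" sums a batch.
outputs = run_pipelined(
    batches=[[1, 2], [3, 4]],
    preprocess=lambda batch: [x * 2 for x in batch],
    infer=sum,
)
print(outputs)
```

In a real pipeline `preprocess` would decode and resize frames and `infer` would run the model on the prepared batch.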

Here is the same test on a machine with a better GPU but a worse CPU: TITAN Xp, i5-4460, SSD. A median GPU utilization of 0% means the bottleneck is CPU processing.

| RunType | Time | GPU Median Utilization | GPU Average Utilization |
|---|---|---|---|
| b=1 var | 05:51 | 50.00% | 44.15% |
| b=1 frozen | 04:58 | 34.00% | 29.55% |
| b=1 frozen,partial | 03:17 | 42.00% | 40.58% |
| b=4 var | 04:33 | 26.00% | 37.00% |
| b=4 frozen | 03:17 | 12.00% | 25.26% |
| b=4 frozen,partial | 03:09 | 00.00% | 17.78% |
| b=8 var | 04:29 | 8.00% | 40.88% |
| b=8 frozen | 03:15 | 00.00% | 34.32% |
| b=8 frozen,partial | 03:07 | 2.50% | 29.11% |
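A median of 0% alongside a nonzero average, as in some rows above, is what you would expect when the GPU sits idle in more than half of the samples and runs in short bursts the rest of the time. The trace below is made up purely to illustrate this; it is not measured data:

```python
import statistics

# Hypothetical utilization samples (%): idle while the CPU prepares a batch,
# then a short burst while the GPU processes it.
trace = [0, 0, 0, 0, 0, 95, 100, 98]

median_util = statistics.median(trace)
average_util = statistics.mean(trace)
idle_fraction = sum(1 for u in trace if u == 0) / len(trace)

print(median_util, average_util, idle_fraction)
```

Here the median is 0 because five of the eight samples are idle, while the average is still above 36%.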

Now with multi-threading:

| RunType | Time | GPU Median Utilization | GPU Average Utilization |
|---|---|---|---|
| b=4 var,m | 03:08 | 38.00% | 46.83% |
| b=4 frozen | 01:53 | 90.00% | 76.00% |
| b=4 frozen,partial,m | 01:40 | 66.00% | 64.56% |
| b=8 var,m | 03:08 | 31.00% | 46.78% |
| b=8 frozen,m | 01:51 | 46.00% | 45.90% |
| b=8 frozen,partial,m | 01:45 | 94.00% | 81.50% |

Here we compare the machine-wise utilization graph for "b=1 frozen,partial" and "multi-threading b=8 frozen,partial" on the GTX 1070 TI machine:

b=1 frozen,partial

b=8 frozen,partial,m