Experiments on Inferencing Speed

February 9, 2021

Experiment 1: All runs get almost exactly the same outputs

Machine 1: 2 GTX 1080 TI, i7, NVMe

Machine 2: 3 GTX 1080 TI + 1 TITAN X, E5, NVMe

Machine 3: 4 RTX 2080 TI, i9-9900X, SSD

Each run is conducted with no other programs running, except for the runs marked with *.

Machine 1

RunType	# Image	Image Size	# GPU Used	runtime (s)	GPU Average Utilization	Per GPU FPS
1	206268	1920x1080	2	41190.9	65.3%	2.50
2	206268	1920x1080	2	37920.9	62.0%	2.72
3	206268	1920x1080	2	31494.2	54.8%	3.27
2	206268	1920x1080	1	75652.2	70.1%	2.73
3	206268	1920x1080	1	57484.9	68.8%	3.59

Machine 2

RunType	# Image	Image Size	# GPU Used	runtime (s)	GPU Average Utilization	Per GPU FPS
2	206268	1920x1080	4	29354.4	33.8%	1.76
3	206268	1920x1080	4	26454.0	23.5%	1.95
3	206268	1920x1080	2	35975.7	38.0%	2.87
2	206268	1920x1080	1	74065.0	41.4%	2.78
3	206268	1920x1080	1	57880.8	54.8%	3.56
2	206268	1920x1080	4 / 1*	17556.3	46.2%	2.94
3	206268	1920x1080	4 / 1*	14552.5	52.3%	3.54
5	206268	1920x1080	4 / 1*	17027.5	53.2%	3.03
6	206268	1920x1080	4 / 1*	13433.3	61.7%	3.84

Machine 3

RunType	# Image	Image Size	# GPU Used	runtime (s)	GPU Average Utilization	Per GPU FPS
2	206268	1920x1080	1	57834.6	61.2%	3.57
3	206268	1920x1080	1	43510.9	61.2%	4.74
2	206268	1920x1080	4 / 1*	14143.3	62.6%	3.65
3	206268	1920x1080	4 / 1*	10676.3	65.2%	4.83

TODO: Add input queue mechanism to improve GPU utilization.
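One way such an input queue could look is sketched below. It is only an illustration and not the repository's implementation: the names `prefetch`, `run_inference`, `load_fn` and `infer_fn` are hypothetical. A background thread loads images into a bounded queue while the main thread runs the model, so the GPU does not sit idle waiting for image decoding.

```python
import queue
import threading


def prefetch(load_fn, items, max_prefetch=8):
    """Yield load_fn(item) for each item, loading in a background thread.

    The bounded queue keeps at most `max_prefetch` loaded items in memory.
    Output order matches input order because a single worker is used.
    """
    q = queue.Queue(maxsize=max_prefetch)
    sentinel = object()  # marks the end of the stream

    def worker():
        try:
            for item in items:
                q.put(load_fn(item))
        finally:
            # Always signal completion so the consumer never blocks forever.
            q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        out = q.get()
        if out is sentinel:
            break
        yield out


def run_inference(paths, load_fn, infer_fn):
    """Run infer_fn on every loaded input while loading overlaps with inference."""
    return [infer_fn(img) for img in prefetch(load_fn, paths)]
```

Here `load_fn` would be the CPU-side image reading and resizing, and `infer_fn` the model's forward pass. A single worker preserves ordering; using several workers would need extra bookkeeping to keep results aligned with inputs.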

RunType
1	tf 1.10 (CUDA 9, cudnn 7.1), Variable Model
2	tf 1.13 (CUDA 10.0 cudnn 7.4), Variable Model
3	tf 1.13 (CUDA 10.0 cudnn 7.4), Frozen Graph (.pb)
4	tf 1.13 (CUDA 10.0 cudnn 7.4), Frozen Graph (.pb) -> TensorRT Optimized
5	tf 1.14.0 (CUDA 10.0 cudnn 7.4), Variable Model
6	tf 1.14.0 (CUDA 10.0 cudnn 7.4), Frozen Graph (.pb)

4 / 1* means that I ran 4 single-GPU jobs in parallel, the same way you would run them in the full system.
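The Per GPU FPS column in the tables above is consistent with dividing the image count by the runtime and by the number of GPUs used (4 for the 4 / 1* rows). A small sketch of that arithmetic, with `per_gpu_fps` as a hypothetical helper name:

```python
def per_gpu_fps(num_images, runtime_s, num_gpus):
    """Frames per second processed by each GPU, averaged over the whole run."""
    return num_images / runtime_s / num_gpus


# Machine 1, RunType 1, 2 GPUs
print(f"{per_gpu_fps(206268, 41190.9, 2):.2f}")
# Machine 2, RunType 2, 4 single-GPU jobs in parallel (4 / 1*)
print(f"{per_gpu_fps(206268, 17556.3, 4):.2f}")
```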

Conclusions:

TF v1.10 -> v1.13 (CUDA 9 & cuDNN v7.1 -> CUDA 10 & cuDNN v7.4): ~9% faster
Using a frozen graph: ~30% faster
GTX 1080 TI -> RTX 2080 TI: ~30% faster

Note that I didn't have time to run these experiments repeatedly, so I expect the numbers to have large variance.
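The percentages in the conclusions can be roughly reproduced from the Per GPU FPS columns. Which rows the conclusions were derived from is not stated, so the pairs below are an assumption: they compare the Machine 1 rows and the single-GPU Machine 3 row, and `speedup_pct` is a hypothetical helper.

```python
def speedup_pct(baseline_fps, new_fps):
    """Relative throughput gain of new_fps over baseline_fps, in percent."""
    return (new_fps / baseline_fps - 1.0) * 100.0


# TF 1.10 -> TF 1.13, Machine 1, 2 GPUs, variable model (RunType 1 vs 2)
print(round(speedup_pct(2.50, 2.72), 1))
# Variable model -> frozen graph, Machine 1, 1 GPU (RunType 2 vs 3)
print(round(speedup_pct(2.73, 3.59), 1))
# GTX 1080 TI -> RTX 2080 TI, 1 GPU, frozen graph (Machine 1 vs Machine 3)
print(round(speedup_pct(3.59, 4.74), 1))
```

Note that the last comparison also changes the CPU and the disk, so it is not a pure GPU comparison.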

Freezing the model

To freeze the model into a .pb file:

$ python main.py nothing nothing --mode pack --pack_model_path obj_v3.pb \
--load_from obj_v3_model/ --num_class 15 \
--diva_class3 --rpn_batch_size 256 --frcnn_batch_size 512 --rpn_test_post_nms_topk 1000 --is_fpn \
--use_dilation --max_size 1920 --short_edge_size 1080

Run testing on the v1-val set:

$ python main.py nothing v1-validate_frames.lst --mode forward --outbasepath \
obj_v3_val_output --num_class 15 --diva_class3 --max_size 1920 --short_edge_size \
1080 --gpu 2 --im_batch_size 2 --load_from obj_v3.pb  --is_load_from_pb --log

This assumes v1-validate_frames.lst contains the absolute paths of all images. The command writes one JSON file in COCO detection format per image to obj_v3_val_output/. The --log flag runs nvidia-smi every couple of seconds in a separate thread to record GPU utilization.
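For readers who want similar utilization logging in their own scripts, here is a minimal sketch of a background sampler. It is not the code behind --log: the class name `GpuUtilLogger` and its parameters are hypothetical, and it assumes nvidia-smi is on the PATH. The query function is injectable so the sampler can be tried without a GPU.

```python
import subprocess
import threading


def query_gpu_utilization():
    """Return the current utilization (%) of each GPU as a list of floats."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        universal_newlines=True)
    return [float(line) for line in out.strip().splitlines()]


class GpuUtilLogger:
    """Sample GPU utilization every `interval_s` seconds in a separate thread."""

    def __init__(self, interval_s=2.0, query_fn=query_gpu_utilization):
        self.interval_s = interval_s
        self.query_fn = query_fn
        self.samples = []  # one list of per-GPU values per sampling step
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append(self.query_fn())
            # wait() returns early when stop() is called
            self._stop.wait(self.interval_s)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def average(self):
        """Average utilization over all samples and all GPUs."""
        flat = [u for sample in self.samples for u in sample]
        return sum(flat) / len(flat) if flat else 0.0
```

Typical use would be to call `start()` before the inference loop, `stop()` after it, and then report `average()` alongside the runtime.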

Experiment 2: TensorRT Optimization (TF v1.14.0)

Use TensorRT to optimize the frozen graph (I tried FP32 and FP16):

$ python tensorrt_optimize.py obj_v5_tfv1.14.0.pb obj_v5_tfv1.14.0.tensorRT.fp32.pb --precision_mode FP32

Inferencing (run 4 programs in parallel):

$ python main.py nothing v1-validate_frames.1.lst --mode forward --outbasepath \
obj_v5_val_output_TRT_FP32 --num_class 15 --diva_class3 --max_size 1920 --short_edge_size \
1080 --gpu 1 --im_batch_size 1 --gpuid_start 0 --load_from obj_v5_tfv1.14.0.tensorRT.fp32.pb  --is_load_from_pb --log
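The four parallel jobs could be launched from a small driver script such as the sketch below. Several things here are assumptions and not confirmed by the command above: that each job reads its own shard of the frame list, that shard files are numbered from 1 (suggested by v1-validate_frames.1.lst), and that --gpuid_start selects which GPU a single-GPU job uses. The helper names are hypothetical.

```python
import subprocess


def shard_round_robin(items, num_shards):
    """Split items into num_shards lists of near-equal size."""
    return [items[i::num_shards] for i in range(num_shards)]


def build_command(shard_lst, out_dir, gpu_id, model_pb):
    """Build the forward-mode command for one single-GPU job."""
    return [
        "python", "main.py", "nothing", shard_lst,
        "--mode", "forward", "--outbasepath", out_dir,
        "--num_class", "15", "--diva_class3",
        "--max_size", "1920", "--short_edge_size", "1080",
        "--gpu", "1", "--im_batch_size", "1",
        "--gpuid_start", str(gpu_id),  # assumed to pick the GPU for this job
        "--load_from", model_pb, "--is_load_from_pb", "--log",
    ]


def build_all_commands(num_gpus, out_dir, model_pb):
    """One command per GPU; shard k (1-based, assumed) runs on GPU k - 1."""
    return [
        build_command("v1-validate_frames.%d.lst" % (gpu_id + 1),
                      out_dir, gpu_id, model_pb)
        for gpu_id in range(num_gpus)
    ]


def launch_all(commands):
    """Start all jobs at once and wait for them; returns the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]
```

`shard_round_robin` can be used to write the per-job .lst files from the full frame list before calling `launch_all(build_all_commands(4, "obj_v5_val_output_TRT_FP32", "obj_v5_tfv1.14.0.tensorRT.fp32.pb"))`.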

I haven't explored --maximum_cached_engines or INT8 mode yet. Ideally, these experiments should be repeated a few times.

Experiments:

RunType	Model: obj_v5
1	tf 1.14.0 (CUDA 10.0 cudnn 7.4), Frozen Graph (.pb)
2	tf 1.14.0 (CUDA 10.0 cudnn 7.4), Frozen Graph (.pb) -> TRT FP32
3	tf 1.14.0 (CUDA 10.0 cudnn 7.4), Frozen Graph (.pb) -> TRT FP16

Machine 2

RunType	# Image	Image Size	# GPU Used	runtime (s)	GPU Average Utilization	Per GPU FPS
1	206268	1920x1080	4 / 1*	13195.3	62.97%	3.91
2	206268	1920x1080	4 / 1*	13125.5	61.02%	3.93
3	206268	1920x1080	4 / 1*	13261.9	52.63%	3.89

12/2020, multiple-image batch processing

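We test multiple-image batch processing with the FPN-ResNet50 model using these commands. The machine has a GTX 1070 TI, an i7-8700K, and an SSD. The test input is a single 5-minute video, and we run detection and tracking at 1280x720 resolution with frame_gap=8. In the RunType column, b is the image batch size, "var" is the variable model, "frozen" is the frozen graph, and "m" marks runs with multi-threading.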

RunType	Time	GPU Median Utilization	GPU Average Utilization
b=1 var	06:21	53.00%	54.24%
b=1 frozen	05:06	34.50%	36.30%
b=1 frozen,partial	03:43	57.00%	49.55%
b=4 var	04:35	50.00%	52.48%
b=4 frozen	03:18	64.00%	63.32%
b=4 frozen,partial	03:15	42.00%	48.05%
b=8 var	04:27	00.00%	2.35%
b=8 frozen	03:12	62.00%	53.37%
b=8 frozen,partial	03:07	75.50%	62.11%

Now with multi-threading:

RunType	Time	GPU Median Utilization	GPU Average Utilization
b=4 var,m	03:41	100.00%	83.24%
b=4 frozen,m	02:22	70.50%	71.50%
b=4 frozen,partial,m	02:14	76.00%	75.46%
b=8 var,m	03:29	100.00%	73.45%
b=8 frozen,m	02:14	67.00%	63.69%
b=8 frozen,partial,m	02:08	99.00%	83.83%

Here is the same test on a machine with a better GPU but a worse CPU: TITAN Xp, i5-4460, SSD. A 0% median GPU utilization means the bottleneck is in CPU processing.

RunType	Time	GPU Median Utilization	GPU Average Utilization
b=1 var	05:51	50.00%	44.15%
b=1 frozen	04:58	34.00%	29.55%
b=1 frozen,partial	03:17	42.00%	40.58%
b=4 var	04:33	26.00%	37.00%
b=4 frozen	03:17	12.00%	25.26%
b=4 frozen,partial	03:09	00.00%	17.78%
b=8 var	04:29	8.00%	40.88%
b=8 frozen	03:15	00.00%	34.32%
b=8 frozen,partial	03:07	2.50%	29.11%
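To see why the median can be 0% while the average is well above it, consider a made-up utilization trace in which the GPU is idle most of the time and busy only in short bursts. The sample values below are hypothetical and are not taken from these runs.

```python
from statistics import mean, median


def summarize(samples):
    """Return (median, mean) GPU utilization for a list of percent samples."""
    return median(samples), mean(samples)


# Hypothetical trace: the GPU waits on CPU preprocessing for 7 of 10 samples.
trace = [0, 0, 0, 0, 0, 0, 95, 98, 100, 0]
med, avg = summarize(trace)
print(f"median={med:.2f}% average={avg:.2f}%")
```

When more than half of the samples are zero, the median is zero regardless of how busy the GPU is during the bursts, which is the pattern expected when the CPU cannot feed batches fast enough.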

Now with multi-threading:

RunType	Time	GPU Median Utilization	GPU Average Utilization
b=4 var,m	03:08	38.00%	46.83%
b=4 frozen,m	01:53	90.00%	76.00%
b=4 frozen,partial,m	01:40	66.00%	64.56%
b=8 var,m	03:08	31.00%	46.78%
b=8 frozen,m	01:51	46.00%	45.90%
b=8 frozen,partial,m	01:45	94.00%	81.50%
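The end-to-end gain on each machine can be read off the Time columns by converting mm:ss to seconds. The sketch below compares the slowest baseline (b=1 var) with the fastest configuration in the multi-threading tables (b=8 frozen,partial,m); the helper names are hypothetical.

```python
def to_seconds(mmss):
    """Convert a 'mm:ss' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def speedup(baseline, new):
    """How many times faster `new` is than `baseline` (both 'mm:ss')."""
    return to_seconds(baseline) / to_seconds(new)


# GTX 1070 TI machine: b=1 var vs b=8 frozen,partial,m
print(round(speedup("06:21", "02:08"), 2))
# TITAN Xp machine: b=1 var vs b=8 frozen,partial,m
print(round(speedup("05:51", "01:45"), 2))
```

Both machines end up roughly 3x faster than the single-image variable-model baseline, based on these single runs.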

Here we compare the machine-wise utilization graph for "b=1 frozen,partial" and "multi-threading b=8 frozen,partial" on the GTX 1070 TI machine:

[Figure: utilization graph for b=1 frozen,partial]

[Figure: utilization graph for b=8 frozen,partial,m]