tkDNN

March 30, 2022

tkDNN is a Deep Neural Network library built with cuDNN and TensorRT primitives, designed specifically to run on NVIDIA Jetson boards. It has been tested on the TK1 (branch cudnn2), TX1, TX2, AGX Xavier, Nano, and several discrete GPUs. The main goal of this project is to exploit NVIDIA boards as much as possible to obtain the best inference performance. It does not support training.

If you use tkDNN in your research, please cite the following paper. For use in commercial solutions, write to gattifrancesco@hotmail.it and micaela.verucchi@unimore.it, or refer to https://hipert.unimore.it/.

@inproceedings{verucchi2020systematic,
  title={A Systematic Assessment of Embedded Neural Networks for Object Detection},
  author={Verucchi, Micaela and Brilli, Gianluca and Sapienza, Davide and Verasani, Mattia and Arena, Marco and Gatti, Francesco and Capotondi, Alessandro and Cavicchioli, Roberto and Bertogna, Marko and Solieri, Marco},
  booktitle={2020 25th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA)},
  volume={1},
  pages={937--944},
  year={2020},
  organization={IEEE}
}

What's new

20 July 2021

  • Support for semantic segmentation README
  • Support for 2D/3D object detection and tracking README

24 November 2021

  • Support for monocular depth estimation README

30 March 2022

  • Support for tkDNN on Windows 10/11 README

FPS Results

Inference FPS of YOLOv4 with tkDNN, averaged over 1200 images with the same dimensions as the input size, on:

  • RTX 2080Ti (CUDA 10.2, TensorRT 7.0.0, cuDNN 7.6.5);
  • AGX Xavier, Jetpack 4.3 (CUDA 10.0, cuDNN 7.6.3, TensorRT 6.0.1);
  • Xavier NX, Jetpack 4.4 (CUDA 10.2, cuDNN 8.0.0, TensorRT 7.1.0);
  • TX2, Jetpack 4.2 (CUDA 10.0, cuDNN 7.3.1, TensorRT 5.0.6);
  • Jetson Nano, Jetpack 4.4 (CUDA 10.2, cuDNN 8.0.0, TensorRT 7.1.0).
| Platform   | Network   | FP32, B=1 | FP32, B=4 | FP16, B=1 | FP16, B=4 | INT8, B=1 | INT8, B=4 |
|------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| RTX 2080Ti | yolo4 320 | 118.59    | 237.31    | 207.81    | 443.32    | 262.37    | 530.93    |
| RTX 2080Ti | yolo4 416 | 104.81    | 162.86    | 169.06    | 293.78    | 206.93    | 353.26    |
| RTX 2080Ti | yolo4 512 | 92.98     | 132.43    | 140.36    | 215.17    | 165.35    | 254.96    |
| RTX 2080Ti | yolo4 608 | 63.77     | 81.53     | 111.39    | 152.89    | 127.79    | 184.72    |
| AGX Xavier | yolo4 320 | 26.78     | 32.05     | 57.14     | 79.05     | 73.15     | 97.56     |
| AGX Xavier | yolo4 416 | 19.96     | 21.52     | 41.01     | 49.00     | 50.81     | 60.61     |
| AGX Xavier | yolo4 512 | 16.58     | 16.98     | 31.12     | 33.84     | 37.82     | 41.28     |
| AGX Xavier | yolo4 608 | 9.45      | 10.13     | 21.92     | 23.36     | 27.05     | 28.93     |
| Xavier NX  | yolo4 320 | 14.56     | 16.25     | 30.14     | 41.15     | 42.13     | 53.42     |
| Xavier NX  | yolo4 416 | 10.02     | 10.60     | 22.43     | 25.59     | 29.08     | 32.94     |
| Xavier NX  | yolo4 512 | 8.10      | 8.32      | 15.78     | 17.13     | 20.51     | 22.46     |
| Xavier NX  | yolo4 608 | 5.26      | 5.18      | 11.54     | 12.06     | 15.09     | 15.82     |
| TX2        | yolo4 320 | 11.18     | 12.07     | 15.32     | 16.31     | -         | -         |
| TX2        | yolo4 416 | 7.30      | 7.58      | 9.45      | 9.90      | -         | -         |
| TX2        | yolo4 512 | 5.96      | 5.95      | 7.22      | 7.23      | -         | -         |
| TX2        | yolo4 608 | 3.63      | 3.65      | 4.67      | 4.70      | -         | -         |
| Nano       | yolo4 320 | 4.23      | 4.55      | 6.14      | 6.53      | -         | -         |
| Nano       | yolo4 416 | 2.88      | 3.00      | 3.90      | 4.04      | -         | -         |
| Nano       | yolo4 512 | 2.32      | 2.34      | 3.02      | 3.04      | -         | -         |
| Nano       | yolo4 608 | 1.40      | 1.41      | 1.92      | 1.93      | -         | -         |
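The numbers above are averages over 1200 images. As a minimal illustrative sketch (not tkDNN code; the `mean_fps` helper and toy latencies are made up), mean FPS follows directly from per-image inference latencies:

```python
def mean_fps(latencies_s):
    """Average FPS over a run: images processed divided by total inference time."""
    return len(latencies_s) / sum(latencies_s)

# toy latencies in seconds (~10 ms per image on average -> ~100 FPS)
latencies = [0.010, 0.011, 0.009, 0.010]
print(round(mean_fps(latencies), 2))  # -> 100.0
```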

mAP Results

Results on COCO val 2017 (5k images), on RTX 2080Ti, with conf threshold = 0.001:

| Network              | tkDNN mAP(0.5:0.95) (CodaLab) | tkDNN AP50 (CodaLab) | darknet mAP(0.5:0.95) (CodaLab) | darknet AP50 (CodaLab) | tkDNN map mAP(0.5:0.95) | tkDNN map AP50 |
|----------------------|-------------------------------|----------------------|----------------------------------|------------------------|--------------------------|----------------|
| Yolov3 (416x416)     | 0.381 | 0.675 | 0.380 | 0.675 | 0.372 | 0.663 |
| yolov4 (416x416)     | 0.468 | 0.705 | 0.471 | 0.710 | 0.459 | 0.695 |
| yolov3tiny (416x416) | 0.096 | 0.202 | 0.096 | 0.201 | 0.093 | 0.198 |
| yolov4tiny (416x416) | 0.202 | 0.400 | 0.201 | 0.400 | 0.197 | 0.395 |
| Cnet-dla34 (512x512) | 0.366 | 0.543 | -     | -     | 0.361 | 0.535 |
| mv2SSD (512x512)     | 0.226 | 0.381 | -     | -     | 0.223 | 0.378 |

Dependencies

This branch works on every NVIDIA GPU that supports the following (latest tested) dependencies:

  • CUDA 11.3 (or >= 10.2)
  • cuDNN 8.2.1 (or >= 8.0.4)
  • TensorRT 8.0.3 (or >=7.2)
  • OpenCV 4.5.4 (or >=4)
  • cmake 3.21 (or >= 3.15)
  • yaml-cpp 0.5.2
  • eigen3 3.3.4
  • curl 7.58
sudo apt install libyaml-cpp-dev curl libeigen3-dev

About OpenCV

To compile and install OpenCV4 with contrib modules, use the script install_OpenCV4.sh. It will download and compile OpenCV in the Download folder.

bash scripts/install_OpenCV4.sh

If you have OpenCV compiled with CUDA and contrib modules and want to use it with tkDNN, pass the ENABLE_OPENCV_CUDA_CONTRIB=ON flag when compiling tkDNN. If the flag is not passed, network preprocessing is computed on the CPU; with the flag, it runs on the GPU, saving some milliseconds of end-to-end latency.

How to compile this repo

Build with cmake. On Ubuntu 18.04 a newer version of cmake is needed (3.15 or above). On both Linux and Windows, the CMAKE_BUILD_TYPE variable must be defined as either Release or Debug.

git clone https://github.com/ceccocats/tkDNN
cd tkDNN
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release .. 
make

Workflow

Steps needed to run inference on a custom neural network with tkDNN:

  • Build and train an NN model with your favorite framework.
  • Export the weights and biases of each layer and save them in a binary file (one per layer).
  • Export the outputs of each layer and save them in a binary file (one per layer).
  • Create a new test that defines the network layer by layer, using the exported weights, and use the saved outputs to check the results.
  • Run inference.
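Steps 2 and 3 boil down to dumping raw float buffers, one binary file per layer. A minimal sketch in Python (the file name and the `dump_bin`/`load_bin` helpers are illustrative, not part of tkDNN; real exports take the values from the trained model):

```python
import struct

def dump_bin(path, values):
    # write a flat little-endian float32 buffer, one file per layer
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))

def load_bin(path):
    # read the buffer back, e.g. to check layer outputs against the framework
    with open(path, "rb") as f:
        data = f.read()
    return list(struct.unpack(f"<{len(data) // 4}f", data))

# toy layer: in a real export these come from the trained model
weights = [0.5, -1.25, 2.0]
bias = [0.125]
dump_bin("layer0.bin", weights + bias)
```

The same pattern applies to the per-layer outputs of step 3, which the new test then compares against tkDNN's results.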

Exporting weights

For specific details on how to export weights see HERE.

Run the demos

For specific details on how to run:

  • 2D object detection demos, details on FP16, INT8 and batching see HERE.
  • segmentation demos see HERE.
  • monocular depth estimation see HERE.
  • 2D/3D object detection and tracking demos see HERE.
  • mAP demo to evaluate 2D object detectors see HERE.

[demo GIF]

tkDNN on Windows 10 or Windows 11

For specific details on how to run tkDNN on Windows 10/11 see HERE.

Existing tests and supported networks

| Test Name             | Network                              | Dataset     | N Classes | Input size | Weights        |
|-----------------------|--------------------------------------|-------------|-----------|------------|----------------|
| yolo                  | YOLO v2 [1]                          | COCO 2014   | 80        | 608x608    | weights        |
| yolo_224              | YOLO v2 [1]                          | COCO 2014   | 80        | 224x224    | weights        |
| yolo_berkeley         | YOLO v2 [1]                          | BDD100K     | 10        | 416x736    | weights        |
| yolo_relu             | YOLO v2 (with ReLU, not Leaky) [1]   | COCO 2014   | 80        | 416x416    | weights        |
| yolo_tiny             | YOLO v2 tiny [1]                     | COCO 2014   | 80        | 416x416    | weights        |
| yolo_voc              | YOLO v2 [1]                          | VOC         | 21        | 416x416    | weights        |
| yolo3                 | YOLO v3 [2]                          | COCO 2014   | 80        | 416x416    | weights        |
| yolo3_512             | YOLO v3 [2]                          | COCO 2017   | 80        | 512x512    | weights        |
| yolo3_berkeley        | YOLO v3 [2]                          | BDD100K     | 10        | 320x544    | weights        |
| yolo3_coco4           | YOLO v3 [2]                          | COCO 2014   | 4         | 416x416    | weights        |
| yolo3_flir            | YOLO v3 [2]                          | FREE FLIR   | 3         | 320x544    | weights        |
| yolo3_tiny            | YOLO v3 tiny [2]                     | COCO 2014   | 80        | 416x416    | weights        |
| yolo3_tiny512         | YOLO v3 tiny [2]                     | COCO 2017   | 80        | 512x512    | weights        |
| dla34                 | Deep Layer Aggregation (DLA) 34 [3]  | COCO 2014   | 80        | 224x224    | weights        |
| dla34_cnet            | Centernet (DLA34 backend) [4]        | COCO 2017   | 80        | 512x512    | weights        |
| mobilenetv2ssd        | Mobilenet v2 SSD Lite [5]            | VOC         | 21        | 300x300    | weights        |
| mobilenetv2ssd512     | Mobilenet v2 SSD Lite [5]            | COCO 2017   | 81        | 512x512    | weights        |
| resnet101             | Resnet 101 [6]                       | COCO 2014   | 80        | 224x224    | weights        |
| resnet101_cnet        | Centernet (Resnet101 backend) [4]    | COCO 2017   | 80        | 512x512    | weights        |
| csresnext50-panet-spp | Cross Stage Partial Network [7]      | COCO 2014   | 80        | 416x416    | weights        |
| yolo4                 | Yolov4 [8]                           | COCO 2017   | 80        | 416x416    | weights        |
| yolo4_320             | Yolov4 [8]                           | COCO 2017   | 80        | 320x320    | weights        |
| yolo4_512             | Yolov4 [8]                           | COCO 2017   | 80        | 512x512    | weights        |
| yolo4_608             | Yolov4 [8]                           | COCO 2017   | 80        | 608x608    | weights        |
| yolo4_berkeley        | Yolov4 [8]                           | BDD100K     | 10        | 544x320    | weights        |
| yolo4tiny             | Yolov4 tiny [9]                      | COCO 2017   | 80        | 416x416    | weights        |
| yolo4x                | Yolov4x-mish [9]                     | COCO 2017   | 80        | 640x640    | weights        |
| yolo4tiny_512         | Yolov4 tiny [9]                      | COCO 2017   | 80        | 512x512    | weights        |
| yolo4x-cps            | Scaled Yolov4 [10]                   | COCO 2017   | 80        | 512x512    | weights        |
| shelfnet              | ShelfNet18_realtime [11]             | Cityscapes  | 19        | 1024x1024  | weights        |
| shelfnet_berkeley     | ShelfNet18_realtime [11]             | DeepDrive   | 20        | 1024x1024  | weights        |
| dla34_cnet3d          | Centernet3D (DLA34 backend) [4]      | KITTI 2017  | 1         | 512x512    | weights        |
| dla34_ctrack          | CenterTrack (DLA34 backend) [12]     | NuScenes 3D | 7         | 512x512    | weights        |
| monodepth2            | Monodepth2 [13]                      | KITTI DEPTH | -         | 640x192    | weights-mono   |
| monodepth2            | Monodepth2 [13]                      | KITTI DEPTH | -         | 640x192    | weights-stereo |

References

  1. Redmon, Joseph, and Ali Farhadi. "YOLO9000: better, faster, stronger." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
  2. Redmon, Joseph, and Ali Farhadi. "Yolov3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).
  3. Yu, Fisher, et al. "Deep layer aggregation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
  4. Zhou, Xingyi, Dequan Wang, and Philipp Krähenbühl. "Objects as points." arXiv preprint arXiv:1904.07850 (2019).
  5. Sandler, Mark, et al. "Mobilenetv2: Inverted residuals and linear bottlenecks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
  6. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
  7. Wang, Chien-Yao, et al. "CSPNet: A New Backbone that can Enhance Learning Capability of CNN." arXiv preprint arXiv:1911.11929 (2019).
  8. Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv preprint arXiv:2004.10934 (2020).
  9. Bochkovskiy, Alexey, "Yolo v4, v3 and v2 for Windows and Linux" (https://github.com/AlexeyAB/darknet)
  10. Wang, Chien-Yao, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. "Scaled-YOLOv4: Scaling Cross Stage Partial Network." arXiv preprint arXiv:2011.08036 (2020).
  11. Zhuang, Juntang, et al. "ShelfNet for fast semantic segmentation." Proceedings of the IEEE International Conference on Computer Vision Workshops. 2019.
  12. Zhou, Xingyi, Vladlen Koltun, and Philipp Krähenbühl. "Tracking objects as points." European Conference on Computer Vision. Springer, Cham, 2020.
  13. Godard, Clément, et al. "Digging into self-supervised monocular depth estimation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.

Contributors

The main contributors, in chronological order, are: