Robust Re-Identification by Multiple Views Knowledge Distillation

July 3, 2023

This repository contains PyTorch code for the ECCV 2020 paper "Robust Re-Identification by Multiple Views Knowledge Distillation" [arXiv].

VKD - Overview

@inproceedings{porrello2020robust,    
    title={Robust Re-Identification by Multiple Views Knowledge Distillation},
    author={Porrello, Angelo and Bergamini, Luca and Calderara, Simone},
    booktitle={European Conference on Computer Vision},
    pages={93--110},
    year={2020},
    organization={Springer}
}

Installation Note

Tested with Python 3.6.8 on Ubuntu (17.04, 18.04).

  • Set up an empty pip environment
  • Install packages using pip install -r requirements.txt
  • Install torch 1.3.1 using pip install torch==1.3.1+cu92 torchvision==0.4.2+cu92 -f https://download.pytorch.org/whl/torch_stable.html
  • Place datasets in ./datasets/ (please note that you may need to request some of them from their respective authors)
  • Run scripts from commands.txt

Please note that if you're running the code from PyCharm (or another IDE), you may need to manually set the working directory to PROJECT_PATH.

VKD Training (MARS [1])

Data preparation

  • Create the folder ./datasets/mars
  • Download the dataset from here
  • Unzip the data and place the two folders (bbox_train and bbox_test) inside the MARS [1] folder created above
  • Download metadata from here
  • Place them in a folder named info under the same path
  • You should end up with the following structure:
PROJECT_PATH/datasets/mars/
|-- bbox_train/
|-- bbox_test/
|-- info/
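
Before launching a training run, the layout above can be checked with a few lines of Python. This is only a minimal sketch and not part of the repository: the helper name missing_mars_folders is hypothetical, and it assumes the script is started from PROJECT_PATH.

```python
from pathlib import Path

# Sub-folders listed in the expected MARS structure.
EXPECTED_SUBDIRS = ("bbox_train", "bbox_test", "info")


def missing_mars_folders(project_path):
    """Return the expected MARS sub-folders that are absent under project_path."""
    mars_root = Path(project_path) / "datasets" / "mars"
    return [name for name in EXPECTED_SUBDIRS if not (mars_root / name).is_dir()]


if __name__ == "__main__":
    # "." assumes the current working directory is PROJECT_PATH.
    missing = missing_mars_folders(".")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("MARS layout looks complete.")
```

An empty list means all three folders were found; it does not verify the content of the folders.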

Teacher-Student Training

First step: the backbone network is trained in the standard Video-To-Video setting. In this stage, each training example comprises N images drawn from the same tracklet (N=8 by default; you can change it through the argument --num_train_images).

# To train ResNet-50 on MARS (teacher, first step) run:
python ./tools/train_v2v.py mars --backbone resnet50 --num_train_images 8 --p 8 --k 4 --exp_name base_mars_resnet50 --first_milestone 100 --step_milestone 100
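
The way a first-step training example is assembled can be pictured with a small standard-library sketch. It is not the repository's data loader: sample_tracklet_frames and images_per_batch are hypothetical helpers, the with-replacement fallback for short tracklets is an assumption, and the batch arithmetic assumes that --p and --k follow the common "P identities, K examples per identity" sampling convention.

```python
import random


def sample_tracklet_frames(frames, num_train_images=8, rng=random):
    """Draw num_train_images frames from a single tracklet.

    Samples without replacement when the tracklet is long enough and
    with replacement otherwise (an assumption about short tracklets).
    """
    if len(frames) >= num_train_images:
        return rng.sample(frames, num_train_images)
    return [rng.choice(frames) for _ in range(num_train_images)]


def images_per_batch(p, k, num_train_images):
    """Images in one batch, assuming P identities x K examples x N frames."""
    return p * k * num_train_images


# Hypothetical file names standing in for one MARS tracklet.
tracklet = [f"frame_{i:04d}.jpg" for i in range(30)]
example = sample_tracklet_frames(tracklet, num_train_images=8)
print(len(example), images_per_batch(p=8, k=4, num_train_images=8))
```

Under these assumptions, the command above (--p 8 --k 4 --num_train_images 8) would process 8 x 4 x 8 images per batch.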

Second step: we appoint the trained backbone as the teacher and freeze its parameters. Then, a new network is instantiated in the role of the student. We feed N views (i.e. images captured from multiple cameras) to the teacher and ask the student to mimic the teacher's outputs from fewer frames (M=2 by default; --num_student_images).

# To train a ResVKD-50 (student) run:
python ./tools/train_distill.py mars ./logs/base_mars_resnet50 --exp_name distill_mars_resnet50 --p 12 --k 4 --step_milestone 150 --num_epochs 500
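
To make the "mimic the teacher from fewer views" idea concrete, here is a minimal standard-library sketch of a generic knowledge-distillation term. It is not the objective implemented in train_distill.py: the function names, the temperature value, and the choice to give the student a random subset of the teacher's views are all assumptions made for illustration.

```python
import math
import random


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) between temperature-softened distributions.

    temperature=4.0 is a hypothetical value, not the repository default.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def pick_student_views(teacher_views, num_student_images=2, rng=random):
    """Give the student a random subset of the views shown to the teacher."""
    return rng.sample(teacher_views, num_student_images)


teacher_views = [f"view_{i}" for i in range(8)]  # N = 8 views for the teacher
student_views = pick_student_views(teacher_views)  # M = 2 views for the student
print(student_views, kd_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8]))
```

The loss is zero when the student reproduces the teacher's logits exactly and grows as the two distributions drift apart, which is the behaviour a frozen teacher is meant to enforce on the student.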

Model Zoo

We provide pre-trained checkpoints in two zip files: baseline.zip contains the weights of the teacher networks, and distilled.zip those of the students. To evaluate ResNet-50 and ResVKD-50 on MARS, proceed as follows:

  • Download baseline.zip from here and distilled.zip from here (~4.8 GB)
  • Unzip the two folders inside the PROJECT_PATH/logs folder
  • Then, you can evaluate both networks using the eval.py script:
python ./tools/eval.py mars ./logs/baseline_public/mars/base_mars_resnet50 --trinet_chk_name chk_end
python ./tools/eval.py mars ./logs/distilled_public/mars/selfdistill/distill_mars_resnet50 --trinet_chk_name chk_di_1

You should end up with the following results on MARS (see Tab.1 of the paper for VeRi-776 and Duke-Video-ReID):

Backbone        top1 I2V   mAP I2V   top1 V2V   mAP V2V

ResNet-34       80.81      70.74     86.67      78.03
ResVKD-34       82.17      73.68     87.83      79.50
ResNet-50       82.22      73.38     87.88      81.13
ResVKD-50       83.89      77.27     88.74      82.22
ResNet-101      82.78      74.94     88.59      81.66
ResVKD-101      85.91      77.64     89.60      82.65

ResNet-50bam    82.58      74.11     88.54      81.19
ResVKD-50bam    84.34      78.13     89.39      83.07

DenseNet-121    82.68      74.34     89.75      81.93
DenseVKD-121    84.04      77.09     89.80      82.84

MobileNet-V2    78.64      67.94     85.96      77.10
MobileVKD-V2    83.33      73.95     88.13      79.62
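
To read the results as student-versus-teacher gains, a short sketch like the following can help. The numbers are copied from the MARS results above for two backbone pairs; the RESULTS dictionary and the student_gain helper are hypothetical and not part of the repository.

```python
# (top1 I2V, mAP I2V, top1 V2V, mAP V2V), copied from the MARS results above.
RESULTS = {
    "ResNet-50": (82.22, 73.38, 87.88, 81.13),
    "ResVKD-50": (83.89, 77.27, 88.74, 82.22),
    "MobileNet-V2": (78.64, 67.94, 85.96, 77.10),
    "MobileVKD-V2": (83.33, 73.95, 88.13, 79.62),
}
METRICS = ("top1 I2V", "mAP I2V", "top1 V2V", "mAP V2V")


def student_gain(teacher, student):
    """Per-metric difference (student minus teacher), rounded to two decimals."""
    pairs = zip(METRICS, RESULTS[teacher], RESULTS[student])
    return {metric: round(s - t, 2) for metric, t, s in pairs}


for teacher, student in (("ResNet-50", "ResVKD-50"), ("MobileNet-V2", "MobileVKD-V2")):
    print(student, "vs", teacher, student_gain(teacher, student))
```

For both pairs the gain is positive on every metric, and it is larger in the Image-To-Video (I2V) columns than in the Video-To-Video (V2V) ones.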

Teacher-Student Explanations

As discussed in the main paper, we leveraged Grad-CAM [2] to highlight the input regions that were considered most important for predicting the identity. We performed the same analysis for both the teacher network and the student one: as can be seen, the latter pays more attention to the subject of interest than its teacher does.

Model Explanation

You can draw the heatmaps with the following command:

python -u ./tools/save_heatmaps.py mars <path-to-teacher-net> --chk_net1 <teacher-checkpoint-name> <path-to-student-net> --chk_net2 <student-checkpoint-name> --dest_path <output-dir>

References

  1. Zheng, L., Bie, Z., Sun, Y., Wang, J., Su, C., Wang, S., Tian, Q.: MARS: A video benchmark for large-scale person re-identification. In: European Conference on Computer Vision (2016)
  2. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618-626 (2017)