Motion-DeepLab

July 12, 2022

Motion-DeepLab is a unified model for video panoptic segmentation, a task that requires segmenting and tracking every pixel. It is built on top of Panoptic-DeepLab and adds a branch that regresses each pixel to its center location in the previous frame. Instead of a single RGB image, the network input consists of two consecutive frames (the current and the previous one) together with the center heatmap predicted for the previous frame, similar to CenterTrack [1]. The regressed offsets are used to assign consistent track IDs to all instances throughout a video sequence.
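
The exact tracking logic lives in the DeepLab2 code; as a rough illustration of the idea (not the repository's implementation), the sketch below greedily matches each current instance, via its regressed previous-frame center, to the nearest previous-frame instance. The function name, arguments, and the `max_dist` threshold are assumptions made for the example.

```python
import numpy as np

def assign_track_ids(curr_centers, offsets_to_prev, prev_centers, prev_track_ids,
                     next_track_id, max_dist=30.0):
    """Illustrative CenterTrack-style greedy ID assignment (not the DeepLab2 code).

    curr_centers:    (N, 2) instance centers (y, x) in the current frame.
    offsets_to_prev: (N, 2) regressed offsets from each current center to its
                     predicted location in the previous frame.
    prev_centers:    (M, 2) instance centers in the previous frame.
    prev_track_ids:  (M,) track IDs of the previous-frame instances.
    next_track_id:   first unused ID for newly appearing instances.
    """
    predicted_prev = curr_centers + offsets_to_prev  # where each instance "was"
    assigned, used = [], set()
    for pred in predicted_prev:
        j = -1
        if len(prev_centers) > 0:
            dists = np.linalg.norm(prev_centers - pred, axis=1)
            j = int(np.argmin(dists))
        if j >= 0 and j not in used and dists[j] < max_dist:
            assigned.append(int(prev_track_ids[j]))  # continue an existing track
            used.add(j)
        else:
            assigned.append(next_track_id)           # start a new track
            next_track_id += 1
    return assigned, next_track_id
```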

Prerequisites

  1. Make sure the software is properly installed.

  2. Make sure the target dataset is correctly prepared (e.g., KITTI-STEP).

  3. Download the Cityscapes pretrained checkpoints listed below, and update the initial_checkpoint path in the config files.
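
For step 3, the `initial_checkpoint` field in the downloaded `.textproto` config has to point at the unpacked checkpoint. Editing the field by hand in a text editor works just as well; a minimal programmatic sketch is shown below, where both paths are placeholders and only the `initial_checkpoint` field name comes from the configs themselves.

```python
import re

# Hedged sketch: point a DeepLab2 .textproto config at a downloaded checkpoint.
# Both paths below are placeholders; substitute the config you are using and the
# location where you unpacked the Cityscapes pretrained checkpoint.
config_path = "configs/kitti/motion_deeplab/resnet50_os32.textproto"  # hypothetical file name
ckpt_path = "/path/to/cityscapes_pretrained/ckpt"                     # your download location

with open(config_path) as f:
    text = f.read()

# Rewrite whatever value initial_checkpoint currently holds.
text = re.sub(r'initial_checkpoint:\s*".*"',
              f'initial_checkpoint: "{ckpt_path}"', text)

with open(config_path, "w") as f:
    f.write(text)
```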

Model Zoo

KITTI-STEP Video Panoptic Segmentation

Initial checkpoint: We provide several Cityscapes pretrained checkpoints for KITTI-STEP experiments. Please download them and update the initial_checkpoint path in the config files.

| Model | Download | Note |
| :--- | :--- | :--- |
| Panoptic-DeepLab | initial_checkpoint | The initial checkpoint for the single-frame baseline. |
| Motion-DeepLab | initial_checkpoint | The initial checkpoint for the two-frame baseline. |

We also provide checkpoints pretrained on KITTI-STEP below. If you would like to train these models yourself, the corresponding config files are under configs/kitti/panoptic_deeplab (single-frame-baseline) and configs/kitti/motion_deeplab (two-frame-baseline).
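
To launch training from one of these configs, DeepLab2's trainer is typically invoked as in the sketch below; the script path and flag names follow the DeepLab2 getting-started instructions but should be treated as assumptions and checked against the repository, and the config file name is a placeholder.

```python
import subprocess

# Hedged sketch: start a KITTI-STEP training run from the repository root.
# --config_file / --mode / --model_dir / --num_gpus follow DeepLab2's trainer;
# verify the exact script path and flags against the repository docs.
subprocess.run([
    "python", "trainer/train.py",
    "--config_file=configs/kitti/motion_deeplab/resnet50_os32.textproto",  # placeholder name
    "--mode=train",
    "--model_dir=/tmp/motion_deeplab_kitti",
    "--num_gpus=1",
], check=True)
```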

Panoptic-DeepLab (single-frame-baseline):

| Backbone | Output stride | Dataset split | PQ† | AP<sup>Mask</sup> | mIoU |
| :--- | :---: | :--- | :---: | :---: | :---: |
| ResNet-50 (config, ckpt) | 32 | KITTI-STEP train set | 48.31 | 42.22 | 71.16 |
| ResNet-50 (config, ckpt) | 32 | KITTI-STEP trainval set | - | - | - |

†: See Q4 in FAQ.

This single-frame baseline can be combined with state-of-the-art optical flow methods (e.g., RAFT [2]) to propagate mask predictions from one frame to the next, as shown in our STEP paper.
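
As a rough illustration of that propagation step (not the STEP paper's exact procedure), the sketch below warps the previous frame's track-ID mask into the current frame with the estimated flow and transfers an ID wherever the overlap with a current instance is large enough; the flow convention and the `overlap_thresh` value are assumptions.

```python
import numpy as np

def propagate_track_ids(prev_id_mask, flow, curr_inst_mask, overlap_thresh=0.5):
    """Hedged sketch: carry track IDs forward with optical flow (e.g., from RAFT).

    prev_id_mask:   (H, W) int array of track IDs in the previous frame (0 = background).
    flow:           (H, W, 2) flow mapping each current pixel back to the previous
                    frame, in (dy, dx) pixel units (an assumption about the convention).
    curr_inst_mask: (H, W) int array of per-frame instance IDs in the current frame.
    Returns {current instance ID: propagated track ID or None}.
    """
    h, w = curr_inst_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    prev_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    prev_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    warped_prev = prev_id_mask[prev_y, prev_x]  # previous IDs aligned to the current frame

    id_map = {}
    for inst_id in np.unique(curr_inst_mask):
        if inst_id == 0:
            continue
        region = curr_inst_mask == inst_id
        cand, counts = np.unique(warped_prev[region], return_counts=True)
        counts, cand = counts[cand != 0], cand[cand != 0]  # drop background votes
        id_map[int(inst_id)] = None                        # default: new track
        if len(cand) > 0 and counts.max() / region.sum() >= overlap_thresh:
            id_map[int(inst_id)] = int(cand[np.argmax(counts)])
    return id_map
```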

Motion-DeepLab (two-frame-baseline):

| Backbone | Output stride | Dataset split | PQ† | AP<sup>Mask</sup> | mIoU | STQ |
| :--- | :---: | :--- | :---: | :---: | :---: | :---: |
| ResNet-50 (config, ckpt) | 32 | KITTI-STEP train set | 42.08 | 37.52 | 63.15 | 57.7 |
| ResNet-50 (config, ckpt) | 32 | KITTI-STEP trainval set | - | - | - | - |

†: See Q4 in FAQ.

MOTChallenge-STEP Video Panoptic Segmentation

Initial checkpoint: We provide several Cityscapes pretrained checkpoints for MOTChallenge-STEP experiments. Please download them and update the initial_checkpoint path in the config files.

| Model | Download | Note |
| :--- | :--- | :--- |
| Panoptic-DeepLab | initial_checkpoint | The initial checkpoint for the single-frame baseline. |
| Motion-DeepLab | initial_checkpoint | The initial checkpoint for the two-frame baseline. |

We also provide checkpoints pretrained on MOTChallenge-STEP below. If you would like to train these models yourself, the corresponding config files are under configs/motchallenge/panoptic_deeplab (single-frame-baseline) and configs/motchallenge/motion_deeplab (two-frame-baseline).

Panoptic-DeepLab (single-frame-baseline):

TODO: Add pretrained checkpoint.

| Backbone | Output stride | Dataset split | PQ† | AP<sup>Mask</sup> | mIoU |
| :--- | :---: | :--- | :---: | :---: | :---: |
| ResNet-50 (config) | 32 | MOTChallenge-STEP train set | ? | ? | ? |
| ResNet-50 | 32 | MOTChallenge-STEP trainval set | - | - | - |

†: See Q4 in FAQ.

This single-frame baseline can be combined with state-of-the-art optical flow methods (e.g., RAFT [2]) to propagate mask predictions from one frame to the next, as shown in our STEP paper.

Motion-DeepLab (two-frame-baseline):

TODO: Add pretrained checkpoint.

| Backbone | Output stride | Dataset split | PQ† | AP<sup>Mask</sup> | mIoU | STQ |
| :--- | :---: | :--- | :---: | :---: | :---: | :---: |
| ResNet-50 (config) | 32 | MOTChallenge-STEP train set | ? | ? | ? | ? |
| ResNet-50 | 32 | MOTChallenge-STEP trainval set | - | - | - | - |

†: See Q4 in FAQ.

Citing Motion-DeepLab

If you find this code helpful in your research or wish to refer to the baseline results, please use the following BibTeX entry.

  • STEP (Motion-DeepLab):
@article{step_2021,
 author = {Weber, Mark and Xie, Jun and Collins, Maxwell and Zhu, Yukun and Voigtlaender, Paul and Adam, Hartwig and Green, Bradley and Geiger, Andreas and Leibe, Bastian and Cremers, Daniel and O\v{s}ep, Aljo\v{s}a and Leal-Taix\'{e}, Laura and Chen, Liang-Chieh},
 journal = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
 title = {{STEP}: Segmenting and Tracking Every Pixel},
 year = {2021}
}

References

  1. Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. In ECCV, 2020.

  2. Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020.