Motion-DeepLab

July 12, 2022

Motion-DeepLab is a unified model for video panoptic segmentation, a task that requires segmenting and tracking every pixel. It is built on top of Panoptic-DeepLab and adds a branch that regresses each pixel to its center location in the previous frame. Instead of a single RGB image, the network input consists of two consecutive frames (the current and the previous one) together with the center heatmap predicted for the previous frame, similar to CenterTrack [1]. The regressed offsets are used to assign consistent track IDs to all instances throughout a video sequence.
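
The exact tracking logic lives in the DeepLab2 code; as a rough illustration of the idea (not the repository's implementation), the sketch below greedily matches each current instance, via its regressed previous-frame center, to the nearest previous-frame instance. The function name, arguments, and the `max_dist` threshold are assumptions made for the example.

```python
import numpy as np

def assign_track_ids(curr_centers, offsets_to_prev, prev_centers, prev_track_ids,
                     next_track_id, max_dist=30.0):
    """Illustrative CenterTrack-style greedy ID assignment (not the DeepLab2 code).

    curr_centers:    (N, 2) instance centers (y, x) in the current frame.
    offsets_to_prev: (N, 2) regressed offsets from each current center to its
                     predicted location in the previous frame.
    prev_centers:    (M, 2) instance centers in the previous frame.
    prev_track_ids:  (M,) track IDs of the previous-frame instances.
    next_track_id:   first unused ID for newly appearing instances.
    """
    predicted_prev = curr_centers + offsets_to_prev  # where each instance "was"
    assigned, used = [], set()
    for pred in predicted_prev:
        j = -1
        if len(prev_centers) > 0:
            dists = np.linalg.norm(prev_centers - pred, axis=1)
            j = int(np.argmin(dists))
        if j >= 0 and j not in used and dists[j] < max_dist:
            assigned.append(int(prev_track_ids[j]))  # continue an existing track
            used.add(j)
        else:
            assigned.append(next_track_id)           # start a new track
            next_track_id += 1
    return assigned, next_track_id
```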

Prerequisites

  1. Make sure the software is properly installed.

  2. Make sure the target dataset is correctly prepared (e.g., KITTI-STEP).

  3. Download the Cityscapes pretrained checkpoints listed below, and update the initial_checkpoint path in the config files.
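
For step 3, the `initial_checkpoint` field in the downloaded `.textproto` config has to point at the unpacked checkpoint. Editing the field by hand in a text editor works just as well; a minimal programmatic sketch is shown below, where both paths are placeholders and only the `initial_checkpoint` field name comes from the configs themselves.

```python
import re

# Hedged sketch: point a DeepLab2 .textproto config at a downloaded checkpoint.
# Both paths below are placeholders; substitute the config you are using and the
# location where you unpacked the Cityscapes pretrained checkpoint.
config_path = "configs/kitti/motion_deeplab/resnet50_os32.textproto"  # hypothetical file name
ckpt_path = "/path/to/cityscapes_pretrained/ckpt"                     # your download location

with open(config_path) as f:
    text = f.read()

# Rewrite whatever value initial_checkpoint currently holds.
text = re.sub(r'initial_checkpoint:\s*".*"',
              f'initial_checkpoint: "{ckpt_path}"', text)

with open(config_path, "w") as f:
    f.write(text)
```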

Model Zoo

KITTI-STEP Video Panoptic Segmentation

Initial checkpoint: We provide several Cityscapes pretrained checkpoints for KITTI-STEP experiments. Please download them and update the initial_checkpoint path in the config files.

| Model | Download | Note |
| :--- | :--- | :--- |
| Panoptic-DeepLab | initial_checkpoint | The initial checkpoint for the single-frame baseline. |
| Motion-DeepLab | initial_checkpoint | The initial checkpoint for the two-frame baseline. |

We also provide checkpoints pretrained on KITTI-STEP below. If you would like to train these models yourself, the corresponding config files are under configs/kitti/panoptic_deeplab (single-frame-baseline) and configs/kitti/motion_deeplab (two-frame-baseline).
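
To launch training from one of these configs, DeepLab2's trainer is typically invoked as in the sketch below; the script path and flag names follow the DeepLab2 getting-started instructions but should be treated as assumptions and checked against the repository, and the config file name is a placeholder.

```python
import subprocess

# Hedged sketch: start a KITTI-STEP training run from the repository root.
# --config_file / --mode / --model_dir / --num_gpus follow DeepLab2's trainer;
# verify the exact script path and flags against the repository docs.
subprocess.run([
    "python", "trainer/train.py",
    "--config_file=configs/kitti/motion_deeplab/resnet50_os32.textproto",  # placeholder name
    "--mode=train",
    "--model_dir=/tmp/motion_deeplab_kitti",
    "--num_gpus=1",
], check=True)
```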

Panoptic-DeepLab (single-frame-baseline):

| Backbone | Output stride | Dataset split | PQ† | AP<sup>Mask</sup> | mIoU |
| :--- | :---: | :--- | :---: | :---: | :---: |
| ResNet-50 (config, ckpt) | 32 | KITTI-STEP train set | 48.31 | 42.22 | 71.16 |
| ResNet-50 (config, ckpt) | 32 | KITTI-STEP trainval set | - | - | - |

†: See Q4 in FAQ.

This single-frame baseline can be combined with state-of-the-art optical flow methods (e.g., RAFT [2]) to propagate mask predictions from one frame to the next, as shown in our STEP paper.
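
As a rough illustration of that propagation step (not the STEP paper's exact procedure), the sketch below warps the previous frame's track-ID mask into the current frame with the estimated flow and transfers an ID wherever the overlap with a current instance is large enough; the flow convention and the `overlap_thresh` value are assumptions.

```python
import numpy as np

def propagate_track_ids(prev_id_mask, flow, curr_inst_mask, overlap_thresh=0.5):
    """Hedged sketch: carry track IDs forward with optical flow (e.g., from RAFT).

    prev_id_mask:   (H, W) int array of track IDs in the previous frame (0 = background).
    flow:           (H, W, 2) flow mapping each current pixel back to the previous
                    frame, in (dy, dx) pixel units (an assumption about the convention).
    curr_inst_mask: (H, W) int array of per-frame instance IDs in the current frame.
    Returns {current instance ID: propagated track ID or None}.
    """
    h, w = curr_inst_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    prev_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    prev_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    warped_prev = prev_id_mask[prev_y, prev_x]  # previous IDs aligned to the current frame

    id_map = {}
    for inst_id in np.unique(curr_inst_mask):
        if inst_id == 0:
            continue
        region = curr_inst_mask == inst_id
        cand, counts = np.unique(warped_prev[region], return_counts=True)
        counts, cand = counts[cand != 0], cand[cand != 0]  # drop background votes
        id_map[int(inst_id)] = None                        # default: new track
        if len(cand) > 0 and counts.max() / region.sum() >= overlap_thresh:
            id_map[int(inst_id)] = int(cand[np.argmax(counts)])
    return id_map
```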

Motion-DeepLab (two-frame-baseline):

| Backbone | Output stride | Dataset split | PQ† | AP<sup>Mask</sup> | mIoU | STQ |
| :--- | :---: | :--- | :---: | :---: | :---: | :---: |
| ResNet-50 (config, ckpt) | 32 | KITTI-STEP train set | 42.08 | 37.52 | 63.15 | 57.7 |
| ResNet-50 (config, ckpt) | 32 | KITTI-STEP trainval set | - | - | - | - |

†: See Q4 in FAQ.

MOTChallenge-STEP Video Panoptic Segmentation

Initial checkpoint: We provide several Cityscapes pretrained checkpoints for MOTChallenge-STEP experiments. Please download them and update the initial_checkpoint path in the config files.

| Model | Download | Note |
| :--- | :--- | :--- |
| Panoptic-DeepLab | initial_checkpoint | The initial checkpoint for the single-frame baseline. |
| Motion-DeepLab | initial_checkpoint | The initial checkpoint for the two-frame baseline. |

We also provide checkpoints pretrained on MOTChallenge-STEP below. If you would like to train these models yourself, the corresponding config files are under configs/motchallenge/panoptic_deeplab (single-frame-baseline) and configs/motchallenge/motion_deeplab (two-frame-baseline).

Panoptic-DeepLab (single-frame-baseline):

TODO: Add pretrained checkpoint.

| Backbone | Output stride | Dataset split | PQ† | AP<sup>Mask</sup> | mIoU |
| :--- | :---: | :--- | :---: | :---: | :---: |
| ResNet-50 (config) | 32 | MOTChallenge-STEP train set | ? | ? | ? |
| ResNet-50 | 32 | MOTChallenge-STEP trainval set | - | - | - |

†: See Q4 in FAQ.

This single-frame baseline can be combined with state-of-the-art optical flow methods (e.g., RAFT [2]) to propagate mask predictions from one frame to the next, as shown in our STEP paper.

Motion-DeepLab (two-frame-baseline):

TODO: Add pretrained checkpoint.

| Backbone | Output stride | Dataset split | PQ† | AP<sup>Mask</sup> | mIoU | STQ |
| :--- | :---: | :--- | :---: | :---: | :---: | :---: |
| ResNet-50 (config) | 32 | MOTChallenge-STEP train set | ? | ? | ? | ? |
| ResNet-50 | 32 | MOTChallenge-STEP trainval set | - | - | - | - |

†: See Q4 in FAQ.

Citing Motion-DeepLab

If you find this code helpful in your research or wish to refer to the baseline results, please use the following BibTeX entry.

  • STEP (Motion-DeepLab):
@article{step_2021,
 author = {Weber, Mark and Xie, Jun and Collins, Maxwell and Zhu, Yukun and Voigtlaender, Paul and Adam, Hartwig and Green, Bradley and Geiger, Andreas and Leibe, Bastian and Cremers, Daniel and O\v{s}ep, Aljo\v{s}a and Leal-Taix\'{e}, Laura and Chen, Liang-Chieh},
 journal = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
 title = {{STEP}: Segmenting and Tracking Every Pixel},
 year = {2021}
}

References

  1. Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. In ECCV, 2020.

  2. Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020.