vid2vid: Video-to-Video Synthesis
October 13, 2021 · View on GitHub
Pytorch implementation for high-resolution (e.g., 2048x1024) photorealistic video-to-video translation. It can be used for turning semantic label maps into photo-realistic videos, synthesizing people talking from edge maps, or generating human motions from poses.
Project | YouTube(short) | YouTube(full) | arXiv | Paper(full) | Previous Implementation | Two Minute Papers Video
License
Imaginaire is released under NVIDIA Software license. For commercial use, please consult researchinquiries@nvidia.com
Software Installation
For installation, please checkout INSTALL.md.
Hardware Requirement
We trained our models using an NVIDIA DGX1 with 8 V100 32GB GPUs. You can try to use fewer GPUs or reduce the batch size if it does not fit in your GPU memory, but training stability and image quality cannot be guaranteed.
Datasets
Cityscapes
We use the Cityscapes dataset as an example. To train a model on the full dataset, please download it from the official website (registration required). We apply a pre-trained segmentation algorithm to get the corresponding segmentation maps.
Dancing
We use random dancing videos found on YouTube. You can also obtain a dancing dataset by simply recording a video of someone doing different motions for a few minutes. After that, please apply OpenPose on the frames to get the pose information.
Training
The following shows the example commands to train vid2vid on the Cityscapes dataset. To train it on other datasets, replace all cityscapes with dancing.
- Download the dataset and put it in the format as following. For Cityscapes:
cityscapes
└───images
└───seq0001
└───000001.png
└───000002.png
...
└───seq0002
└───000001.png
└───000002.png
...
...
└───seg_maps
└───seq0001
└───000001.png
└───000002.png
...
└───seq0002
└───000001.png
└───000002.png
...
...
For the Dancing dataset:
dancing
└───images
└───seq0001
└───000001.jpg
└───000002.jpg
...
└───poses-openpose
└───seq0001
└───000001.json
└───000002.json
...
- Preprocess the data into LMDB format
python scripts/build_lmdb.py --config configs/projects/vid2vid/cityscapes/ampO1.yaml --data_root [PATH_TO_DATA] --output_root datasets/cityscapes/lmdb/[train | val] --paired
- Train on 8 GPUs with AMPO1
python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/vid2vid/cityscapes/ampO1.yaml
Inference
- Download some test data by running
python ./scripts/download_test_data.py --model_name vid2vid
-
Or arrange your own data into the same format as the training data described above.
-
Translate segmentation masks to images
- Inference command
python inference.py --single_gpu \ --config configs/projects/vid2vid/cityscapes/ampO1.yaml \ --output_dir projects/vid2vid/output/cityscapes
- Inference command
Below we show an example output video:
Citation
If you use this code for your research, please cite our papers.
@inproceedings{wang2018vid2vid,
title = {Video-to-Video Synthesis},
author = {Ting-Chun Wang and Ming-Yu Liu and Jun-Yan Zhu and Guilin Liu
and Andrew Tao and Jan Kautz and Bryan Catanzaro},
booktitle = {Conference on Neural Information Processing Systems (NeurIPS)}},
year = {2018}
}