vid2vid: Video-to-Video Synthesis

October 13, 2021

PyTorch implementation for high-resolution (e.g., 2048x1024) photorealistic video-to-video translation. It can be used for turning semantic label maps into photorealistic videos, synthesizing people talking from edge maps, or generating human motions from poses.

Project | YouTube(short) | YouTube(full) | arXiv | Paper(full) | Previous Implementation | Two Minute Papers Video

License

Imaginaire is released under the NVIDIA Software License. For commercial use, please contact researchinquiries@nvidia.com.

Software Installation

For installation instructions, please see INSTALL.md.

Hardware Requirement

We trained our models using an NVIDIA DGX-1 with 8 V100 32GB GPUs. If the model does not fit in your GPU memory, you can try using fewer GPUs or reducing the batch size, but training stability and image quality cannot be guaranteed.

Datasets

Cityscapes

We use the Cityscapes dataset as an example. To train a model on the full dataset, please download it from the official website (registration required). We apply a pre-trained segmentation algorithm to get the corresponding segmentation maps.

Dancing

We use random dancing videos found on YouTube. You can also obtain a dancing dataset by simply recording a video of someone performing different motions for a few minutes. After that, run OpenPose on the frames to extract the pose information.
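
As a rough illustration of what the pose files contain, the sketch below parses one OpenPose JSON result into per-person keypoints. It assumes the per-frame JSON layout with a `people` list and a flat `pose_keypoints_2d` array, as written by recent OpenPose releases. The helper name `parse_openpose_people` and the sample values are hypothetical, and this is not the loader that vid2vid itself uses.

```python
import json
from typing import List, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence)


def parse_openpose_people(json_text: str) -> List[List[Keypoint]]:
    """Return one list of (x, y, confidence) keypoints per detected person."""
    data = json.loads(json_text)
    people = []
    for person in data.get("people", []):
        # Assumed layout: a flat [x0, y0, c0, x1, y1, c1, ...] list.
        flat = person.get("pose_keypoints_2d", [])
        # Group the flat list into triples, ignoring any incomplete tail.
        keypoints = [tuple(flat[i:i + 3]) for i in range(0, len(flat) - 2, 3)]
        people.append(keypoints)
    return people


# Hand-written sample with two keypoints, for illustration only.
sample = (
    '{"version": 1.3, "people": '
    '[{"pose_keypoints_2d": [10.0, 20.0, 0.9, 30.0, 40.0, 0.8]}]}'
)
people = parse_openpose_people(sample)
print(len(people), "person(s); second keypoint:", people[0][1])
```

A low confidence value usually means the joint was not detected reliably, so it can be worth inspecting a few frames this way before building the dataset.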

Training

The following example commands train vid2vid on the Cityscapes dataset. To train on the Dancing dataset instead, replace every occurrence of cityscapes with dancing.

  • Download the dataset and arrange it in the following format. For Cityscapes:
cityscapes
└───images
    └───seq0001
        └───000001.png
        └───000002.png
        ...
    └───seq0002
        └───000001.png
        └───000002.png
        ...
    ...
└───seg_maps
    └───seq0001
        └───000001.png
        └───000002.png
        ...
    └───seq0002
        └───000001.png
        └───000002.png
        ...
    ...

For the Dancing dataset:

dancing
└───images
    └───seq0001
        └───000001.jpg
        └───000002.jpg
        ...
└───poses-openpose
    └───seq0001
        └───000001.json
        └───000002.json
        ...
  • Preprocess the data into LMDB format
python scripts/build_lmdb.py --config configs/projects/vid2vid/cityscapes/ampO1.yaml --data_root [PATH_TO_DATA] --output_root datasets/cityscapes/lmdb/[train | val] --paired
  • Train on 8 GPUs with AMP O1 (mixed precision)
python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/vid2vid/cityscapes/ampO1.yaml
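
Before building the LMDB, it may help to confirm that the folders follow the layout described above, with every sequence present under each data type and frame names matching one to one. The sketch below is a hypothetical standalone helper (the names `frame_stems` and `check_paired_layout` are not part of the repository). It compares file names without their extensions, so it also works for the Dancing layout, where images are `.jpg` and poses are `.json`.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Set


def frame_stems(data_dir: Path) -> Dict[str, Set[str]]:
    """Map each sequence folder name to its frame names without extension."""
    if not data_dir.is_dir():
        return {}
    return {
        seq.name: {f.stem for f in seq.iterdir() if f.is_file()}
        for seq in sorted(data_dir.iterdir())
        if seq.is_dir()
    }


def check_paired_layout(root: Path, data_types: List[str]) -> List[str]:
    """Return readable problems; an empty list means the layout is consistent."""
    per_type = {t: frame_stems(root / t) for t in data_types}
    all_seqs = sorted(set().union(*(d.keys() for d in per_type.values())))
    reference = data_types[0]
    problems: List[str] = []
    for seq in all_seqs:
        missing = [t for t in data_types if seq not in per_type[t]]
        if missing:
            problems.extend(f"{t}/{seq}: sequence folder is missing" for t in missing)
            continue
        ref_frames = per_type[reference][seq]
        for t in data_types[1:]:
            # Frames present in only one of the two folders.
            diff = ref_frames ^ per_type[t][seq]
            if diff:
                problems.append(
                    f"{t}/{seq}: {len(diff)} unmatched frame(s), e.g. {sorted(diff)[0]}"
                )
    return problems


# Small self-contained demo: seg_maps lacks frame 000002 on purpose.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp) / "cityscapes"
    for data_type, frames in (("images", ["000001", "000002"]),
                              ("seg_maps", ["000001"])):
        seq_dir = demo_root / data_type / "seq0001"
        seq_dir.mkdir(parents=True)
        for name in frames:
            (seq_dir / f"{name}.png").touch()
    demo_problems = check_paired_layout(demo_root, ["images", "seg_maps"])

print(demo_problems)
```

For a real dataset, one would call `check_paired_layout(Path("[PATH_TO_DATA]"), ["images", "seg_maps"])` for Cityscapes, or pass `["images", "poses-openpose"]` for Dancing, and fix any reported sequences before running the preprocessing step.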

Inference

  • Download some test data by running
python ./scripts/download_test_data.py --model_name vid2vid
  • Or arrange your own data into the same format as the training data described above.

  • Translate segmentation masks to images

    • Inference command
      python inference.py --single_gpu \
      --config configs/projects/vid2vid/cityscapes/ampO1.yaml \
      --output_dir projects/vid2vid/output/cityscapes
      

Below we show an example output video:

output

Citation

If you use this code for your research, please cite our paper.

@inproceedings{wang2018vid2vid,
   title     = {Video-to-Video Synthesis},
   author    = {Ting-Chun Wang and Ming-Yu Liu and Jun-Yan Zhu and Guilin Liu
                and Andrew Tao and Jan Kautz and Bryan Catanzaro},
   booktitle = {Conference on Neural Information Processing Systems (NeurIPS)},
   year      = {2018}
}