Video-Infinity

June 25, 2024

Video-Infinity: Distributed Long Video Generation
Zhenxiong Tan, Xingyi Yang, Songhua Liu, and Xinchao Wang
Learning and Vision Lab, National University of Singapore

TL;DR (Too Long; Didn't Read)

Video-Infinity generates long videos quickly using multiple GPUs without extra training. Feel free to visit our project page for more information and generated videos.

Features

  • Distributed 🌐: Utilizes multiple GPUs to generate long-form videos.
  • High-Speed 🚀: Produces 2,300 frames in just 5 minutes.
  • Training-Free 🎓: Generates long videos without requiring additional training for existing models.

Setup

Installation Environment

conda create -n video_infinity_vc2 python=3.10
conda activate video_infinity_vc2
pip install -r requirements.txt

Usage

Quick Start

  • Basic Usage
    python inference.py --config examples/config.json
  • Multi-Prompts
    python inference.py --config examples/multi_prompts.json
  • Single GPU
    python inference.py --config examples/single_gpu.json

Config

Basic Config

| Parameter | Description |
| --- | --- |
| devices | The list of GPU devices to use. |
| base_path | The path to save the generated videos. |

Pipeline Config

| Parameter | Description |
| --- | --- |
| prompts | The list of text prompts. Note: The number of prompts should be greater than the number of GPUs. |
| file_name | The name of the generated video. |
| num_frames | The number of frames to generate on each GPU. |
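
Because num_frames is counted per GPU, the length of the final video depends on how many devices are listed. The following is a minimal Python sketch that assembles the parameters above into a dictionary and estimates the total frame count. The nesting under a `pipeline` key, the helper names `build_config` and `total_frames`, and all example values are assumptions made for illustration; the files under examples/ show the actual layout. The estimate also assumes that the clips from all GPUs are simply concatenated.

```python
import json


def build_config(devices, base_path, prompts, file_name, num_frames):
    """Assemble a config dict from the documented parameter names.

    The "pipeline" grouping is hypothetical; check examples/config.json
    for the real structure before reusing this.
    """
    return {
        "devices": list(devices),
        "base_path": base_path,
        "pipeline": {
            "prompts": list(prompts),
            "file_name": file_name,
            "num_frames": num_frames,  # frames generated on EACH GPU
        },
    }


def total_frames(config):
    """Estimate the full video length, assuming one clip per device."""
    return config["pipeline"]["num_frames"] * len(config["devices"])


# Illustrative values only.
config = build_config(
    devices=[0, 1],
    base_path="./outputs",
    prompts=["A river at dawn.", "A river at noon.", "A river at dusk."],
    file_name="river_demo",
    num_frames=24,
)

print(json.dumps(config, indent=2))
print("estimated total frames:", total_frames(config))
```

Under these assumptions, two devices with 24 frames each yield an estimated 48 frames; adding devices lengthens the video without changing num_frames.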

Video-Infinity Config

| Parameter | Description |
| --- | --- |
| *.padding | The number of local context frames. |
| attn.topk | The number of global context frames for the Attention model. |
| attn.local_phase | When the denoising timestep is less than t, the attention is biased: local_bias is added to the local context frames and global_bias to the global context frames. |
| attn.global_phase | Similar to local_phase, but the attention is biased when the denoising timestep is greater than t. |
| attn.token_num_scale | If True, the scale factor is rescaled by the number of tokens. Default is False. For more details, refer to this paper. |
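
The gating behavior of local_phase and global_phase can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the project's implementation: the function name `phase_bias`, the dictionary layout, and the example timestep and bias values are all assumptions.

```python
def phase_bias(timestep, phase, kind):
    """Return (local_bias, global_bias) to add at this denoising timestep.

    kind="local":  the phase is active while timestep < phase["t"]
    kind="global": the phase is active while timestep > phase["t"]
    An inactive phase contributes no bias.
    """
    if kind == "local":
        active = timestep < phase["t"]
    elif kind == "global":
        active = timestep > phase["t"]
    else:
        raise ValueError(f"unknown phase kind: {kind!r}")

    if not active:
        return 0.0, 0.0
    return phase["local_bias"], phase["global_bias"]


# Hypothetical settings; the real timestep scale and bias values may differ.
local_phase = {"t": 300, "local_bias": 1.0, "global_bias": 0.0}
global_phase = {"t": 700, "local_bias": 0.0, "global_bias": 1.0}

for step in (900, 500, 100):
    print(step,
          phase_bias(step, local_phase, "local"),
          phase_bias(step, global_phase, "global"))
```

With these example values, the global phase is active only at the high timestep, the local phase only at the low one, and neither applies in between.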

How to Set Config

  • To avoid losing high-frequency information, we recommend keeping the sum of padding and attn.topk below 24 (close to the default number of frames in the VideoCrafter2 model).
    • If you want a larger padding or attn.topk, set attn.token_num_scale to True.
  • Higher values of local_phase.t and global_phase.t produce more stable videos but may reduce their diversity.
  • More padding will provide more local context.
  • A higher attn.topk will bring about overall stability in the videos.
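
The first recommendation above can be turned into a quick sanity check before launching a run. The helper below is a hypothetical sketch, not part of the repository: the name `check_context_budget` and its return convention are assumptions, and the limit of 24 is taken from the guideline above.

```python
RECOMMENDED_LIMIT = 24  # from the guideline above (VideoCrafter2 default frames)


def check_context_budget(padding, topk, token_num_scale=False,
                         limit=RECOMMENDED_LIMIT):
    """Return a warning string if padding + topk breaks the guideline, else None.

    The guideline is waived when token_num_scale is True, since that is the
    suggested setting for larger context sizes.
    """
    total = padding + topk
    if total < limit or token_num_scale:
        return None
    return (
        f"padding + attn.topk = {total} is not below {limit}; "
        "consider setting attn.token_num_scale to True"
    )


print(check_context_budget(padding=8, topk=8))
print(check_context_budget(padding=16, topk=16))
print(check_context_budget(padding=16, topk=16, token_num_scale=True))
```

Only the second call returns a warning, because its context size exceeds the recommended budget while token_num_scale is left at its default.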

Citation

@article{
  tan2024videoinf,
  title={Video-Infinity: Distributed Long Video Generation},
  author={Zhenxiong Tan and Xingyi Yang and Songhua Liu and Xinchao Wang},
  journal={arXiv preprint arXiv:2406.16260},
  year={2024}
}

Acknowledgements

Our project is based on the VideoCrafter2 model. We would like to thank the authors for their excellent work! ❤️