VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE
January 18, 2025
Demo: side-by-side comparison of Ground Truth (GT) and Reconstructed samples.
Yazhou Xing*, Yang Fei*, Yingqing He*†, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen† (*equal contribution, †corresponding author) The Hong Kong University of Science and Technology
Project Page | Paper | High-Res Demo
VideoVAE+ is a state-of-the-art video Variational Autoencoder (VAE) designed for high-fidelity video reconstruction. It uses cross-modal (caption-guided) training and joint video-image training to improve reconstruction quality.
✨ Features
- High-Fidelity Reconstruction: Achieve superior image and video reconstruction quality.
- Cross-Modal Reconstruction: Utilize captions to guide the reconstruction process.
- State-of-the-Art Performance: Set new benchmarks in video reconstruction tasks.
📰 News
- [Jan 2025] 🏋️ Released training code and improved pretrained 4z-text weights
- [Dec 2024] 🚀 Released inference code and pretrained models
- [Dec 2024] 📝 Released paper on arXiv
- [Dec 2024] 💡 Project page is live at VideoVAE+
⏰ Todo
- [x] Release Pretrained Model Weights
- [x] Release Inference Code
- [x] Release Training Code
🚀 Get Started
Follow these steps to set up your environment and run the code:
1. Clone the Repository
git clone https://github.com/VideoVerses/VideoVAEPlus.git
cd VideoVAEPlus
2. Set Up the Environment
Create a Conda environment and install dependencies:
conda create --name vae python=3.10 -y
conda activate vae
pip install -r requirements.txt
📦 Pretrained Models
| Model Name | Latent Channels | Download Link |
|---|---|---|
| sota-4z | 4 | Download |
| sota-4z-text | 4 | Download |
| sota-16z | 16 | Download |
| sota-16z-text | 16 | Download |
- Note: '4z' and '16z' indicate the number of latent channels in the VAE model. Models with 'text' support text guidance.
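The naming convention in the table can be decoded programmatically. The following is a minimal sketch rather than code from the repository: the function name `parse_model_name` is hypothetical, and it assumes every model name follows the `sota-<N>z[-text]` pattern shown above.

```python
import re


def parse_model_name(name):
    """Return (latent_channels, supports_text) for a name like 'sota-16z-text'.

    Hypothetical helper for illustration; not part of VideoVAE+.
    """
    match = re.fullmatch(r"sota-(\d+)z(-text)?", name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    return int(match.group(1)), match.group(2) is not None


if __name__ == "__main__":
    for model in ["sota-4z", "sota-4z-text", "sota-16z", "sota-16z-text"]:
        channels, text = parse_model_name(model)
        print(f"{model}: {channels} latent channels, text guidance: {text}")
```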
📁 Data Preparation
To reconstruct videos and images using our VAE model, organize your data in the following structure:
Videos
Place your videos and optional captions in the examples/videos/gt directory.
Directory Structure:
examples/videos/
├── gt/
│   ├── video1.mp4
│   ├── video1.txt    # Optional caption
│   ├── video2.mp4
│   ├── video2.txt
│   └── ...
└── recon/
    └── (reconstructed videos will be saved here)
- Captions: For cross-modal reconstruction, include a .txt file with the same name as the video containing its caption.
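Before running inference, it can help to check which videos actually have a matching caption file. The sketch below is illustrative only: the helper name `find_caption_pairs` is hypothetical, and it assumes videos use the `.mp4` extension as in the example tree.

```python
from pathlib import Path


def find_caption_pairs(gt_dir):
    """Map each .mp4 in gt_dir to its caption text, or None if no .txt exists.

    Hypothetical helper for illustration; not part of VideoVAE+.
    """
    pairs = {}
    for video in sorted(Path(gt_dir).glob("*.mp4")):
        # The caption shares the video's base name: video1.mp4 -> video1.txt
        caption_file = video.with_suffix(".txt")
        if caption_file.is_file():
            pairs[video.name] = caption_file.read_text(encoding="utf-8").strip()
        else:
            pairs[video.name] = None
    return pairs


if __name__ == "__main__":
    for name, caption in find_caption_pairs("examples/videos/gt").items():
        print(f"{name}: {caption if caption is not None else '(no caption)'}")
```

Videos reported without a caption can still be reconstructed with a non-text config.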
Images
Place your images in the examples/images/gt directory.
Directory Structure:
examples/images/
├── gt/
│   ├── image1.jpg
│   ├── image2.png
│   └── ...
└── recon/
    └── (reconstructed images will be saved here)
- Note: The images dataset does not require captions.
🔧 Inference
Our video VAE supports both image and video reconstruction.
Please ensure that the ckpt_path in all your configuration files is set to the actual path of your checkpoint.
Video Reconstruction
Run video reconstruction using:
bash scripts/run_inference_video.sh
This is equivalent to:
python inference_video.py \
--data_root 'examples/videos/gt' \
--out_root 'examples/videos/recon' \
--config_path 'configs/inference/config_16z.yaml' \
--chunk_size 8 \
--resolution 720 1280
- If the chunk size is too large, you may encounter memory issues. In this case, reduce the chunk_size parameter. Ensure that chunk_size is divisible by 4.
- To enable cross-modal reconstruction using captions, set config_path to 'configs/config_16z_cap.yaml' for the 16-channel model with caption guidance.
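When lowering the chunk size to fit in memory, the divisible-by-4 constraint still applies. As a small sketch (the function name `round_chunk_size` is hypothetical and not provided by the repository), a desired value can be rounded down to the nearest valid one before being passed to `--chunk_size`:

```python
def round_chunk_size(desired, multiple=4):
    """Round `desired` down to the nearest positive multiple of `multiple`.

    Hypothetical helper for illustration; not part of VideoVAE+.
    """
    if desired < multiple:
        raise ValueError(f"chunk size must be at least {multiple}")
    return (desired // multiple) * multiple


if __name__ == "__main__":
    print(round_chunk_size(8))   # 8 is already valid
    print(round_chunk_size(10))  # rounds down to 8
    print(round_chunk_size(7))   # rounds down to 4
```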
Image Reconstruction
Run image reconstruction using:
bash scripts/run_inference_image.sh
This is equivalent to:
python inference_image.py \
--data_root 'examples/images/gt' \
--out_root 'examples/images/recon' \
--config_path 'configs/inference/config_16z.yaml' \
--batch_size 1
- Note: The batch size is set to 1 because the images in the example folder have varying resolutions. If all images in a batch share the same resolution, you can increase the batch size to accelerate inference.
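One way to take advantage of larger batches is to group images by resolution first and run each group separately. The sketch below only shows the grouping step; the function name `group_by_resolution` is hypothetical, and reading each image's width and height (for example with an image library) is assumed to happen beforehand.

```python
from collections import defaultdict


def group_by_resolution(sizes):
    """Group image paths by their (width, height).

    `sizes` maps path -> (width, height). Hypothetical helper for
    illustration; not part of VideoVAE+.
    """
    groups = defaultdict(list)
    for path, resolution in sizes.items():
        groups[tuple(resolution)].append(path)
    return dict(groups)


if __name__ == "__main__":
    example = {
        "image1.jpg": (640, 480),
        "image2.png": (1280, 720),
        "image3.jpg": (640, 480),
    }
    for resolution, paths in group_by_resolution(example).items():
        print(resolution, paths)
```

Each group could then be placed in its own input folder and reconstructed with a batch size greater than 1.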
🏋️ Training
Quick Start
To start training, use the following command:
bash scripts/run_training.sh config_16z
This default command trains the 16-channel model with video reconstruction on a single GPU.
Configuration Options
You can modify the training configuration by changing the config parameter:
- config_4z: 4-channel model
- config_4z_joint: 4-channel model trained jointly on both image and video data
- config_4z_cap: 4-channel model with text guidance
- config_16z: Default 16-channel model
- config_16z_joint: 16-channel model trained jointly on both image and video data
- config_16z_cap: 16-channel model with text guidance
Note: Do not include the .yaml extension when specifying the config.
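If training is launched from a Python script, the config name can be validated before calling the shell script. This is a minimal sketch under stated assumptions: `training_command` and `KNOWN_CONFIGS` are hypothetical names, and the set simply mirrors the options listed above.

```python
import subprocess

# Config names listed in this README (without the .yaml extension).
KNOWN_CONFIGS = {
    "config_4z", "config_4z_joint", "config_4z_cap",
    "config_16z", "config_16z_joint", "config_16z_cap",
}


def training_command(config):
    """Build the argument list for scripts/run_training.sh.

    Hypothetical helper for illustration; not part of VideoVAE+.
    """
    if config.endswith(".yaml"):
        raise ValueError("pass the config name without the .yaml extension")
    if config not in KNOWN_CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    return ["bash", "scripts/run_training.sh", config]


if __name__ == "__main__":
    # Run from the repository root so the relative script path resolves.
    subprocess.run(training_command("config_16z_cap"), check=True)
```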
Data Preparation
Dataset Structure
The training data should be organized in a CSV file with the following format:
path,text
/absolute/path/to/video1.mp4,A person walking on the beach
/absolute/path/to/video2.mp4,A car driving down the road
Requirements:
- Use absolute paths for video files
- Include two columns: path and text
- For training without text guidance, leave the text column empty but keep the two-column CSV structure (including the trailing comma)
Example CSV:
# With captions
/data/videos/clip1.mp4,A dog playing in the park
/data/videos/clip2.mp4,Sunset over the ocean
# Without captions
/data/videos/clip1.mp4,
/data/videos/clip2.mp4,
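A CSV in this layout can be generated with Python's standard `csv` module. The sketch below is illustrative: the function name `write_training_csv` is hypothetical, and it assumes the training loader accepts standard CSV quoting, which the `csv` module applies automatically to captions that contain commas.

```python
import csv
from pathlib import Path


def write_training_csv(csv_path, entries):
    """Write (video_path, caption) rows under a `path,text` header.

    Pass an empty caption ("") for training without text guidance.
    Hypothetical helper for illustration; not part of VideoVAE+.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "text"])
        for video_path, caption in entries:
            # resolve() converts relative paths to absolute ones.
            writer.writerow([str(Path(video_path).resolve()), caption])


if __name__ == "__main__":
    write_training_csv(
        "train.csv",
        [
            ("/data/videos/clip1.mp4", "A dog playing in the park"),
            ("/data/videos/clip2.mp4", ""),  # no caption
        ],
    )
```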
📊 Evaluation
Use the provided scripts to evaluate reconstruction quality using PSNR, SSIM, and LPIPS metrics.
Evaluate Image Reconstruction
bash scripts/evaluation_image.sh
Evaluate Video Reconstruction
bash scripts/evaluation_video.sh
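For reference, PSNR is derived from the mean squared error between the ground truth and the reconstruction. The sketch below shows the standard formula on flat sequences of pixel values; it is not the repository's evaluation code, the function name `psnr` is a placeholder, and the provided scripts may differ in details such as value range or per-frame averaging.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """PSNR in dB between two equal-length sequences of pixel values.

    Illustrative only; assumes pixel values in [0, max_value].
    """
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    gt = [10, 20, 30, 40]
    recon = [12, 18, 33, 40]
    print(f"PSNR: {psnr(gt, recon):.2f} dB")
```

Higher PSNR and SSIM indicate a closer reconstruction, while lower LPIPS indicates better perceptual similarity.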
📝 License
This project is released under the CC-BY-NC-ND license. Please follow its terms.