
# GenDeF: Learning Generative Deformation Field for Video Generation

Wen Wang<sup>1,2*</sup>&emsp;Kecheng Zheng<sup>2</sup>&emsp;Qiuyu Wang<sup>2</sup>&emsp;Hao Chen<sup>1†</sup>&emsp;Zifan Shi<sup>3,2*</sup>&emsp;Ceyuan Yang<sup>4</sup>
Yujun Shen<sup>2†</sup>&emsp;Chunhua Shen<sup>1</sup>

<sup>*</sup>Intern at Ant Group&emsp;<sup>†</sup>Corresponding Author

<sup>1</sup>Zhejiang University&emsp;<sup>2</sup>Ant Group&emsp;<sup>3</sup>HKUST&emsp;<sup>4</sup>Shanghai Artificial Intelligence Laboratory

[Paper (arXiv:2312.04561)](https://arxiv.org/abs/2312.04561) · Project Page

We offer a new perspective on approaching the task of video generation. Instead of directly synthesizing a sequence of frames, we propose to render a video by warping one static image with a generative deformation field (GenDeF). Such a pipeline enjoys three appealing advantages. First, we can sufficiently reuse a well-trained image generator to synthesize the static image (also called canonical image), alleviating the difficulty in producing a video and thereby resulting in better visual quality. Second, we can easily convert a deformation field to optical flows, making it possible to apply explicit structural regularizations for motion modeling, leading to temporally consistent results. Third, the disentanglement between content and motion allows users to process a synthesized video through processing its corresponding static image without any tuning, facilitating many applications like video editing, keypoint tracking, and video segmentation. Both qualitative and quantitative results on three common video generation benchmarks demonstrate the superiority of our GenDeF method.

## Getting Started

### Prerequisites

  • Python 3.8+
  • CUDA 11.3+
  • PyTorch 1.11+ and torchvision 0.12+

### Installation

```bash
# Clone the repository
git clone https://github.com/aim-uofa/GenDeF.git
cd GenDeF

# Install PyTorch (adjust for your CUDA version, see https://pytorch.org/)
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

# Install Python dependencies
pip install -r requirements.txt

# Install the project in editable mode
pip install -e .
```
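
To confirm that the pinned PyTorch build can see your GPU, a quick check like the following can help (a minimal sketch; adjust if you installed a different CUDA build):

```python
# Sanity check: verify PyTorch/torchvision versions and CUDA visibility.
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```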


## Dataset Preparation

We support training on the following video datasets:
- **YouTube Driving (YTB)**: YouTube driving videos at 256×256 resolution
- **SkyTimelapse**: Sky timelapse videos at 256×256 resolution
- **TaiChi-HD**: Tai Chi videos at 256×256 resolution

Organize the dataset as a zip archive and place it in the `data/` directory:

```
data/
├── ytb_256.zip      # YouTube Driving dataset
├── sky_256.zip      # SkyTimelapse dataset (optional)
└── taichi_256.zip   # TaiChi-HD dataset (optional)
```


Each zip file should contain video frames organized as described in the [StyleGAN-V](https://github.com/universome/stylegan-v) dataset format.
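
If your data is a folder of extracted frames, a small script along these lines can package it into the expected archive. This is only a sketch under the assumption that frames are stored as `<video_id>/<frame_idx>.jpg` (the StyleGAN-V convention); the paths and dataset name are placeholders.

```python
# Hypothetical packaging helper: zips per-video frame folders into data/taichi_256.zip.
# Assumes frames are laid out as <video_id>/<frame_idx>.jpg under frames_root.
import pathlib
import zipfile

frames_root = pathlib.Path("datasets/taichi_frames_256")  # placeholder path
out_zip = pathlib.Path("data/taichi_256.zip")
out_zip.parent.mkdir(parents=True, exist_ok=True)

with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_STORED) as zf:
    for frame in sorted(frames_root.rglob("*.jpg")):
        zf.write(frame, arcname=str(frame.relative_to(frames_root)))
```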


## Training

Training follows a **two-stage** pipeline. Below we use the **TaiChi-HD** dataset as an example.

### Stage 1: Pretrain (Image Generation)

In this stage, we train a 2D image generator backbone with deformable convolutions. The model learns to generate single frames (i.e., `num_frames_per_video=1`), building a strong image generation foundation.

```bash
bash scripts/train_taichi_stage1_pretrain.sh
```

**Key hyperparameters for Stage 1:**

| Parameter | Value | Description |
| --- | --- | --- |
| `sampling.num_frames_per_video` | 1 | Single-frame training |
| `model.generator.fmaps` | 0.5 | Generator feature map multiplier |
| `model.discriminator.fmaps` | 0.5 | Discriminator feature map multiplier |
| `model.generator.dcn` | true | Enable deformable convolution in generator |
| `model.discriminator.tsm` | false | Disable temporal shift module in discriminator |
| `model.loss_kwargs.r1_gamma` | 0.5 | R1 regularization weight |
| `model.generator.learnable_motion_mask` | false | Disable learnable motion mask |
| `model.generator.time_enc.min_period_len` | 16 | Minimum period length for time encoding |
| `training.aug` | ada | Adaptive augmentation |
| `training.batch_size` | 64 | Total batch size |
| `num_gpus` | 8 | Number of GPUs |
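
For context, `model.loss_kwargs.r1_gamma` weights the standard R1 gradient penalty on real samples (as in StyleGAN2). A minimal sketch of that term, not taken from this repository:

```python
# Minimal sketch of the R1 penalty weighted by model.loss_kwargs.r1_gamma
# (standard StyleGAN2-style gradient penalty on real images; illustrative only).
import torch

def r1_penalty(discriminator, real_images, gamma=0.5):
    real_images = real_images.detach().requires_grad_(True)
    scores = discriminator(real_images)  # per-sample realness scores
    grads, = torch.autograd.grad(scores.sum(), real_images, create_graph=True)
    per_sample = grads.pow(2).flatten(start_dim=1).sum(dim=1)
    return 0.5 * gamma * per_sample.mean()
```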

### Stage 2: Finetune (Video Generation with Deformation Field)

In this stage, we introduce the canonical image generation and deformation field prediction modules. The model learns to generate videos by warping a canonical image with a predicted deformation field. The pretrained checkpoint from Stage 1 is used as initialization.
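
Conceptually, frame synthesis in this stage amounts to resampling the canonical image at locations given by a per-frame deformation field. The sketch below illustrates such a warp with `torch.nn.functional.grid_sample`; tensor names, shapes, and the offset convention are assumptions for illustration, not the project's actual implementation.

```python
# Illustrative warp of a canonical image by a dense deformation field.
# Shapes and the normalized-offset convention are assumptions, not the repo's code.
import torch
import torch.nn.functional as F

B, C, H, W = 2, 3, 64, 64
canonical = torch.randn(B, C, H, W)        # generated canonical (static) image
offsets = 0.05 * torch.randn(B, H, W, 2)   # per-pixel offsets in normalized coords

# Identity sampling grid in [-1, 1], shape (B, H, W, 2) with (x, y) ordering.
ys, xs = torch.meshgrid(
    torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
identity = torch.stack([xs, ys], dim=-1).expand(B, -1, -1, -1)

# One video frame = canonical image resampled at the deformed grid locations.
frame = F.grid_sample(canonical, identity + offsets, align_corners=True)
print(frame.shape)  # torch.Size([2, 3, 64, 64])
```

To launch Stage 2 finetuning: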

```bash
bash scripts/train_taichi_stage2_finetune.sh
```

**Key hyperparameters for Stage 2:**

| Parameter | Value | Description |
| --- | --- | --- |
| `sampling.num_frames_per_video` | 3 | Multi-frame training |
| `model.generator.fmaps` | 0.5 | Generator feature map multiplier |
| `model.discriminator.fmaps` | 0.5 | Discriminator feature map multiplier |
| `model.discriminator.tsm` | true | Enable temporal shift module |
| `model.loss_kwargs.r1_gamma` | 8 | R1 regularization weight (increased) |
| `model.generator.with_canonical` | true | Enable canonical image generation |
| `model.generator.canonical_cond` | concat | Canonical conditioning method |
| `model.generator.canonical_cond_dim` | 64 | Canonical conditioning dimension |
| `model.generator.canonical_feat` | L13_256_64 | Feature level for canonical image |
| `model.generator.deform_dcn` | true | Enable DCN for deformation prediction |
| `model.generator.deform_dcn_min_res` | 4 | Min resolution for deform DCN |
| `model.generator.deform_dcn_max_res` | 64 | Max resolution for deform DCN |
| `model.generator.deform_dcn_torgb` | true | Enable DCN for toRGB layers |
| `training.resume` | Stage 1 ckpt | Resume from Stage 1 pretrained model |

### Key Differences between Stage 1 and Stage 2

| Aspect | Stage 1 (Pretrain) | Stage 2 (Finetune) |
| --- | --- | --- |
| Frames per video | 1 (image-only) | 3 (video) |
| Temporal modeling | Disabled (`tsm=false`) | Enabled (`tsm=true`) |
| Canonical image | Not used | Enabled |
| Deformation field | Not used | Enabled with DCN |
| R1 gamma | 0.5 | 8.0 |
| Learnable motion mask | false | true |

## Generation (Sampling)

After training, generate videos using:

```bash
bash scripts/generate_videos.sh
```

You can customize the generation by editing the script or passing arguments directly:

```bash
python src/scripts/generate_ours.py \
    --network_pkl output/taichi_finetune/output/best.pkl \
    --num_videos 100 \
    --save_as_mp4 true \
    --fps 25 \
    --video_len 128 \
    --batch_size 25 \
    --outdir sample/taichi \
    --truncation_psi 0.9 \
    --seed 42
```

| Argument | Description |
| --- | --- |
| `--network_pkl` | Path to the trained model checkpoint (.pkl) |
| `--num_videos` | Number of videos to generate |
| `--video_len` | Number of frames per video |
| `--fps` | Frames per second for the saved mp4 |
| `--truncation_psi` | Truncation (lower = higher quality, less diversity) |
| `--save_as_mp4` | Save outputs as mp4 video files |
| `--seed` | Random seed for reproducibility |
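
Since `--truncation_psi` trades diversity for fidelity, it can be worth sweeping a few values. A small driver like the following (hypothetical, using only the arguments documented above) runs the sampler once per setting:

```python
# Hypothetical sweep over truncation values, reusing the CLI arguments shown above.
import subprocess

for psi in (0.7, 0.9, 1.0):
    subprocess.run(
        [
            "python", "src/scripts/generate_ours.py",
            "--network_pkl", "output/taichi_finetune/output/best.pkl",
            "--num_videos", "16",
            "--save_as_mp4", "true",
            "--fps", "25",
            "--video_len", "128",
            "--batch_size", "16",
            "--outdir", f"sample/taichi_psi_{psi}",
            "--truncation_psi", str(psi),
            "--seed", "42",
        ],
        check=True,
    )
```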

## Main Results

## Applications

### Video Editing

### Point Tracking

### Video Segmentation

### Diverse Motion Generation

## Acknowledgements

This codebase is built on top of StyleGAN-V. We thank the authors for their excellent work.

## Citing

If you find our work useful, please consider citing:

```bibtex
@misc{wang2023gendef,
    title={GenDeF: Learning Generative Deformation Field for Video Generation},
    author={Wen Wang and Kecheng Zheng and Qiuyu Wang and Hao Chen and Zifan Shi and Ceyuan Yang and Yujun Shen and Chunhua Shen},
    year={2023},
    eprint={2312.04561},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```