
# GenDeF: Learning Generative Deformation Field for Video Generation

Wen Wang<sup>1,2*</sup>&emsp;Kecheng Zheng<sup>2</sup>&emsp;Qiuyu Wang<sup>2</sup>&emsp;Hao Chen<sup>1†</sup>&emsp;Zifan Shi<sup>3,2*</sup>&emsp;Ceyuan Yang<sup>4</sup>
Yujun Shen<sup>2†</sup>&emsp;Chunhua Shen<sup>1</sup>

<sup>*</sup>Intern at Ant Group&emsp;<sup>†</sup>Corresponding Author

<sup>1</sup>Zhejiang University&emsp;<sup>2</sup>Ant Group&emsp;<sup>3</sup>HKUST&emsp;<sup>4</sup>Shanghai Artificial Intelligence Laboratory

[Paper (arXiv:2312.04561)](https://arxiv.org/abs/2312.04561) · Project Page

We offer a new perspective on approaching the task of video generation. Instead of directly synthesizing a sequence of frames, we propose to render a video by warping one static image with a generative deformation field (GenDeF). Such a pipeline enjoys three appealing advantages. First, we can sufficiently reuse a well-trained image generator to synthesize the static image (also called canonical image), alleviating the difficulty in producing a video and thereby resulting in better visual quality. Second, we can easily convert a deformation field to optical flows, making it possible to apply explicit structural regularizations for motion modeling, leading to temporally consistent results. Third, the disentanglement between content and motion allows users to process a synthesized video through processing its corresponding static image without any tuning, facilitating many applications like video editing, keypoint tracking, and video segmentation. Both qualitative and quantitative results on three common video generation benchmarks demonstrate the superiority of our GenDeF method.

## Getting Started

### Prerequisites

  • Python 3.8+
  • CUDA 11.3+
  • PyTorch 1.11+ and torchvision 0.12+

### Installation

```bash
# Clone the repository
git clone https://github.com/aim-uofa/GenDeF.git
cd GenDeF

# Install PyTorch (adjust for your CUDA version, see https://pytorch.org/)
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

# Install Python dependencies
pip install -r requirements.txt

# Install the project in editable mode
pip install -e .
```
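
To confirm that the pinned PyTorch build can see your GPU, a quick check like the following can help (a minimal sketch; adjust if you installed a different CUDA build):

```python
# Sanity check: verify PyTorch/torchvision versions and CUDA visibility.
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```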


## Dataset Preparation

We support training on the following video datasets:
- **YouTube Driving (YTB)**: YouTube driving videos at 256×256 resolution
- **SkyTimelapse**: Sky timelapse videos at 256×256 resolution
- **TaiChi-HD**: Tai Chi videos at 256×256 resolution

Organize the dataset as a zip archive and place it in the `data/` directory:

```
data/
├── ytb_256.zip      # YouTube Driving dataset
├── sky_256.zip      # SkyTimelapse dataset (optional)
└── taichi_256.zip   # TaiChi-HD dataset (optional)
```


Each zip file should contain video frames organized as described in the [StyleGAN-V](https://github.com/universome/stylegan-v) dataset format.
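
If your data is a folder of extracted frames, a small script along these lines can package it into the expected archive. This is only a sketch under the assumption that frames are stored as `<video_id>/<frame_idx>.jpg` (the StyleGAN-V convention); the paths and dataset name are placeholders.

```python
# Hypothetical packaging helper: zips per-video frame folders into data/taichi_256.zip.
# Assumes frames are laid out as <video_id>/<frame_idx>.jpg under frames_root.
import pathlib
import zipfile

frames_root = pathlib.Path("datasets/taichi_frames_256")  # placeholder path
out_zip = pathlib.Path("data/taichi_256.zip")
out_zip.parent.mkdir(parents=True, exist_ok=True)

with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_STORED) as zf:
    for frame in sorted(frames_root.rglob("*.jpg")):
        zf.write(frame, arcname=str(frame.relative_to(frames_root)))
```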


## Training

Training follows a **two-stage** pipeline. Below we use the **TaiChi-HD** dataset as an example.

### Stage 1: Pretrain (Image Generation)

In this stage, we train a 2D image generator backbone with deformable convolutions. The model learns to generate single frames (i.e., `num_frames_per_video=1`), building a strong image generation foundation.

```bash
bash scripts/train_taichi_stage1_pretrain.sh
```

**Key hyperparameters for Stage 1:**

| Parameter | Value | Description |
| --- | --- | --- |
| `sampling.num_frames_per_video` | 1 | Single-frame training |
| `model.generator.fmaps` | 0.5 | Generator feature map multiplier |
| `model.discriminator.fmaps` | 0.5 | Discriminator feature map multiplier |
| `model.generator.dcn` | true | Enable deformable convolution in generator |
| `model.discriminator.tsm` | false | Disable temporal shift module in discriminator |
| `model.loss_kwargs.r1_gamma` | 0.5 | R1 regularization weight |
| `model.generator.learnable_motion_mask` | false | Disable learnable motion mask |
| `model.generator.time_enc.min_period_len` | 16 | Minimum period length for time encoding |
| `training.aug` | ada | Adaptive augmentation |
| `training.batch_size` | 64 | Total batch size |
| `num_gpus` | 8 | Number of GPUs |
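
For context, `model.loss_kwargs.r1_gamma` weights the standard R1 gradient penalty on real samples (as in StyleGAN2). A minimal sketch of that term, not taken from this repository:

```python
# Minimal sketch of the R1 penalty weighted by model.loss_kwargs.r1_gamma
# (standard StyleGAN2-style gradient penalty on real images; illustrative only).
import torch

def r1_penalty(discriminator, real_images, gamma=0.5):
    real_images = real_images.detach().requires_grad_(True)
    scores = discriminator(real_images)  # per-sample realness scores
    grads, = torch.autograd.grad(scores.sum(), real_images, create_graph=True)
    per_sample = grads.pow(2).flatten(start_dim=1).sum(dim=1)
    return 0.5 * gamma * per_sample.mean()
```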

### Stage 2: Finetune (Video Generation with Deformation Field)

In this stage, we introduce the canonical image generation and deformation field prediction modules. The model learns to generate videos by warping a canonical image with a predicted deformation field. The pretrained checkpoint from Stage 1 is used as initialization.
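
Conceptually, frame synthesis in this stage amounts to resampling the canonical image at locations given by a per-frame deformation field. The sketch below illustrates such a warp with `torch.nn.functional.grid_sample`; tensor names, shapes, and the offset convention are assumptions for illustration, not the project's actual implementation.

```python
# Illustrative warp of a canonical image by a dense deformation field.
# Shapes and the normalized-offset convention are assumptions, not the repo's code.
import torch
import torch.nn.functional as F

B, C, H, W = 2, 3, 64, 64
canonical = torch.randn(B, C, H, W)        # generated canonical (static) image
offsets = 0.05 * torch.randn(B, H, W, 2)   # per-pixel offsets in normalized coords

# Identity sampling grid in [-1, 1], shape (B, H, W, 2) with (x, y) ordering.
ys, xs = torch.meshgrid(
    torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
identity = torch.stack([xs, ys], dim=-1).expand(B, -1, -1, -1)

# One video frame = canonical image resampled at the deformed grid locations.
frame = F.grid_sample(canonical, identity + offsets, align_corners=True)
print(frame.shape)  # torch.Size([2, 3, 64, 64])
```

To launch Stage 2 finetuning: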

```bash
bash scripts/train_taichi_stage2_finetune.sh
```

**Key hyperparameters for Stage 2:**

| Parameter | Value | Description |
| --- | --- | --- |
| `sampling.num_frames_per_video` | 3 | Multi-frame training |
| `model.generator.fmaps` | 0.5 | Generator feature map multiplier |
| `model.discriminator.fmaps` | 0.5 | Discriminator feature map multiplier |
| `model.discriminator.tsm` | true | Enable temporal shift module |
| `model.loss_kwargs.r1_gamma` | 8 | R1 regularization weight (increased) |
| `model.generator.with_canonical` | true | Enable canonical image generation |
| `model.generator.canonical_cond` | concat | Canonical conditioning method |
| `model.generator.canonical_cond_dim` | 64 | Canonical conditioning dimension |
| `model.generator.canonical_feat` | L13_256_64 | Feature level for canonical image |
| `model.generator.deform_dcn` | true | Enable DCN for deformation prediction |
| `model.generator.deform_dcn_min_res` | 4 | Min resolution for deform DCN |
| `model.generator.deform_dcn_max_res` | 64 | Max resolution for deform DCN |
| `model.generator.deform_dcn_torgb` | true | Enable DCN for toRGB layers |
| `training.resume` | Stage 1 ckpt | Resume from Stage 1 pretrained model |

### Key Differences between Stage 1 and Stage 2

| Aspect | Stage 1 (Pretrain) | Stage 2 (Finetune) |
| --- | --- | --- |
| Frames per video | 1 (image-only) | 3 (video) |
| Temporal modeling | Disabled (`tsm=false`) | Enabled (`tsm=true`) |
| Canonical image | Not used | Enabled |
| Deformation field | Not used | Enabled with DCN |
| R1 gamma | 0.5 | 8.0 |
| Learnable motion mask | false | true |

## Generation (Sampling)

After training, generate videos using:

```bash
bash scripts/generate_videos.sh
```

You can customize the generation by editing the script or passing arguments directly:

```bash
python src/scripts/generate_ours.py \
    --network_pkl output/taichi_finetune/output/best.pkl \
    --num_videos 100 \
    --save_as_mp4 true \
    --fps 25 \
    --video_len 128 \
    --batch_size 25 \
    --outdir sample/taichi \
    --truncation_psi 0.9 \
    --seed 42
```

| Argument | Description |
| --- | --- |
| `--network_pkl` | Path to the trained model checkpoint (.pkl) |
| `--num_videos` | Number of videos to generate |
| `--video_len` | Number of frames per video |
| `--fps` | Frames per second for the saved mp4 |
| `--truncation_psi` | Truncation (lower = higher quality, less diversity) |
| `--save_as_mp4` | Save outputs as mp4 video files |
| `--seed` | Random seed for reproducibility |
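
Since `--truncation_psi` trades diversity for fidelity, it can be worth sweeping a few values. A small driver like the following (hypothetical, using only the arguments documented above) runs the sampler once per setting:

```python
# Hypothetical sweep over truncation values, reusing the CLI arguments shown above.
import subprocess

for psi in (0.7, 0.9, 1.0):
    subprocess.run(
        [
            "python", "src/scripts/generate_ours.py",
            "--network_pkl", "output/taichi_finetune/output/best.pkl",
            "--num_videos", "16",
            "--save_as_mp4", "true",
            "--fps", "25",
            "--video_len", "128",
            "--batch_size", "16",
            "--outdir", f"sample/taichi_psi_{psi}",
            "--truncation_psi", str(psi),
            "--seed", "42",
        ],
        check=True,
    )
```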

## Main Results

## Applications

### Video Editing

### Point Tracking

### Video Segmentation

### Diverse Motion Generation

## Acknowledgements

This codebase is built on top of StyleGAN-V. We thank the authors for their excellent work.

## Citing

If you find our work useful, please consider citing:

```bibtex
@misc{wang2023gendef,
    title={GenDeF: Learning Generative Deformation Field for Video Generation},
    author={Wen Wang and Kecheng Zheng and Qiuyu Wang and Hao Chen and Zifan Shi and Ceyuan Yang and Yujun Shen and Chunhua Shen},
    year={2023},
    eprint={2312.04561},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```