FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis (SVD-based FloVD)

May 2, 2025



[Project Page] [arXiv]

FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis
Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho
POSTECH, Microsoft Research Asia

News

  • Our paper has been accepted to CVPR 2025!
  • We release CogVideoX-based FloVD. Check this out! FloVD-CogVideoX

FloVD-CogVideoX-5B

Abstract

We present FloVD, a novel video diffusion model for camera-controllable video generation. FloVD leverages optical flow to represent the motions of the camera and moving objects. This approach offers two key benefits. Since optical flow can be directly estimated from videos, our approach allows for the use of arbitrary training videos without ground-truth camera parameters. Moreover, as background optical flow encodes 3D correlation across different viewpoints, our method enables detailed camera control by leveraging the background motion. To synthesize natural object motion while supporting detailed camera control, our framework adopts a two-stage video synthesis pipeline consisting of optical flow generation and flow-conditioned video synthesis. Extensive experiments demonstrate the superiority of our method over previous approaches in terms of accurate camera control and natural object motion synthesis.

TODO

  • Release SVD-based FloVD codes
  • Release evaluation benchmark dataset for object motion synthesis quality (SVD backbone)
  • Release CogVideoX-based FloVD codes
  • Release evaluation benchmark dataset for object motion synthesis quality (CogVideoX backbone)

Preparation

  • Environment (Python==3.10; CUDA==12.1; torch==2.4.1)
conda create -n flovd python=3.10.6 -y
conda activate flovd
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
  • Build Grounded_SAM2 (Segmentation model)
bash build_grounded_sam2.sh
  • Checkpoints
    Download the FloVD checkpoints below
    [FVSM_EDM] [FVSM_Quadratic] [OMSM]
    In addition, we use the pre-trained video diffusion model (SVD), an off-the-shelf depth estimation model (Depth Anything V2, metric depth), and a segmentation model (Grounded SAM 2, an open-vocabulary segmentation method). For these models, please refer to the links below.
    [SVD] [Depth_anything_v2_metric] [Grounded_SAM2]
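
After installing, it can help to confirm that the environment matches the versions listed in the Environment bullet above. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and the script only reports mismatches rather than enforcing anything.

```python
import importlib.util
import sys

# Versions taken from the Environment bullet above.
EXPECTED_PYTHON = (3, 10)
EXPECTED_TORCH = "2.4.1"
EXPECTED_CUDA = "12.1"


def check_environment():
    """Return a dict describing how the current environment compares to the README."""
    report = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "python_ok": sys.version_info[:2] == EXPECTED_PYTHON,
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch

        # torch.__version__ may carry a local suffix such as "+cu121".
        report["torch"] = torch.__version__
        report["torch_ok"] = torch.__version__.split("+")[0] == EXPECTED_TORCH
        # torch.version.cuda is None for CPU-only builds.
        report["cuda_build"] = torch.version.cuda
        report["cuda_build_ok"] = torch.version.cuda == EXPECTED_CUDA
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it inside the activated `flovd` environment. Any `*_ok` entry that prints `False` points at a version that differs from the one listed above.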

Inference

  • Preparation
    Before sampling, set the paths (configuration, checkpoint) in the bash script. Then prepare the data for inference. Only one frame per scene is needed as the input image.
# File tree
./[data_root]/
├── frames
│   ├── [scene_name]
│   │   ├── 00.png
│   │   ├── ... (not_necessary)
  • Sample video frames
    FloVD synthesizes 14-frame videos.
bash scripts/inference_FloVD.sh
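
The input layout from the Preparation step can be created with a few lines of Python. This is a minimal sketch under the assumption that each scene folder under `frames` holds its single input frame as `00.png`, as the file tree suggests. The helper name `prepare_scene` is hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path


def prepare_scene(data_root, scene_name, image_path):
    """Copy one PNG into [data_root]/frames/[scene_name]/00.png and return the new path."""
    image_path = Path(image_path)
    if image_path.suffix.lower() != ".png":
        # This sketch only copies bytes; convert other formats to PNG beforehand.
        raise ValueError(f"expected a .png file, got {image_path.name}")
    scene_dir = Path(data_root) / "frames" / scene_name
    scene_dir.mkdir(parents=True, exist_ok=True)
    target = scene_dir / "00.png"
    shutil.copyfile(image_path, target)
    return target
```

For example, `prepare_scene("./data", "my_scene", "photo.png")` would place the image at `./data/frames/my_scene/00.png`, after which `./data` can be used as `[data_root]` in the bash script.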

Tips

  • The provided inference code saves depth-warped images computed from the input camera parameters. You can use these warped images to preview the camera control results. If the translation vector in the camera parameters is too large, adjust the 'speed' term in the inference code.
  • For better camera controllability, try the FVSM-Quadratic model. For better video synthesis quality, we recommend using the FVSM-EDM model.
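
To build intuition for why an overly large translation makes the depth-warped preview drift too far, the sketch below reprojects a single pixel with a standard pinhole camera model. It is an illustration only and not the repository's code: the function `warp_pixel`, the pose convention (rotation and translation map points from the first camera frame to the second), and the idea that `speed` simply multiplies the translation are all assumptions made for this example.

```python
def warp_pixel(u, v, depth, intrinsics, rotation, translation, speed=1.0):
    """Reproject pixel (u, v) with metric depth into a second camera view.

    intrinsics: (fx, fy, cx, cy); rotation: 3x3 nested list; translation: 3 floats.
    Returns the warped pixel (u2, v2).
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the first camera's frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the relative pose; speed scales only the translation (assumption).
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3)) + speed * translation[i]
        for i in range(3)
    ]
    if moved[2] <= 0:
        raise ValueError("point falls behind the second camera")
    # Project back to pixel coordinates.
    return fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (100.0, 100.0, 64.0, 64.0)
for speed in (1.0, 0.5):
    u2, v2 = warp_pixel(64.0, 64.0, 1.0, intrinsics, identity, (0.1, 0.0, 0.0), speed)
    print(f"speed={speed}: horizontal shift = {u2 - 64.0:.1f} px")
```

In this toy setup, halving `speed` halves the pixel displacement for a point at fixed depth. Nearby points (small depth) move more than distant ones, which is why a translation that looks reasonable for a distant scene can be too large for a close-up.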

Training FloVD

  • Training Dataset
# Prepare your own dataset
# File tree
# metadata.json lists the path to each video
./[data_root]/
├── metadata
│   ├── metadata.json
├── video
│   ├── xxxxx.mp4
│   ├── ...
  • Preparation
    Before training, set the paths (SVD backbone, dataset, Depth_anything_v2, Grounded_SAM2) in the configuration files.

  • FVSM

bash scripts/train_FVSM.sh
  • OMSM
bash scripts/train_OMSM.sh
bash scripts/train_OMSM_Curated.sh
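
For the Training Dataset step, `metadata.json` can be generated from the contents of the `video` folder. The sketch below is an assumption-laden example: the exact schema expected by the training code is not specified here, so it writes a flat JSON list of paths relative to `[data_root]`. Check the dataset loader in the repository and adapt the format if it expects something else. The helper name `write_metadata` is hypothetical.

```python
import json
from pathlib import Path


def write_metadata(data_root):
    """Write [data_root]/metadata/metadata.json listing every .mp4 under [data_root]/video."""
    data_root = Path(data_root)
    videos = sorted((data_root / "video").glob("*.mp4"))
    # ASSUMPTION: a flat JSON list of paths relative to data_root.
    entries = [video.relative_to(data_root).as_posix() for video in videos]
    metadata_dir = data_root / "metadata"
    metadata_dir.mkdir(parents=True, exist_ok=True)
    target = metadata_dir / "metadata.json"
    target.write_text(json.dumps(entries, indent=2))
    return target
```

Calling `write_metadata("./data")` would produce `./data/metadata/metadata.json` with entries such as `"video/xxxxx.mp4"`.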

Evaluation

  • For the evaluation of the object motion synthesis quality, use the benchmark datasets below.
  • We provide two benchmark datasets, one for SVD and another for CogVideoX.
    Motion_eval_benchmark_SVD and Motion_eval_benchmark_CogVideoX contain video clips of 14 and 49 frames, respectively. The clips in Motion_eval_benchmark_CogVideoX are at 16 fps.
    [Motion_eval_benchmark_SVD] [Motion_eval_benchmark_CogVideoX]
  • For a detailed description of the evaluation protocol, please refer to Sec. 5.2 of the main paper.
  • If you use the benchmark datasets for object motion synthesis quality, please cite our paper.

Others

  • We borrow heavily from the [CameraCtrl] codebase. We thank the authors for their contributions.
@article{jin2025flovd,
         title={FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis},
         author={Jin, Wonjoon and Dai, Qi and Luo, Chong and Baek, Seung-Hwan and Cho, Sunghyun},
         journal={arXiv preprint arXiv:2502.08244},
         year={2025}
}