VideoDirector: Precise Video Editing via Text-to-Video Models (CVPR2025)

November 25, 2025

Yukun Wang, Longguang Wang, Zhiyuan Ma, Qibin Hu, Kai Xu, Yulan Guo

[Project Page]

Edited results

Input Video | Edited Results

Abstract

VideoDirector harnesses the powerful temporal generation capability of text-to-video (T2V) models for precise video editing. It produces high-quality results in terms of accuracy, fidelity, motion smoothness, and realism. For more, see the project webpage.

🔧 Installation (Python 3.11.3 recommended)

Setup repository and conda environment

git clone https://github.com/Yukun66/Video_Director.git 
cd Video_Director

conda env create -f environment.yaml
conda activate videodirector

💡 Pretrained Model Preparations

Download Stable Diffusion V1.5

Download the Stable Diffusion V1.5 weights and place them at:

models/StableDiffusion/stable-diffusion-v1-5

Prepare Community Models

Manually download the community .safetensors model from RealisticVision and place the checkpoint at:

models/DreamBooth_LoRA/realisticVisionV60B1_v51VAE.safetensors

Prepare AnimateDiff Motion Modules

Manually download the AnimateDiff modules from AnimateDiff. Save the modules to:

models/Motion_Module
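Before moving on, it can help to confirm that the three model locations above exist. The following is a minimal sketch, not part of the repository: the helper name `find_missing_models` and the constant `EXPECTED_MODEL_PATHS` are hypothetical, and the script assumes it is run from the repository root.

```python
from pathlib import Path

# Paths as listed in this README, relative to the repository root.
EXPECTED_MODEL_PATHS = [
    "models/StableDiffusion/stable-diffusion-v1-5",
    "models/DreamBooth_LoRA/realisticVisionV60B1_v51VAE.safetensors",
    "models/Motion_Module",
]


def find_missing_models(repo_root="."):
    """Return the expected model paths that do not exist under repo_root.

    Hypothetical helper for illustration; it only checks that each path
    exists, not that the downloaded weights are complete or valid.
    """
    root = Path(repo_root)
    return [p for p in EXPECTED_MODEL_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = find_missing_models()
    if missing:
        print("Missing model files or directories:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected model paths are present.")
```

The check is deliberately shallow: it reports absent paths but does not inspect which motion module files are inside models/Motion_Module.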

📌 Preprocess

Mask prediction

We utilize the SAM2 model (https://github.com/facebookresearch/sam2) to generate masks for our method.

Run the script SAM2_model/checkpoints/download_ckpts.sh to download the SAM2 weights:

cd SAM2_model/checkpoints
bash download_ckpts.sh
cd ../..

The SAM2 model is located in the SAM2_model directory and requires installation before use:

cd SAM2_model
pip install -e ".[demo]"
cd ..

We provide a usage example that generates the mask for resources/bear.mp4 in SAM2_model/notebooks/video_predictor_example.ipynb.

🚗 Editing video

Run our method:

bash run_editing.sh

Config details

Our editing config file is in editing_config_yaml/bear_editing_config.yaml. The config parameters are detailed below.

Prompts

inversion_prompt: original video description prompt. Example:

 "A brown bear, walking on rocky terrain, next to a stone wall."

new_prompt: target video description prompt. Example:

"A tiger, walking on rocky terrain, next to a stone wall."

p2p_eq_params_words: the words newly inserted in the new prompt. Example:

- tiger
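To double-check which words differ between the two prompts, a small script can list the words that appear in new_prompt but not in inversion_prompt. This is only an illustrative sketch: the function `inserted_words` is hypothetical and not part of the repository, and it uses a simple lowercase word split rather than the tokenizer the model uses.

```python
import re


def inserted_words(inversion_prompt, new_prompt):
    """Return words in new_prompt that are absent from inversion_prompt.

    Hypothetical helper: words are compared case-insensitively, returned in
    order of first appearance, and punctuation is ignored.
    """
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    source_words = set(tokenize(inversion_prompt))
    result = []
    for word in tokenize(new_prompt):
        if word not in source_words and word not in result:
            result.append(word)
    return result


inversion_prompt = "A brown bear, walking on rocky terrain, next to a stone wall."
new_prompt = "A tiger, walking on rocky terrain, next to a stone wall."
print(inserted_words(inversion_prompt, new_prompt))  # ['tiger']
```

For the bear example, the result matches the p2p_eq_params_words entry shown above.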

STDG_guide

Coefficient of STDG guidance. Example:

STDG_guide:
  - 0.5
  - 0.5
  - 0.0
  - 0.5

p2p_self_replace_steps

Corresponds to $\tau_s$ in Sec. 3.3 of the paper. Example:

p2p_self_replace_steps: 0.4

p2p_cross_replace_steps

Corresponds to $\tau_c$ in Sec. 3.3 of the paper. Example:

p2p_cross_replace_steps: 0.8
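As a rough intuition for the two values above, the sketch below converts them into step counts. It rests on two assumptions that this README does not confirm: that each value is a fraction of the total number of denoising steps (the convention used by Prompt-to-Prompt), and that sampling uses 50 steps. The function name `replace_step_counts` is hypothetical.

```python
def replace_step_counts(num_inference_steps, self_replace_steps, cross_replace_steps):
    """Convert fractional replace settings into numbers of denoising steps.

    Assumption (not confirmed by this README): each value is the fraction of
    the sampling schedule during which attention replacement is active.
    """
    for name, value in (
        ("p2p_self_replace_steps", self_replace_steps),
        ("p2p_cross_replace_steps", cross_replace_steps),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return {
        "self": round(num_inference_steps * self_replace_steps),
        "cross": round(num_inference_steps * cross_replace_steps),
    }


# 50 sampling steps is an assumed value chosen for illustration.
print(replace_step_counts(50, 0.4, 0.8))  # {'self': 20, 'cross': 40}
```

Under these assumptions, the example settings would keep self-attention replacement active for 20 of 50 steps and cross-attention replacement for 40 of 50 steps.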

📝 Citation

If you find this work useful, please consider citing:

@inproceedings{wang2025videodirector,
  title={Videodirector: Precise video editing via text-to-video models},
  author={Wang, Yukun and Wang, Longguang and Ma, Zhiyuan and Hu, Qibin and Xu, Kai and Guo, Yulan},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  year={2025}
}