
April 3, 2026

InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction (ICCV25)

Yuhui Wu¹,², Liyi Chen¹, Ruibin Li¹,², Shihao Wang¹, Chenxi Xie¹,², Lei Zhang¹,²*

(*Corresponding Author)

¹The Hong Kong Polytechnic University, ²OPPO Research Institute

▶️ Watch our demo video on YouTube. It is smoother than the preview below and includes more editing results. We suggest adjusting the playback resolution for better visual quality.

https://www.youtube.com/watch?v=z4t3RkqZ4no

https://github.com/user-attachments/assets/846f1fc3-3200-4e26-b4a5-2124cedee571

Abstract


Instruction-based video editing allows effective and interactive editing of videos using only instructions without extra inputs such as masks or attributes. However, collecting high-quality training triplets (source video, edited video, instruction) is a challenging task. Existing datasets mostly consist of low-resolution, short duration, and limited amount of source videos with unsatisfactory editing quality, limiting the performance of trained editing models. In this work, we present a high-quality Instruction-based Video Editing dataset with 1M triplets, namely InsViE-1M. We first curate high-resolution and high-quality source videos and images, then design an effective editing-filtering pipeline to construct high-quality editing triplets for model training. For a source video, we generate multiple edited samples of its first frame with different intensities of classifier-free guidance, which are automatically filtered by GPT-4o with carefully crafted guidelines. The edited first frame is propagated to subsequent frames to produce the edited video, followed by another round of filtering for frame quality and motion evaluation. We also generate and filter a variety of video editing triplets from high-quality images. With the InsViE-1M dataset, we propose a multi-stage learning strategy to train our InsViE model, progressively enhancing its instruction following and editing ability. Extensive experiments demonstrate the advantages of our InsViE-1M dataset and the trained model over state-of-the-art works.

Updates

[3/26/2025] The paper is available on arXiv.

TODO

Release the pretrained model.
Update the code for inference.
Release the InsViE-1M dataset.
Update the code for training.

Usage

Installation

Clone the repository and install the dependencies:

git clone https://github.com/langmanbusi/InsViE.git
cd InsViE

# follow the instructions of the original CogVideoX repo
cd CogVideo
pip install -r requirements.txt
cd sat
pip install -r requirements.txt
# use the provided environment.yml
conda env create -f environment.yml

Inference

First, download the weights of the T5 and VAE models by following the CogVideoX instructions.

Then download the InsViE weights. The folder structure is the same as in the original CogVideo:

.
├── train_edit
    ├── 1000 (or 1)
    │   └── mp_rank_00_model_states.pt
    └── latest
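
Before running inference, it may help to check that the checkpoint directory matches the layout above. The following is a minimal sketch and is not part of the repository: the function name `check_checkpoint_dir` is hypothetical, and the code assumes that `latest` is a small text file holding the name of the iteration folder (for example `1000` or `1`), as is common for SAT-style checkpoints.

```python
from pathlib import Path


def check_checkpoint_dir(ckpt_dir):
    """Return a list of problems found in a train_edit-style checkpoint folder."""
    ckpt_dir = Path(ckpt_dir)
    problems = []
    latest = ckpt_dir / "latest"
    if not latest.is_file():
        problems.append(f"missing tracker file: {latest}")
        return problems
    # Assumption: `latest` stores the iteration folder name as plain text.
    iteration = latest.read_text().strip()
    weights = ckpt_dir / iteration / "mp_rank_00_model_states.pt"
    if not weights.is_file():
        problems.append(f"missing weights file: {weights}")
    return problems
```

Calling `check_checkpoint_dir("/xxx/ckpts_2b_lora/train_edit")` returns an empty list when both files are where the sketch expects them.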

You should also modify the configs in InsViE/CogVideo/sat/config, such as the paths to the pretrained models (refer to link).

args:
  image2video: False # True for image2video, False for text2video
  latent_channels: 16
  mode: inference
  load: "/xxx/ckpts_2b_lora/train_edit" # for the full model without a LoRA adapter
  batch_size: 1
  input_type: txt # txt for pure text input, or cli for command-line input
  input_file: /xxx/mytest.csv # a test CSV in which each row stores a video file name and an instruction
  test_folder: mytest # the folder containing the videos listed in input_file (mytest.csv)
  sampling_image_size: [480, 720] # [480, 720]
  sampling_num_frames: 13  # Must be 13, 11 or 9
  sampling_fps: 7
  fp16: True # For CogVideoX-2B
  # bf16: True # For CogVideoX-5B and CogVideoX-5B-I2V
  output_dir: /xxx/ # set the folder of the outputs
  force_inference: True
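
The `input_file` comment above says that each CSV row stores a video file name and an instruction. A minimal Python sketch for writing such a file is given below. The column order (file name first, instruction second), the absence of a header row, and the example file names and instructions are all assumptions, so check them against the data-loading code in the repository before relying on this.

```python
import csv


def write_test_csv(csv_path, rows):
    """Write (video_file_name, instruction) pairs, one pair per CSV row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for video_name, instruction in rows:
            # Assumed layout: file name first, instruction second, no header.
            writer.writerow([video_name, instruction])


# Hypothetical examples; the videos themselves would live in `test_folder`.
example_rows = [
    ("cat.mp4", "Turn the cat into a tiger."),
    ("street.mp4", "Make it snow."),
]
```

For instance, `write_test_csv("/xxx/mytest.csv", example_rows)` would produce a two-row file; the `csv` module quotes any instruction that contains a comma.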

Then run the script.

cd InsViE/CogVideo/sat
bash inference.sh

Training

This project provides scripts and configs for edit fine-tuning (train_video_edit.py + configs/cogvideox_2b_lora_edit.yaml + configs/sft_edit_resume_nv.yaml).

Prepare pretrained weights:
- CogVideoX-2b-sat/transformer (or LoRA checkpoint directory)
- CogVideoX-2b-sat/vae/3d-vae.pt
- t5-v1_1-xxl (T5 encoder model)

Update configuration paths:
- In configs/cogvideox_2b_lora_edit.yaml:
  - set conditioner_config.params.emb_models[0].params.model_dir to your T5 path
  - set first_stage_config.params.ckpt_path to your 3D VAE weight path
- In configs/sft_edit_resume_nv.yaml:
  - set args.load to transformer weights
  - set args.save to output checkpoint directory
  - set data.params.train_video_root / data.params.test_video_root and data.path to dataset paths
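
The dotted paths above, such as `conditioner_config.params.emb_models[0].params.model_dir`, address nested keys and list items in the YAML files. As an illustration only, the sketch below sets such a value on a config that has already been parsed into plain Python dicts and lists (for example with a YAML library of your choice). The helper name `set_by_path` is hypothetical and not part of the repository.

```python
import re

# One path segment: a key name followed by zero or more [index] suffixes.
_SEGMENT = re.compile(r"^([^\[\]]+)((?:\[\d+\])*)$")


def set_by_path(config, dotted_path, value):
    """Set the entry addressed by a path like 'a.b[0].c' to value, in place."""
    steps = []
    for segment in dotted_path.split("."):
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"cannot parse path segment: {segment!r}")
        steps.append(match.group(1))
        # Turn a suffix such as '[0][1]' into the integer indices 0 and 1.
        steps.extend(int(i) for i in re.findall(r"\[(\d+)\]", match.group(2)))
    node = config
    for step in steps[:-1]:
        node = node[step]  # raises KeyError/IndexError if the path is wrong
    node[steps[-1]] = value
```

With a parsed config `cfg`, `set_by_path(cfg, "first_stage_config.params.ckpt_path", "/xxx/3d-vae.pt")` would update the VAE path. If the keys sit under a parent key in the actual file, that key has to be prepended to the path.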
Single-GPU training example:

cd InsViE/CogVideo/sat
bash finetune_single_gpu.sh

Multi-GPU training example:

cd InsViE/CogVideo/sat
bash finetune_multi_gpus.sh
# or NV environment
bash finetune_multi_gpus_nv.sh

Alternatively, run the training script directly:

cd InsViE/CogVideo/sat
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=8 train_video_edit.py --base configs/cogvideox_2b_lora_edit.yaml configs/sft_edit_resume_nv.yaml --seed $RANDOM

Verify training results:
- the args.save directory contains train_edit_* checkpoint folders
- training_config.yaml is saved during the first iteration for reproducibility
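
The two checks above can be scripted. The sketch below is an illustration under stated assumptions: the function name `summarize_training_output` is hypothetical, and because the location of `training_config.yaml` is not specified here, the code searches the whole `args.save` directory for it.

```python
from pathlib import Path


def summarize_training_output(save_dir):
    """Report train_edit_* checkpoint folders and any training_config.yaml found."""
    save_dir = Path(save_dir)
    checkpoints = sorted(
        p.name for p in save_dir.glob("train_edit_*") if p.is_dir()
    )
    # Assumption: the exact location of training_config.yaml is unknown,
    # so search recursively under the save directory.
    configs = sorted(
        str(p.relative_to(save_dir)) for p in save_dir.rglob("training_config.yaml")
    )
    return {"checkpoints": checkpoints, "training_configs": configs}
```

An empty `checkpoints` list after training has run for a while would suggest that `args.save` points somewhere other than intended.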

InsViE-1M Dataset

Citation

If you find this work helpful, please consider citing:

@article{wu2025insvie,
  title={InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction},
  author={Wu, Yuhui and Chen, Liyi and Li, Ruibin and Wang, Shihao and Xie, Chenxi and Zhang, Lei},
  journal={arXiv preprint arXiv:2503.20287},
  year={2025}
}