April 3, 2026

InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction (ICCV25)

Yuhui Wu1,2, Liyi Chen1, Ruibin Li1,2, Shihao Wang1, Chenxi Xie1,2, Lei Zhang1,2*

(*Corresponding Author)

1The Hong Kong Polytechnic University, 2OPPO Research Institute

▶️ Watch our demo video on YouTube. The YouTube version is smoother and includes more editing results. We suggest adjusting the playback resolution for better visual quality.

https://www.youtube.com/watch?v=z4t3RkqZ4no

https://github.com/user-attachments/assets/846f1fc3-3200-4e26-b4a5-2124cedee571

Abstract

Instruction-based video editing allows effective and interactive editing of videos using only instructions without extra inputs such as masks or attributes. However, collecting high-quality training triplets (source video, edited video, instruction) is a challenging task. Existing datasets mostly consist of low-resolution, short duration, and limited amount of source videos with unsatisfactory editing quality, limiting the performance of trained editing models. In this work, we present a high-quality Instruction-based Video Editing dataset with 1M triplets, namely InsViE-1M. We first curate high-resolution and high-quality source videos and images, then design an effective editing-filtering pipeline to construct high-quality editing triplets for model training. For a source video, we generate multiple edited samples of its first frame with different intensities of classifier-free guidance, which are automatically filtered by GPT-4o with carefully crafted guidelines. The edited first frame is propagated to subsequent frames to produce the edited video, followed by another round of filtering for frame quality and motion evaluation. We also generate and filter a variety of video editing triplets from high-quality images. With the InsViE-1M dataset, we propose a multi-stage learning strategy to train our InsViE model, progressively enhancing its instruction following and editing ability. Extensive experiments demonstrate the advantages of our InsViE-1M dataset and the trained model over state-of-the-art works.

Updates

  • [3/26/2025] The paper is available on arXiv.

TODO

  • Release the pretrained model.
  • Update the code for inference.
  • Release the InsViE-1M dataset.
  • Update the code for training.

Usage

Installation

Clone the repository and install the dependencies:

git clone https://github.com/langmanbusi/InsViE.git
cd InsViE

# follow the instructions of the original CogVideoX repo
cd CogVideo
pip install -r requirements.txt
cd sat
pip install -r requirements.txt
# use the given environment.yml
conda env create -f environment.yml

Inference

First, download the weights of the T5 and VAE models by following the CogVideoX instructions.

Then download the weights of our InsViE model. The folder structure is the same as in the original CogVideo:

.
├── train_edit
    ├── 1000 (or 1)
    │   └── mp_rank_00_model_states.pt
    └── latest 
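
Before running inference, it can help to confirm that the downloaded folder matches the layout above. The following is a minimal sketch, not part of the repository: `find_checkpoint` is a hypothetical helper, and it assumes that `latest` is a plain-text file containing the name of the iteration sub-folder (for example `1000` or `1`), as is common for DeepSpeed-style checkpoints. If your download differs, adjust accordingly.

```python
from pathlib import Path


def find_checkpoint(load_dir: str) -> Path:
    """Return the model-states file that the `latest` marker points to.

    Assumption (not confirmed by the README): `latest` is a text file
    holding the iteration folder name, e.g. "1000" or "1".
    """
    root = Path(load_dir)
    latest = root / "latest"
    if not latest.is_file():
        raise FileNotFoundError(f"missing marker file: {latest}")

    # The marker names the sub-folder that holds the weights.
    tag = latest.read_text().strip()
    ckpt = root / tag / "mp_rank_00_model_states.pt"
    if not ckpt.is_file():
        raise FileNotFoundError(f"missing checkpoint: {ckpt}")
    return ckpt
```

Calling `find_checkpoint` on the directory you later set as `load` in the config either returns the path to `mp_rank_00_model_states.pt` or raises an error naming the missing file.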

You should also modify the configs in InsViE/CogVideo/sat/config, such as the paths to the pretrained models; refer to link.

args:
  image2video: False # True for image2video, False for text2video
  latent_channels: 16
  mode: inference
  load: "/xxx/ckpts_2b_lora/train_edit" # This is for Full model without lora adapter
  batch_size: 1
  input_type: txt # You can choose txt for pure text input, or change to cli for command line input 
  input_file: /xxx/mytest.csv # prepare a test csv, which stores the video file names and instructions in each row
  test_folder: mytest # the folder contains the videos corresponding to the input_file (mytest.csv)
  sampling_image_size: [480, 720] # [480, 720]
  sampling_num_frames: 13  # Must be 13, 11 or 9
  sampling_fps: 7
  fp16: True # For CogVideoX-2B
  # bf16: True # For CogVideoX-5B and CogVideoX-5B-I2V
  output_dir: /xxx/ # set the folder of the outputs
  force_inference: True
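
The config comments say that `input_file` is a CSV storing the video file names and instructions in each row, and that `test_folder` contains the corresponding videos. The sketch below shows one way to prepare and sanity-check such a file. It is only an illustration: the helper names are hypothetical, and the column order (file name first, instruction second) and the absence of a header row are assumptions that the README does not confirm.

```python
import csv
from pathlib import Path

# Values taken from the comment on sampling_num_frames in the config.
ALLOWED_NUM_FRAMES = {13, 11, 9}


def is_valid_num_frames(n: int) -> bool:
    """Check a candidate sampling_num_frames value against the config comment."""
    return n in ALLOWED_NUM_FRAMES


def write_test_csv(csv_path, rows):
    """Write (video_file_name, instruction) pairs, one pair per row.

    Assumption: two columns, no header row.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for name, instruction in rows:
            writer.writerow([name, instruction])


def missing_videos(csv_path, test_folder):
    """Return file names listed in the CSV that are absent from test_folder."""
    folder = Path(test_folder)
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [
            row[0]
            for row in csv.reader(f)
            if row and not (folder / row[0]).is_file()
        ]
```

Using `csv.writer` rather than manual string joining keeps instructions that contain commas intact, since such fields are quoted automatically.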

Then run the script.

cd InsViE/CogVideo/sat
bash inference.sh

Training

This project provides scripts and configs for edit fine-tuning (train_video_edit.py + configs/cogvideox_2b_lora_edit.yaml + configs/sft_edit_resume_nv.yaml).

  1. Prepare pretrained weights:

    • CogVideoX-2b-sat/transformer (or LoRA checkpoint directory)
    • CogVideoX-2b-sat/vae/3d-vae.pt
    • t5-v1_1-xxl (T5 encoder model)
  2. Update configuration paths:

    • In configs/cogvideox_2b_lora_edit.yaml:
      • set conditioner_config.params.emb_models[0].params.model_dir to your T5 path
      • set first_stage_config.params.ckpt_path to your 3D VAE weight path
    • In configs/sft_edit_resume_nv.yaml:
      • set args.load to transformer weights
      • set args.save to output checkpoint directory
      • set data.params.train_video_root / data.params.test_video_root and data.path to dataset paths
  3. Single-GPU training example:

cd InsViE/CogVideo/sat
bash finetune_single_gpu.sh
  4. Multi-GPU training example:
cd InsViE/CogVideo/sat
bash finetune_multi_gpus.sh
# or NV environment
bash finetune_multi_gpus_nv.sh
  5. Alternatively run directly:
cd InsViE/CogVideo/sat
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=8 train_video_edit.py --base configs/cogvideox_2b_lora_edit.yaml configs/sft_edit_resume_nv.yaml --seed $RANDOM
  6. Verify training results:
    • args.save directory contains train_edit_* checkpoint folders
    • training_config.yaml is saved during first iteration for reproducibility

InsViE-1M Dataset

Dataset

Citation

If you find this work helpful, please consider citing:

@article{wu2025insvie,
  title={InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction},
  author={Wu, Yuhui and Chen, Liyi and Li, Ruibin and Wang, Shihao and Xie, Chenxi and Zhang, Lei},
  journal={arXiv preprint arXiv:2503.20287},
  year={2025}
}