ICVE: In-Context Learning with Unpaired Clips for Instruction-based Video Editing

June 2, 2026


Xinyao Liao1,2, Xianfang Zeng2, Ziye Song1, Zhoujie Fu1,2, Gang Yu2*, Guosheng Lin1*

1 Nanyang Technological University   2 StepFun

Star us if you find this project useful! ⭐

A new model built on the Wan2.1-T2V-14B foundation model is coming soon.

🎉 Updates

  • [12/2025] 🔥 Added multi-GPU sequence-parallel inference.
  • [12/2025] ⚠️ Fixed the GPU OOM bug on 480P inference.
  • [10/2025] 🔥 Model checkpoints are released!
  • [10/2025] 🔥 Codebase is released!

🧩 Overview

ICVE proposes a low-cost pretraining strategy for instruction-based video editing via in-context learning from unpaired clips. Built upon HunyuanVideo T2V, it first learns editing concepts from about 1M unpaired videos, then fine-tunes on fewer than 150K paired editing samples to improve instruction alignment and visual quality. The resulting model supports general editing operations guided by natural language.

🎥 Video Demo

ICVE Demo Video
Click the image above to watch the full video on YouTube 🎬

🛠️ Dependencies and Installation

Begin by cloning the repository:

git clone https://github.com/leoisufa/ICVE.git
cd ICVE

We recommend CUDA versions 12.4 or 11.8 for the manual installation.

# 1. Create conda environment
conda create -n icve python==3.10.9

# 2. Activate the environment
conda activate icve

# 3. Install PyTorch and other dependencies using conda
# For CUDA 11.8
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia
# For CUDA 12.4
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# 4. Install pip dependencies
python -m pip install -r requirements.txt

# 5. (Optional) Install flash-attention. With it installed, `--attn-mode auto`
#    uses flash; otherwise it falls back to the PyTorch SDPA backend.
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
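
After installation, it can help to confirm which versions actually landed in the environment. The following is a minimal sketch using only the Python standard library; the helper names (installed_version, report) are hypothetical and not part of this repository, and the distribution name "flash-attn" is an assumption about how the optional package registers itself.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(
    packages: Sequence[str] = ("torch", "torchvision", "torchaudio", "flash-attn"),
) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    # Expected pins from the steps above: torch 2.4.0, torchvision 0.19.0,
    # torchaudio 2.4.0. flash-attn is optional.
    for name, version in report().items():
        print(f"{name:12s} {version or 'not installed'}")
```

A missing flash-attn entry is not an error, since that dependency is optional.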

🧱 Download Models

  1. HunyuanVideo Pretrained Weights
    Follow the official HunyuanVideo instructions here:
    👉 Download Pretrained Models
    and place the downloaded weights into the ckpts/ directory as shown below.
  2. ICVE Checkpoint
    Download our model weights from
    👉 Hugging Face
    and place them in the checkpoint/ directory.

The folder structure of this project should look like this after setup:

ICVE/
├── assets/
├── checkpoint/ # Our model checkpoint
│   ├── config.json
│   └── diffusion_pytorch_model.safetensors
├── ckpts/  # Pretrained weights from HunyuanVideo
│   ├── hunyuan-video-t2v-720p
│   ├── text_encoder
│   └── text_encoder_2
├── hyvideo/
├── scripts/  # glasses.sh / kid.sh (single-GPU), glasses_multigpu.sh (multi-GPU SP)
├── requirements.txt
├── sample_video.py
└── README.md
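
To catch a misplaced download before launching inference, the layout above can be checked with a short script. This is only a sketch: the path list is copied from the tree above, and the helper missing_paths is a hypothetical name, not something shipped in the repository.

```python
from pathlib import Path

# Paths taken from the folder structure above, relative to the repo root.
EXPECTED = [
    "checkpoint/config.json",
    "checkpoint/diffusion_pytorch_model.safetensors",
    "ckpts/hunyuan-video-t2v-720p",
    "ckpts/text_encoder",
    "ckpts/text_encoder_2",
    "sample_video.py",
]


def missing_paths(root=".", expected=EXPECTED):
    """Return the expected paths that do not exist under root."""
    base = Path(root)
    return [p for p in expected if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:")
        for p in missing:
            print(f"  {p}")
    else:
        print("All expected files and directories are present.")
```

Run it from the ICVE/ directory; it only tests for existence and does not validate the contents of the weights.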

🚀 Running the Demos

OOM Bug Fix: We fixed an OOM issue that occurred because the VAE did not call enable_tiling() when encoding the original video.

ICVE supports two inference modes that share the same entrypoint (sample_video.py):

  • Single-GPU: fits on an 80GB GPU without offload; add --use-cpu-offload for smaller cards (slower).
  • Multi-GPU sequence parallel (SP): shards the latent across GPUs for a near-linear speedup, launched with torchrun.

By default the attention backend is auto (--attn-mode auto): it uses flash-attn if installed, otherwise falls back to the dependency-free PyTorch SDPA backend. Force a specific backend with --attn-mode torch or --attn-mode flash.
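
The fallback described above can be pictured with the following sketch. It is an illustration of the stated behavior, not the repository's actual code: resolve_attn_mode is a hypothetical helper, the module name flash_attn is assumed to be what the flash-attention package installs, and raising an error when flash is forced but unavailable is also an assumption.

```python
import importlib.util


def resolve_attn_mode(requested: str = "auto") -> str:
    """Resolve an --attn-mode value to a concrete backend name.

    Hypothetical helper that mirrors the documented behavior:
    "auto" picks flash when flash_attn is importable, else torch (SDPA).
    """
    if requested not in ("auto", "flash", "torch"):
        raise ValueError(f"unknown attention mode: {requested!r}")

    # find_spec checks importability without actually importing the module.
    flash_available = importlib.util.find_spec("flash_attn") is not None

    if requested == "auto":
        return "flash" if flash_available else "torch"
    if requested == "flash" and not flash_available:
        # Assumed behavior: fail early rather than silently fall back.
        raise RuntimeError("--attn-mode flash requested but flash_attn is not installed")
    return requested


if __name__ == "__main__":
    print(resolve_attn_mode("auto"))
```

Passing --attn-mode torch explicitly is useful when comparing outputs across machines where flash-attn may or may not be installed.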

Ready-to-run demos live under scripts/:

# Single GPU
bash scripts/glasses.sh
bash scripts/kid.sh

# Multi-GPU sequence parallel (e.g. 8 GPUs)
bash scripts/glasses_multigpu.sh 8

Single-GPU command

python sample_video.py \
    --dit-weight checkpoint/diffusion_pytorch_model.safetensors \
    --video-size 768 480 \
    --video-length 61 \
    --infer-steps 50 \
    --prompt "Add black glasses to the person's face." \
    --video "assets/glasses.mp4" \
    --seed 42 \
    --embedded-cfg-scale 1.0 \
    --cfg-scale 6.0 \
    --flow-shift 7.0 \
    --flow-reverse \
    --attn-mode auto \
    --save-path ./results

Multi-GPU sequence-parallel command

torchrun --nproc_per_node=8 --master_port=29501 sample_video.py \
    --dit-weight checkpoint/diffusion_pytorch_model.safetensors \
    --video-size 768 480 \
    --video-length 61 \
    --infer-steps 50 \
    --prompt "Add black glasses to the person's face." \
    --video "assets/glasses.mp4" \
    --seed 42 \
    --embedded-cfg-scale 1.0 \
    --cfg-scale 6.0 \
    --flow-shift 7.0 \
    --flow-reverse \
    --attn-mode auto \
    --save-path ./results_multigpu

SP constraint: the latent temporal length must be divisible by the number of GPUs, where latent_T = (video_length - 1) / 4 + 1. For example, --video-length 61 gives latent_T = 16, which is divisible by 2, 4, and 8.
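
The SP constraint is easy to check before launching torchrun. The sketch below applies the formula exactly as stated above; the function names are hypothetical, and it assumes video_length - 1 is a multiple of 4 (as with 61), so integer division is used.

```python
from typing import List


def latent_length(video_length: int) -> int:
    """latent_T = (video_length - 1) / 4 + 1, per the SP constraint above.

    Assumes video_length - 1 is a multiple of 4, so floor division is exact.
    """
    return (video_length - 1) // 4 + 1


def valid_gpu_counts(video_length: int, max_gpus: int = 8) -> List[int]:
    """GPU counts up to max_gpus that evenly divide the latent temporal length."""
    latent_t = latent_length(video_length)
    return [n for n in range(1, max_gpus + 1) if latent_t % n == 0]


if __name__ == "__main__":
    print(latent_length(61))      # 16
    print(valid_gpu_counts(61))   # [1, 2, 4, 8]
    print(valid_gpu_counts(129))  # [1, 3], since latent_T = 33
```

Any value returned by valid_gpu_counts is a candidate for --nproc_per_node with the given --video-length.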

🙏 Acknowledgements

We thank the following prior art for their excellent open source work:

🔗 BibTeX

If you find ICVE useful for your research and applications, please cite using this BibTeX:

@article{liao2025context,
  title={In-Context Learning with Unpaired Clips for Instruction-based Video Editing},
  author={Liao, Xinyao and Zeng, Xianfang and Song, Ziye and Fu, Zhoujie and Yu, Gang and Lin, Guosheng},
  journal={arXiv preprint arXiv:2510.14648},
  year={2025}
}