June 11, 2026

CVPR 2026 🏆 Highlight
VideoCoF: Unified Video Editing with Temporal Reasoner

👁️ See → 🧠 Reason → ✏️ Edit

🚀 A Chain of Frames video editing method that enables temporal reasoning and 16× video length extrapolation with just 50k training pairs!

Xiangpeng Yang¹, Ji Xie², Yiyuan Yang¹, Yan Huang¹, Min Xu¹, Qiang Wu¹
¹University of Technology Sydney, ²Zhejiang University

💿 Introduction

https://github.com/user-attachments/assets/26f7d347-3d6c-43cf-9645-6eb5906f6ad6

🎬 Revisiting Video Editing Length Extrapolation

Although VideoCoF is trained only on 33-frame videos, it generalizes remarkably well to much longer ones, achieving 16× single-shot editing (512 frames) and 14× multi-shot editing (415 frames). Check out the videos below and try VideoCoF on your own long videos!

Multi-shot editing (14×, 415 frames; source video credit: StoryMem):

https://github.com/user-attachments/assets/22101cf3-d267-4bc4-834d-a4fb194efe92

Single-shot editing (16×, 512 frames):

https://github.com/user-attachments/assets/21fc04a5-c6e3-4864-abb5-070c5f44ee88

🔥 News

2026.05.23: 🎉 We released the VideoCoF training code.
2026.04.09: 🏆 VideoCoF was honored as a CVPR 2026 Highlight.
2026.02.22: 🎉 VideoCoF was accepted to CVPR 2026!
2026.01.02: 📚 We released the full VideoCoF-50k training dataset! Check it out at Hugging Face Datasets.
2025.12.13: 🚀 We released a 4-step fast inference script (10s per video on H100) and launched the Hugging Face demo! Please try it at Hugging Face Spaces.
2025.12.09: 📄 The paper is available on arXiv.
2025.12.08: 🎉 We released the inference code and the videocof-50k weights.
2025.12.06: 🔥 Project Page and README updated!

📑 Table of Contents

🎬 Revisiting Video Editing Length Extrapolation
🔧 Quick Start
🗂 Repository Layout
🏋️ Training
🚀 Inference
🏆 Model Zoo
🍭 Results
🚧 TODO
🙏 Acknowledgments
📜 License
📮 Contact
📄 Citation

🔧 Quick Start

Online Demo: Try out our enhanced VideoCoF demo on Hugging Face Spaces!

Clone the repository:

```bash
git clone https://github.com/videocof/VideoCoF.git
cd VideoCoF
```

Install dependencies:

```bash
# 1. Create and activate a conda environment
conda create -n videocof python=3.10
conda activate videocof

# 2. Install PyTorch (choose the version compatible with your CUDA)
# For standard GPUs (CUDA 12.1):
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# For Hopper GPUs (e.g., H100/H800) requiring fast inference:
# pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# 3. Install other dependencies
pip install -r requirements.txt
```
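
After installing, a quick sanity check can confirm that the interpreter and the PyTorch build match what the steps above expect. The snippet below is a minimal sketch and is not part of the repository: the function name `check_environment` and the report keys are hypothetical. It relies only on the standard library plus `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Return a small report about the Python and PyTorch setup.

    Hypothetical helper for illustration; it is not shipped with VideoCoF.
    """
    report = {
        # The conda environment above is created with python=3.10.
        "python_version": ".".join(str(p) for p in sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        # find_spec avoids an ImportError when torch is not installed yet.
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        # CUDA version the wheel was built against (None for CPU-only builds).
        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as `False` on a GPU machine, the installed wheel most likely does not match the local CUDA driver, and step 2 above is the place to revisit.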

Note on Flash Attention: We recommend using FlashAttention-3 (currently beta) for optimal performance, especially on NVIDIA H100/H800 GPUs. If you are using these GPUs, please follow the official FlashAttention-3 installation guide after installing the compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).

Download Models:

Wan2.1-T2V-14B Pretrained Weights:

```bash
git lfs install
git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B

# Or using the Hugging Face CLI (hf):
# hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B
```

VideoCoF Checkpoint & Acceleration LoRA:

```bash
git lfs install
git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight

# Or using the Hugging Face CLI (hf):
# hf download XiangpengYang/VideoCoF --local-dir videocof_weight

# Download Acceleration LoRA (FusionX)
wget -P videocof_weight https://huggingface.co/MonsterMMORPG/Wan_GGUF/resolve/main/Wan2.1_Text_to_Video_14B_FusionX_LoRA.safetensors
```
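
Before running inference, it can help to confirm that the downloads landed where the commands above put them. The following is a small sketch under that assumption: the relative paths are copied from the commands above, while the helper name `missing_assets` and the constant `EXPECTED_ASSETS` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Relative paths produced by the download commands above.
EXPECTED_ASSETS = (
    "Wan2.1-T2V-14B",
    "videocof_weight",
    "videocof_weight/Wan2.1_Text_to_Video_14B_FusionX_LoRA.safetensors",
)


def missing_assets(root="."):
    """List the expected model paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_ASSETS if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_assets()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected model files are in place.")
```

Run it from the directory where you executed the download commands. It checks only that the paths exist, not that the files are complete.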

🗂 Repository Layout

The repo has two core code paths: training (scripts/wan2.1) and inference (fast_infer.py + scripts).

```
VideoCoF/
├─ fast_infer.py                 # Core 4-step inference engine
├─ examples/
│  ├─ app.py                     # Gradio quick demo
│  └─ wan2.1/                    # JSON-based CoT inference helpers
├─ scripts/
│  ├─ local_style.sh
│  ├─ obj_add.sh
│  ├─ obj_rem.sh
│  ├─ obj_swap.sh
│  ├─ parallel_infer.sh          # Fast 4-GPU batch entry point for fast_infer.py
│  ├─ evaluation/                # Evaluation helper scripts
│  ├─ test/                      # Legacy command snippets (main test now uses wan2.1/test_cot_lora.sh)
│  └─ wan2.1/                    # All maintained training/inference entry scripts
├─ videox_fun/                   # Core models, pipeline, and utils used by Wan2.1 training/inference
└─ config/                       # DeepSpeed and model configs
```

🏋️ Training

We keep the training code in scripts/wan2.1 with two main entry paths:

14B VideoCoF CoT LoRA (recommended for full reproduction):

```bash
export MODEL_NAME=/path/to/Wan2.1-T2V-14B
export DATASET_NAME=/path/to/VideoCoF-50k
export DATASET_META_NAME=/path/to/VideoCoF-50k/train.json
export OUTPUT_DIR=experiments/videocof_wan2.1_14b_lora

bash scripts/wan2.1/train_joint_img_cot_video_lora.sh
```

1.3B training (lightweight entry point):

```bash
export MODEL_NAME=/path/to/Wan2.1-T2V-1.3B
export DATASET_NAME=/path/to/VideoCoF-50k
export DATASET_META_NAME=/path/to/VideoCoF-50k/train.json
export OUTPUT_DIR=experiments/videocof_wan2.1_1.3b_lora

# train_1.3b.sh wraps train_lora.py with `accelerate launch`
bash scripts/wan2.1/train_1.3b.sh

# Or use train_joint_img_video_lora.sh for joint image-video dataset mode
# (also defaults to 1.3B)
bash scripts/wan2.1/train_joint_img_video_lora.sh
```
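
Because the training scripts read their configuration from the environment variables exported above, a short preflight check can catch a typo in a path before a long job starts. This is a minimal sketch, not a repository utility: the helper `check_training_env` is hypothetical, and it assumes that `OUTPUT_DIR` may not exist yet because training creates it.

```python
import os
from pathlib import Path

# Variables exported before launching the training scripts above.
REQUIRED_VARS = ("MODEL_NAME", "DATASET_NAME", "DATASET_META_NAME", "OUTPUT_DIR")
# OUTPUT_DIR is excluded here on the assumption that training creates it.
MUST_EXIST = ("MODEL_NAME", "DATASET_NAME", "DATASET_META_NAME")


def check_training_env(env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif name in MUST_EXIST and not Path(value).exists():
            problems.append(f"{name} points to a missing path: {value}")
    return problems


if __name__ == "__main__":
    issues = check_training_env()
    print("\n".join(issues) if issues else "Training environment looks complete.")
```

The check deliberately stops at path existence; the expected dataset format is described in scripts/wan2.1/README_TRAIN_VIDEOCOF.md.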

Validate during and after training:

Validation runs inside the training scripts (train_lora.py and train_joint_img_video_lora.py), not as a standalone command.

Evaluate ultra-long videos with 4-step DMD inference:

```bash
bash scripts/wan2.1/test_cot_lora.sh
```

You can try 513-frame video editing online with 4-step inference. The demo fits in under 60 GB of GPU memory, so it runs without out-of-memory (OOM) errors.

See scripts/wan2.1/README_TRAIN_VIDEOCOF.md for dataset format and training notes.

🚀 Inference

🚀 We provide fast 4-step inference (the default, ~10s per video on an H100) using the DMD LoRA.

For single inference tasks:

```bash
# Object Removal
sh scripts/obj_rem.sh

# Object Addition
sh scripts/obj_add.sh

# Object Swap
sh scripts/obj_swap.sh

# Local Style Transfer
sh scripts/local_style.sh
```

For parallel inference:

```bash
sh scripts/parallel_infer.sh
```
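
If you drive these scripts from Python (for example, in a batch job), a thin dispatcher keeps the task-to-script mapping in one place. The sketch below is illustrative only: the script paths come from the commands above, but the task labels and the `build_command` / `run_task` helpers are hypothetical names, not part of the repository.

```python
import subprocess

# Script paths are taken from the commands above; the task keys are
# hypothetical labels chosen for this sketch.
TASK_SCRIPTS = {
    "object_removal": "scripts/obj_rem.sh",
    "object_addition": "scripts/obj_add.sh",
    "object_swap": "scripts/obj_swap.sh",
    "local_style": "scripts/local_style.sh",
    "parallel": "scripts/parallel_infer.sh",
}


def build_command(task):
    """Map a task label to the shell command that runs its script."""
    try:
        script = TASK_SCRIPTS[task]
    except KeyError:
        known = ", ".join(sorted(TASK_SCRIPTS))
        raise ValueError(f"Unknown task {task!r}; expected one of: {known}") from None
    return ["sh", script]


def run_task(task, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    command = build_command(task)
    print(" ".join(command))
    if not dry_run:
        # Must be run from the repository root so the relative path resolves.
        subprocess.run(command, check=True)
    return command
```

Calling `run_task("object_removal")` only prints the command; pass `dry_run=False` from the repository root to actually launch the script.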

Note:

scripts/test only keeps historical snippets; the maintained single- and multi-task inference flows use the files listed above and those under scripts/wan2.1.

Gradio Demo

Launch the Gradio interface for interactive testing:

```bash
# Ensure Wan2.1-T2V-14B (model_name), videocof_weight, and the DMD LoRA are in the current directory or properly referenced
python examples/app.py
```

The demo supports fast inference (~10s per video) online.

🏆 Model Zoo

Our models are available on Hugging Face:

| Model Name | Description | Link |
| --- | --- | --- |
| VideoCoF-Base | Base model trained on 50k video pairs | Hugging Face |

🍭 Results

Why Do We Need Reasoning Before Editing?

Current video editing methods typically follow two paths:

Expert models: Rely on external masks for precision but sacrifice unification.
Unified in-context learning models: Mask-free but often struggle with spatial accuracy due to the lack of explicit cues.

VideoCoF bridges this gap by predicting reasoning tokens before generating the target video tokens.
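
The ordering described above (source first, reasoning next, edited result last) can be pictured with a toy layout. This is only an illustrative sketch of the See → Reason → Edit sequence, not the authors' implementation: the `Segment` type, the function names, and the lengths in the usage lines are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    role: str    # "source", "reasoning", or "target"
    length: int  # size of the segment; the unit is left abstract here


# Order taken from the text: see the source, reason, then edit.
ROLE_ORDER = ("source", "reasoning", "target")


def chain_of_frames_layout(length_per_role):
    """Arrange segments in See -> Reason -> Edit order."""
    return [Segment(role, length_per_role[role]) for role in ROLE_ORDER]


def visible_context(layout, role):
    """Return the segments that come before `role` in the chain."""
    roles = [segment.role for segment in layout]
    return layout[: roles.index(role)]


# Placeholder lengths, chosen only to make the example concrete.
layout = chain_of_frames_layout({"source": 33, "reasoning": 1, "target": 33})
print([s.role for s in visible_context(layout, "target")])
# prints ['source', 'reasoning']
```

The point of the toy is the ordering: by the time the target segment is produced, both the source and the reasoning segment are already available as context.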

Key Capabilities

Seeing, Reasoning, Editing: VideoCoF adopts a "seeing, reasoning, editing" approach, ensuring edits are applied accurately to the intended targets.
Length Extrapolation: Trained on only 50k video pairs of 33 frames each, VideoCoF demonstrates robust multi-shot editing and length generalization (e.g., 4× length extrapolation).
Diverse Editing Tasks: Supports fine-grained (instance- and part-level, spatially aware) Object Removal, Object Addition, Object Swap, and Local Style Transfer.

Gallery Highlights

Please refer to our Project Page for the full gallery.

Object Removal: Remove people or objects based on text prompts.
Object Addition: Add elements like animals, objects, or people.
Object Swap: Change specific attributes or objects.
Local Style Transfer: Modify textures, materials, or colors.

🚧 TODO

Release paper.
Release inference code and weights.
Release the Hugging Face demo (about 10 s of inference per video online); try it on Hugging Face Spaces.
Release videocof-50k training data.
Release training code.

📄 Citation

```bibtex
@article{yang2025videocof,
  title={Unified Video Editing with Temporal Reasoner},
  author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal={arXiv preprint arXiv:2512.07469},
  year={2025}
}
```

⭐ **If you find this project helpful, please consider giving it a star!** ⭐
