README.md
February 6, 2026
⭐ Update
❗ This is Luchao's implementation of Over++.
- [2026.02] Release the inference code
⚙️ Installation
- Recommended environment: python 3.10, CUDA 12.6, torch 2.9.1, diffusers 0.36.0.
- Please check requirements.txt for the dependencies.
- [Optional] Install SAM2 by following the instructions.
Model Weights
Please download the following models. For CogVideoX-Fun-V1.5-5b-InP, you may also use the code below to download it.
mkdir -p models/Diffusion_Transformer
cd models/Diffusion_Transformer
git lfs install
git clone https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.5-5b-InP
| Model | Download link | Description |
|---|---|---|
| CogVideoX-Fun-V1.5-5b-InP | huggingface | Pre-trained inpainting model containing vae, encoder, transformer, etc. Please follow the instructions in videox_fun and save it to models/Diffusion_Transformer/CogVideoX-Fun-V1.5-5b-InP. |
| Over++ | huggingface | A transformer module fine-tuned from VideoX-Fun's released inpainting model. This is not a full model; it only replaces the transformer weights of the CogVideoX-Fun-V1.5-5b-InP pipeline above. Please download the transformer weights and save them to /PATH/TO/CHECKPOINTS. |
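As an alternative to cloning with git, the base model could also be fetched from Python. The sketch below assumes the `huggingface_hub` package is installed and that the repository id is `alibaba-pai/CogVideoX-Fun-V1.5-5b-InP`; the helper names `base_model_dir` and `download_base_model` are hypothetical and not part of this repo.

```python
from pathlib import Path

# Assumption: Hugging Face repo id of the base model named in the table above.
BASE_REPO_ID = "alibaba-pai/CogVideoX-Fun-V1.5-5b-InP"


def base_model_dir(root="models/Diffusion_Transformer"):
    """Return the folder where this README expects the base model to live."""
    return Path(root) / BASE_REPO_ID.split("/")[-1]


def download_base_model(root="models/Diffusion_Transformer"):
    """Download the base model snapshot (hypothetical helper, needs network access)."""
    # Imported lazily so base_model_dir() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = base_model_dir(root)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=BASE_REPO_ID, local_dir=str(target))
    return target
```

Calling `download_base_model()` from the repository root should place the files under `models/Diffusion_Transformer/CogVideoX-Fun-V1.5-5b-InP`. The Over++ transformer weights still need to be downloaded separately from the link in the table.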
⚡ Inference
Input Format
Prepare your input data in the following structure:
examples/
├── your-sequence-name/
│ ├── input_video.mp4 # Input video without effects
│ ├── trimask_00.mp4 # Mask with white for effect regions, black for unchanged regions, and gray for unknown regions (can be full gray mask video if no specific regions are to be changed)
│ ├── trimask_01.mp4 # Additional mask for inference (optional)
│ └── prompt.json # Text prompt: {"bg": "A kid in rain boots runs through puddles, sending turbulent water splashing in all directions, with sprays shooting high into the air."}
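Before launching inference, it can help to check that a sequence folder matches the layout above. The following is a minimal sketch, not part of this repo: the function name `check_sequence` is hypothetical, and it assumes that `prompt.json` only needs a string-valued `"bg"` entry, which is the only key shown in the example.

```python
import json
from pathlib import Path


def check_sequence(seq_dir):
    """Return a list of problems found in one sequence folder (empty if none)."""
    seq = Path(seq_dir)
    problems = []
    if not (seq / "input_video.mp4").is_file():
        problems.append("missing input_video.mp4")
    # trimask_01.mp4 and later masks are optional, so only trimask_00.mp4 is required.
    if not (seq / "trimask_00.mp4").is_file():
        problems.append("missing trimask_00.mp4")
    prompt_file = seq / "prompt.json"
    if not prompt_file.is_file():
        problems.append("missing prompt.json")
    else:
        try:
            prompt = json.loads(prompt_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append("prompt.json is not valid JSON")
        else:
            # Assumption: the prompt is stored under the "bg" key, as in the example.
            if not isinstance(prompt, dict) or not isinstance(prompt.get("bg"), str):
                problems.append('prompt.json needs a string "bg" entry')
    return problems
```

For example, `check_sequence("examples/your-sequence-name")` returns an empty list when all required files are present and the prompt parses correctly.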
Single GPU Inference
To run inference on a single GPU, use a command like the following:
python inference/cogvideox_fun/predict_v2v.py \
--config.experiment.save_foreground=True \
--config.experiment.save_path="output_temp" \
--config.data.data_rootdir="examples" \
--config.experiment.run_seqs="boy-water,pexles_car_drift" \
--config.experiment.skip_if_exists=False \
--config.data.dilate_width=0 \
--config.video_model.guidance_scale=6 \
--config.video_model.transformer_path="PATH/TO/CHECKPOINTS/diffusion_pytorch_model.safetensors"
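When sweeping over sequences or guidance scales, the same command can be assembled from Python. This is a rough sketch that only reuses the flags shown above; the helper `build_predict_command` is hypothetical and not provided by this repo.

```python
import shlex
import sys


def build_predict_command(overrides, script="inference/cogvideox_fun/predict_v2v.py"):
    """Build the argument list for the inference script from config overrides."""
    cmd = [sys.executable, script]
    for key, value in overrides.items():
        # Each override becomes one --config.<key>=<value> flag, as in the command above.
        cmd.append(f"--config.{key}={value}")
    return cmd


overrides = {
    "experiment.save_foreground": True,
    "experiment.save_path": "output_temp",
    "data.data_rootdir": "examples",
    "experiment.run_seqs": ",".join(["boy-water", "pexles_car_drift"]),
    "experiment.skip_if_exists": False,
    "data.dilate_width": 0,
    "video_model.guidance_scale": 6,
    "video_model.transformer_path": "PATH/TO/CHECKPOINTS/diffusion_pytorch_model.safetensors",
}
cmd = build_predict_command(overrides)
print(shlex.join(cmd))  # inspect the command before running it
```

To launch the run, pass `cmd` to `subprocess.run(cmd, check=True)` from the repository root.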
Note: `guidance_scale` is the classifier-free guidance (CFG) parameter that controls the trade-off between effect generation and content preservation. A higher value results in stronger effect generation but may also alter the original color tone. We recommend using a value between 2 and 20 (default: 6). A more advanced CFG formulation, such as the one in PickStyle (Video-to-Video Style Transfer, Pickford AI), may address this issue in the future.
Due to torch version compatibility issues, you may consider using `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` or installing `flash-attn` if you encounter OOM errors during inference.
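For intuition about the `guidance_scale` note above, the textbook classifier-free guidance rule combines an unconditional and a conditional prediction as `uncond + scale * (cond - uncond)`. The sketch below illustrates that rule on plain Python lists. How this repository applies guidance internally is not shown in this README, so treat the function as an assumption, not as the actual implementation.

```python
def apply_cfg(uncond, cond, guidance_scale):
    """Textbook CFG: extrapolate from the unconditional toward the conditional prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # toy prediction without the text prompt
cond = [1.0, 3.0]    # toy prediction with the text prompt

same_as_cond = apply_cfg(uncond, cond, 1.0)  # scale 1 reproduces cond: [1.0, 3.0]
amplified = apply_cfg(uncond, cond, 6.0)     # scale 6 amplifies the difference: [6.0, 13.0]
```

Larger scales push the result further from the unconditional prediction, which is consistent with the stronger effects and the possible color shift described in the note.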
Multi-GPU Batch Inference
To run batch inference on multiple GPUs, use a command like the following:
python inference/helper_inference_multi_gpu.py \
--input_dir 'examples' \
--output_dir 'output/your_output_dir' \
--n_gpus 8 \
--transformer_path 'PATH/TO/Over++/transformer.safetensors' \
--prompt_guidance_scale 6
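One simple way to spread sequences over several GPUs is round-robin assignment. The sketch below shows that idea only; how `helper_inference_multi_gpu.py` actually distributes work is not described in this README, and the function name `split_round_robin` is hypothetical.

```python
def split_round_robin(sequences, n_gpus):
    """Assign sequence names to GPUs in round-robin order (an assumed strategy)."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    buckets = [[] for _ in range(n_gpus)]
    # Sorting makes the assignment deterministic across runs.
    for index, name in enumerate(sorted(sequences)):
        buckets[index % n_gpus].append(name)
    return buckets


assignment = split_round_robin(["seq-a", "seq-b", "seq-c"], 2)
# GPU 0 gets ["seq-a", "seq-c"], GPU 1 gets ["seq-b"]
```

With more GPUs than sequences, some buckets stay empty, so a launcher built on this would skip those devices.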
📷 Dataset
We construct a diverse training dataset combining paired and unpaired videos to enable effective effect generation while preserving the base model's text-to-video capabilities.
Please refer to ./dataset for more details.
🎓 Training
We follow the procedure here to fine-tune the CogVideoX-5B inpainting model on 8 NVIDIA A6000 GPUs:
accelerate launch --use_deepspeed --deepspeed_config_file config/zero_stage2_config.json --deepspeed_multinode_launcher standard scripts/cogvideox_fun/train.py \
--pretrained_model_name_or_path="PATH/TO/models/Diffusion_Transformer/CogVideoX-Fun-V1.5-5b-InP" \
--train_data_meta="PATH/TO/datasets/train-casper/casper.json" \
--image_sample_size=512 \
--video_sample_size=256 \
--token_sample_size=512 \
--video_sample_stride=1 \
--video_sample_n_frames=85 \
--train_batch_size=1 \
--video_repeat=1 \
--gradient_accumulation_steps=1 \
--num_train_epochs=5 \
--checkpointing_steps=1000 \
--learning_rate=2e-05 \
--lr_scheduler="constant_with_warmup" \
--lr_warmup_steps=100 \
--seed=42 \
--gradient_checkpointing \
--mixed_precision="bf16" \
--adam_weight_decay=3e-2 \
--adam_epsilon=1e-10 \
--vae_mini_batch=1 \
--max_grad_norm=0.05 \
--random_hw_adapt \
--training_with_video_token_length \
--random_frame_crop \
--enable_bucket \
--use_came \
--use_deepspeed \
--train_mode="casper" \
--dataloader_num_workers=0 \
--report_to="wandb" \
--trainable_modules "." \
--binarize_mask \
--output_dir="training/PATH/TO/OUTPUT_DIR"
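Two of the settings above are easy to reason about numerically. The sketch below assumes that `constant_with_warmup` follows the usual rule (a linear ramp from 0 to the base learning rate over the warmup steps, then constant) and that training is data-parallel across the 8 GPUs. Both function names are hypothetical and only illustrate the arithmetic.

```python
def constant_with_warmup_lr(step, base_lr=2e-5, warmup_steps=100):
    """Learning rate at a given step: linear warmup, then constant."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr


def effective_batch_size(n_gpus=8, train_batch_size=1, gradient_accumulation_steps=1):
    """Samples contributing to one optimizer step under data parallelism."""
    return n_gpus * train_batch_size * gradient_accumulation_steps


lr_mid_warmup = constant_with_warmup_lr(50)     # halfway through warmup, about 1e-5
lr_after_warmup = constant_with_warmup_lr(500)  # 2e-5 once warmup is over
global_batch = effective_batch_size()           # 8 with the flags above
```

Under these assumptions, the run uses a global batch of 8 videos per optimizer step and reaches the full learning rate of 2e-05 after 100 steps.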
🙏 Acknowledgement
We thank the authors of VideoX-Fun, SAM2, and gen-omnimatte for sharing their code and models.
We also appreciate the results from Omnimatte, Omnimatte3D, OmnimatteRF, and OmnimatteZero, as well as the videos on Pexels [1,2,3,4,5,6,7,8,9,10,11,12,13], which were used for fine-tuning Over++.
The views and conclusions contained herein are those of the authors and do not represent the official policies or endorsements of these institutions.
📑 Citation
If you find our repo useful for your research, please consider citing our paper:
@misc{qi2025overgenerativevideocompositing,
title={Over++: Generative Video Compositing for Layer Interaction Effects},
author={Luchao Qi and Jiaye Wu and Jun Myeong Choi and Cary Phillips and Roni Sengupta and Dan B Goldman},
year={2025},
eprint={2512.19661},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.19661},
}