February 6, 2026

Over++: Generative Video Compositing for Layer Interaction Effects

⭐ Update

❗ This is Luchao's implementation of Over++.

[2026.02] Released the inference code.

⚙️ Installation

Recommended environment: python 3.10, CUDA 12.6, torch 2.9.1, diffusers 0.36.0.
Please check requirements.txt for the dependencies.
[Optional] Install SAM2 by following the instructions in the SAM2 repository.

Model Weights

Please download the following models. CogVideoX-Fun-V1.5-5b-InP can also be downloaded with the commands below.

mkdir -p models/Diffusion_Transformer
cd models/Diffusion_Transformer
git lfs install
git clone https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.5-5b-InP
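
If git lfs is not available, the same download can probably be done from Python. The sketch below assumes the `huggingface_hub` package is installed and that the repository id is `alibaba-pai/CogVideoX-Fun-V1.5-5b-InP` (inferred from the model name used throughout this README). `target_dir` and `download_base_model` are illustrative helpers, not part of this repo.

```python
from pathlib import Path

# Assumed repository id, inferred from the model name used in this README.
REPO_ID = "alibaba-pai/CogVideoX-Fun-V1.5-5b-InP"


def target_dir(models_root="models"):
    """Return the folder this README expects the base model to live in."""
    return Path(models_root) / "Diffusion_Transformer" / REPO_ID.split("/")[-1]


def download_base_model(models_root="models"):
    """Download a snapshot of the base model into the expected folder."""
    # Imported lazily so target_dir() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dir = target_dir(models_root)
    local_dir.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=REPO_ID, local_dir=str(local_dir))
    return local_dir
```

Calling `download_base_model()` from the repository root should place the files under `models/Diffusion_Transformer/CogVideoX-Fun-V1.5-5b-InP`, the path referenced in the table below.
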

Model	Download link	Description
CogVideoX-Fun-V1.5-5b-InP	huggingface	Pre-trained inpainting model containing vae, encoder, transformer, etc. Please follow the instructions in videox_fun and save it to `models/Diffusion_Transformer/CogVideoX-Fun-V1.5-5b-InP`.
Over++	huggingface	A transformer module fine-tuned from VideoX-Fun's released inpainting model. It is not a full model: it only replaces the transformer weights of the CogVideoX-Fun-V1.5-5b-InP pipeline above. Please download the transformer weights and save them to `/PATH/TO/CHECKPOINTS`.

⚡ Inference

Input Format

Prepare your input data in the following structure:

examples/
├── your-sequence-name/
│   ├── input_video.mp4         # Input video without effects
│   ├── trimask_00.mp4          # Mask with white for effect regions, black for unchanged regions, and gray for unknown regions (can be full gray mask video if no specific regions are to be changed)
│   ├── trimask_01.mp4          # Additional mask for inference (optional)
│   └── prompt.json             # Text prompt: {"bg": "A kid in rain boots runs through puddles, sending turbulent water splashing in all directions, with sprays shooting high into the air."}
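
Before launching inference, it can help to check that a sequence folder matches the layout above. The following is a minimal sketch, not part of this repo: `check_sequence` is a hypothetical helper, and it assumes that `trimask_00.mp4` is required (only `trimask_01.mp4` is marked optional above) and that `prompt.json` must contain a string under the `"bg"` key.

```python
import json
from pathlib import Path


def check_sequence(seq_dir):
    """Return a list of problems found in one sequence folder (empty if none)."""
    seq_dir = Path(seq_dir)
    problems = []
    if not (seq_dir / "input_video.mp4").is_file():
        problems.append("missing input_video.mp4")
    # Assumption: the first trimask is mandatory, later ones are optional.
    if not (seq_dir / "trimask_00.mp4").is_file():
        problems.append("missing trimask_00.mp4")
    prompt_path = seq_dir / "prompt.json"
    if not prompt_path.is_file():
        problems.append("missing prompt.json")
    else:
        try:
            prompt = json.loads(prompt_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append("prompt.json is not valid JSON")
        else:
            if not isinstance(prompt, dict) or not isinstance(prompt.get("bg"), str):
                problems.append('prompt.json has no string "bg" entry')
    return problems
```

For example, `check_sequence("examples/your-sequence-name")` returns an empty list when all expected files are in place. The check looks only at file names and the prompt key, not at video contents or mask values.
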

Single GPU Inference

To run inference on a single GPU, use a command like the following:

python inference/cogvideox_fun/predict_v2v.py \
    --config.experiment.save_foreground=True \
    --config.experiment.save_path="output_temp" \
    --config.data.data_rootdir="examples" \
    --config.experiment.run_seqs="boy-water,pexles_car_drift" \
    --config.experiment.skip_if_exists=False \
    --config.data.dilate_width=0 \
    --config.video_model.guidance_scale=6 \
    --config.video_model.transformer_path="PATH/TO/CHECKPOINTS/diffusion_pytorch_model.safetensors"
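
To compare several guidance scales on the same sequence, the command above can be assembled programmatically. The sketch below only prints the commands and uses only the flags shown above. `build_predict_command` and the `output_cfg*` folder names are hypothetical.

```python
import shlex


def build_predict_command(run_seqs, transformer_path, guidance_scale=6, save_path="output_temp"):
    """Assemble the single-GPU command from this README as an argument list."""
    return [
        "python", "inference/cogvideox_fun/predict_v2v.py",
        "--config.experiment.save_foreground=True",
        f"--config.experiment.save_path={save_path}",
        "--config.data.data_rootdir=examples",
        f"--config.experiment.run_seqs={','.join(run_seqs)}",
        "--config.experiment.skip_if_exists=False",
        "--config.data.dilate_width=0",
        f"--config.video_model.guidance_scale={guidance_scale}",
        f"--config.video_model.transformer_path={transformer_path}",
    ]


# Print one command per guidance scale; output folder names are illustrative.
for scale in (2, 6, 12):
    cmd = build_predict_command(
        ["boy-water"],
        "PATH/TO/CHECKPOINTS/diffusion_pytorch_model.safetensors",
        guidance_scale=scale,
        save_path=f"output_cfg{scale}",
    )
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument list can be passed to `subprocess.run(cmd, check=True)` to execute it directly.
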

Note: guidance_scale is the classifier-free guidance (CFG) scale, which controls the trade-off between effect generation and content preservation. Higher values produce stronger effects but may also shift the original color tone. We recommend values between 2 and 20 (default: 6). A more advanced CFG formulation, such as the one used in PickStyle (Pickford AI), may address this issue in the future.
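
For intuition about what guidance_scale does, the textbook classifier-free guidance update combines an unconditional and a prompt-conditioned prediction as sketched below. This README does not confirm that the pipeline uses exactly this form, so treat it as an assumption. `cfg_combine` is a hypothetical helper that works on plain lists rather than tensors.

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Textbook CFG: uncond + scale * (cond - uncond), element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # toy prediction without the prompt
cond = [1.0, 1.0]    # toy prediction with the prompt

print(cfg_combine(uncond, cond, 1.0))  # [1.0, 1.0]: scale 1 reproduces cond
print(cfg_combine(uncond, cond, 6.0))  # [6.0, 1.0]: larger scales extrapolate past cond
```

Under this formulation, values above 1 push the result beyond the conditional prediction, which is consistent with stronger effects and possible color shifts at high scales.
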
If you hit out-of-memory (OOM) errors during inference, which can occur because of torch version compatibility issues, consider running export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before launching, or installing flash-attn.
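
When inference is launched from a Python wrapper rather than a shell, the same allocator setting can be applied through the environment. This is a minimal sketch with a hypothetical helper name. It assumes the variable is set before torch initializes CUDA, so the safest place is before the first `import torch`.

```python
import os


def enable_expandable_segments():
    """Request expandable CUDA memory segments unless a value is already set."""
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    return os.environ["PYTORCH_CUDA_ALLOC_CONF"]


# Call this before the first `import torch` or CUDA allocation in the process.
enable_expandable_segments()
```

`setdefault` leaves any value the user has already exported untouched, so an explicit shell setting still takes precedence.
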

Multi-GPU Batch Inference

To run batch inference on multiple GPUs, use a command like the following:

python inference/helper_inference_multi_gpu.py \
    --input_dir 'examples' \
    --output_dir 'output/your_output_dir' \
    --n_gpus 8 \
    --transformer_path 'PATH/TO/Over++/transformer.safetensors' \
    --prompt_guidance_scale 6
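
How the helper script distributes work is not described here. As an illustration only, one plausible scheme is to list the sequence folders and assign them to GPUs in round-robin order. `list_sequences` and `shard_round_robin` below are hypothetical and are not taken from `helper_inference_multi_gpu.py`.

```python
from pathlib import Path


def list_sequences(input_dir):
    """Return sorted names of sub-folders that contain an input_video.mp4."""
    root = Path(input_dir)
    return sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and (p / "input_video.mp4").is_file()
    )


def shard_round_robin(items, n_gpus):
    """Assign items to GPUs in round-robin order; returns one list per GPU."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return [items[i::n_gpus] for i in range(n_gpus)]
```

Each non-empty shard could then be joined with commas and passed as `run_seqs` to a single-GPU run pinned to one device (for example via `CUDA_VISIBLE_DEVICES`). Shards are empty when there are fewer sequences than GPUs.
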

📷 Dataset

We construct a diverse training dataset combining paired and unpaired videos to enable effective effect generation while preserving the base model's text-to-video capabilities.

Please refer to ./dataset for more details.

🎓 Training

We follow the procedure here to fine-tune the CogVideoX-5B inpainting model on 8 NVIDIA A6000 GPUs:

accelerate launch --use_deepspeed --deepspeed_config_file config/zero_stage2_config.json --deepspeed_multinode_launcher standard scripts/cogvideox_fun/train.py \
  --pretrained_model_name_or_path="PATH/TO/models/Diffusion_Transformer/CogVideoX-Fun-V1.5-5b-InP" \
  --train_data_meta="PATH/TO/datasets/train-casper/casper.json" \
  --image_sample_size=512 \
  --video_sample_size=256 \
  --token_sample_size=512 \
  --video_sample_stride=1 \
  --video_sample_n_frames=85 \
  --train_batch_size=1 \
  --video_repeat=1 \
  --gradient_accumulation_steps=1 \
  --num_train_epochs=5 \
  --checkpointing_steps=1000 \
  --learning_rate=2e-05 \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=100 \
  --seed=42 \
  --gradient_checkpointing \
  --mixed_precision="bf16" \
  --adam_weight_decay=3e-2 \
  --adam_epsilon=1e-10 \
  --vae_mini_batch=1 \
  --max_grad_norm=0.05 \
  --random_hw_adapt \
  --training_with_video_token_length \
  --random_frame_crop \
  --enable_bucket \
  --use_came \
  --use_deepspeed \
  --train_mode="casper" \
  --dataloader_num_workers=0 \
  --report_to="wandb" \
  --trainable_modules "." \
  --binarize_mask \
  --output_dir="training/PATH/TO/OUTPUT_DIR"
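
When reproducing this run on different hardware, it is useful to know the effective batch size implied by the flags above. The sketch below assumes plain data parallelism, where each GPU processes `train_batch_size` samples per step. The helper names are hypothetical.

```python
import math


def effective_batch_size(n_gpus, train_batch_size, gradient_accumulation_steps):
    """Samples contributing to one optimizer step under data parallelism."""
    return n_gpus * train_batch_size * gradient_accumulation_steps


def accumulation_steps_for(target_batch, n_gpus, train_batch_size):
    """Accumulation steps needed to reach at least target_batch samples per step."""
    return math.ceil(target_batch / (n_gpus * train_batch_size))


# Values from the command above: 8 GPUs, batch size 1, no accumulation.
readme_batch = effective_batch_size(8, 1, 1)
print(readme_batch)  # 8

# With only 2 GPUs, 4 accumulation steps give the same effective batch size.
print(accumulation_steps_for(readme_batch, n_gpus=2, train_batch_size=1))  # 4
```

Matching the effective batch size does not guarantee identical results, but it keeps the learning rate and warmup settings in a comparable regime.
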

🙏 Acknowledgement

We thank the authors of VideoX-Fun, SAM2, and gen-omnimatte for sharing their code and models.

We also appreciate the results from Omnimatte, Omnimatte3D, OmnimatteRF, and OmnimatteZero, as well as the videos on Pexels [1,2,3,4,5,6,7,8,9,10,11,12,13], which were used for fine-tuning Over++.

The views and conclusions contained herein are those of the authors and do not represent the official policies or endorsements of these institutions.

📑 Citation

If you find our repo useful for your research, please consider citing our paper:

@misc{qi2025overgenerativevideocompositing,
  title={Over++: Generative Video Compositing for Layer Interaction Effects},
  author={Luchao Qi and Jiaye Wu and Jun Myeong Choi and Cary Phillips and Roni Sengupta and Dan B Goldman},
  year={2025},
  eprint={2512.19661},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.19661},
}