UniVideo: Unified Understanding, Generation, and Editing for Videos

June 4, 2026

¹University of Waterloo ²Kling Team, Kuaishou Technology
*Work done during an internship at Kling Team, Kuaishou Technology. †Corresponding author.

🚀 Supported Tasks

UniVideo supports flexible input and output configurations, covering a wide range of multimodal tasks:

| Task | Input Type | Output | Task ID | Description |
| --- | --- | --- | --- | --- |
| Image/Video Understanding | Image🖼️ / Video🎬 + Text📝 | Text📝 | `understanding` | Multimodal analysis and captioning. |
| Text-to-Image | Text📝 | Image🖼️ | `t2i` | Generating images from text prompts. |
| Text-to-Video | Text📝 | Video🎬 | `t2v` | Generating videos from text prompts. |
| Image-to-Video | Image🖼️ + Text📝 | Video🎬 | `i2v` | Animating a static image into a video. |
| Image Editing | Image🖼️ + Text📝 | Image🖼️ | `i2i_edit` | Instruction-based image editing. |
| In-context Image Editing | Image🖼️ + Image🖼️ + Text📝 | Image🖼️ | `i+i2i_edit` | Editing an image based on a reference image. |
| In-context Generation | Image🖼️ × N + Text📝 | Image🖼️ / Video🎬 | `multiid` | Multi-subject generation. |
| Video Editing | Video🎬 + Text📝 | Video🎬 | `v2v_edit` | Instruction-based video manipulation and stylization. |
| In-context Video Editing | Image🖼️ + Video🎬 + Text📝 | Video🎬 | `i+v2v_edit` | Reference-based manipulation: addition, deletion, swapping, and stylization. |

🔔News

[2026-06-03]: The training script and instructions are now available in TRAINING.md.
[2026-01-30]: UniVideo was accepted at ICLR 2026 🎉
[2026-01-07]: Released the code and model.
[2025-10-09]: Released the arXiv preprint and the project page.

How to use

1. Installation

conda env create -f environment.yml
conda activate univideo

This environment is tested with:

Python 3.11
PyTorch 2.4.1 + CUDA 12.1
diffusers 0.34.0
transformers 4.51.3
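
To compare an existing environment against the tested versions above, a small check along these lines may help. This is a minimal sketch, not part of the repository: the `check_versions` helper is hypothetical, and it assumes the packages are installed under their usual PyPI names (`torch`, `diffusers`, `transformers`).

```python
from importlib import metadata

# Versions listed as tested in this README; keys are PyPI distribution names.
TESTED = {
    "torch": "2.4.1",
    "diffusers": "0.34.0",
    "transformers": "4.51.3",
}


def check_versions(tested=TESTED):
    """Return {package: (installed_or_None, tested, matches)}."""
    report = {}
    for name, want in tested.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        # Ignore local build tags such as "+cu121" when comparing.
        matches = have is not None and have.split("+")[0] == want
        report[name] = (have, want, matches)
    return report


if __name__ == "__main__":
    for name, (have, want, ok) in check_versions().items():
        status = "ok" if ok else "differs"
        print(f"{name}: installed={have} tested={want} [{status}]")
```

A mismatch does not necessarily mean the code will fail; it only flags that the environment differs from the tested one.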

If creating the environment from environment.yml fails, set it up manually:

conda create -n univideo python=3.11 -y
conda activate univideo
conda install pytorch==2.4.1 torchvision pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r requirements.txt

2. Download Checkpoint

Download the UniVideo checkpoint to a local path, for example ckpts/:

python download_ckpt.py --variant hidden

We provide two UniVideo checkpoint variants, as described in Section 3.2 of the arXiv preprint:

Variant 1 (img, video, txt -> mllm -> last layer hidden -> mmdit)
Image, video, and text inputs are processed by the MLLM, and the final hidden states are fed into the MMDiT backbone.
Variant 2 (img, video, txt, queries -> mllm -> txt + queries last layer hidden -> mmdit)
Image, video, text, and queries are processed by the MLLM. The final hidden states of text and queries are used as inputs to MMDiT.
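
The difference between the two variants comes down to which last-layer positions are passed on to MMDiT. The toy sketch below illustrates that selection only. It is not the UniVideo implementation: the token kinds, the string placeholders standing in for hidden states, and the `condition_for_mmdit` helper are all assumptions made for illustration.

```python
def mllm_last_hidden(tokens):
    """Toy stand-in for the MLLM: one placeholder 'hidden state' per token."""
    return [f"h({kind}:{i})" for i, kind in enumerate(tokens)]


def condition_for_mmdit(tokens, variant):
    """Pick the last-layer states that would be handed to MMDiT."""
    hidden = mllm_last_hidden(tokens)
    if variant == "hidden":
        # Variant 1: the final hidden states of the whole input sequence.
        return hidden
    if variant == "queries":
        # Variant 2: keep only the text and query positions.
        return [h for h, kind in zip(hidden, tokens) if kind in ("txt", "query")]
    raise ValueError(f"unknown variant: {variant!r}")


if __name__ == "__main__":
    v1_tokens = ["img", "img", "txt"]
    v2_tokens = ["img", "img", "txt", "query", "query"]
    print(condition_for_mmdit(v1_tokens, "hidden"))
    print(condition_for_mmdit(v2_tokens, "queries"))
```

In this toy setup, Variant 1 forwards all three states, while Variant 2 drops the image positions and forwards only the text and query states.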

Download the queries-based checkpoint with:

python download_ckpt.py --variant queries

Or download both variants without deleting either local directory:

python download_ckpt.py --variant all

3. Inference

We provide demo inference scripts that show how to load and run the UniVideo pipeline, with pipeline_kwargs set up for each type of input. Feel free to adapt them to your own inputs and setup.

1. Basic Understanding & Generation

# Image/Video Captioning & Understanding
python univideo_inference.py --demo_task understanding --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Text-to-Video (T2V)
python univideo_inference.py --demo_task t2v --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Text-to-Image (T2I)
python univideo_inference.py --demo_task t2i --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Image-to-Video (I2V)
python univideo_inference.py --demo_task i2v --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

2. Instruction-based Editing

# Image Editing 
python univideo_inference.py --demo_task image_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Video Editing
python univideo_inference.py --demo_task video_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Video Stylization
python univideo_inference.py --demo_task stylization --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

3. In-Context Tasks

# In-context video generation
python univideo_inference.py --demo_task in_context_video_gen --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# In-context image editing
python univideo_inference.py --demo_task in_context_image_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# In-context video editing
## addition
python univideo_inference.py --demo_task in_context_video_edit_addition --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
## swap
python univideo_inference.py --demo_task in_context_video_edit_swap --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
## style
python univideo_inference.py --demo_task in_context_video_edit_style --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
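
All of the commands above share the same shape and differ only in the `--demo_task` value. A small helper like the following can generate them in one place. It is a sketch, not part of the repository: `build_command` and `DEMO_TASKS` are hypothetical names, and the task list is simply copied from the commands in this section.

```python
import shlex
import sys

# Default config used by the commands in this README.
CONFIG = "configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml"

# --demo_task values that appear in the commands above.
DEMO_TASKS = [
    "understanding", "t2v", "t2i", "i2v",
    "image_edit", "video_edit", "stylization",
    "in_context_video_gen", "in_context_image_edit",
    "in_context_video_edit_addition",
    "in_context_video_edit_swap",
    "in_context_video_edit_style",
]


def build_command(task, config=CONFIG, python=sys.executable):
    """Return the argv list for one demo task."""
    if task not in DEMO_TASKS:
        raise ValueError(f"unknown demo task: {task!r}")
    return [python, "univideo_inference.py", "--demo_task", task, "--config", config]


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try.
    for task in DEMO_TASKS:
        print(shlex.join(build_command(task)))
```

To actually launch a task, the returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.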

4. Multi-GPU README Sweep

To run the README demo tasks across multiple local GPUs while keeping each task's default hyperparameters, use:

python scripts/run_readme_inference_sweep.py \
  --gpus 0,1,2 \
  --max-parallel 3 \
  --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml \
  --output-root outputs/readme-inference

The launcher writes one log per task under outputs/readme-inference/logs. Lower --max-parallel if checkpoint loading saturates local storage.
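
For intuition, one common way to spread independent tasks over several GPUs is to pin each process to a single device through the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below shows a simple round-robin assignment. It is an illustration under that assumption and does not describe how scripts/run_readme_inference_sweep.py is implemented; `assign_gpus` and `env_for_gpu` are hypothetical helpers, and a real launcher would also need to cap concurrency as `--max-parallel` does.

```python
import os


def assign_gpus(tasks, gpus):
    """Round-robin assignment: task i goes to gpus[i % len(gpus)]."""
    if not gpus:
        raise ValueError("at least one GPU id is required")
    return [(task, gpus[i % len(gpus)]) for i, task in enumerate(tasks)]


def env_for_gpu(gpu, base_env=None):
    """Copy the environment and restrict the child process to one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return env


if __name__ == "__main__":
    for task, gpu in assign_gpus(["t2v", "t2i", "i2v", "understanding"], [0, 1, 2]):
        print(f"{task} -> GPU {gpu}")
```

The environment returned by `env_for_gpu` can be passed as `env=` to `subprocess.Popen` so that each child process sees only its assigned device.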

UniVideo Variant 2

To use the queries-based version of UniVideo, pass the queries config in place of the hidden config in any of the commands above:

--config configs/univideo_qwen2p5vl7b_queries_hunyuanvideo.yaml

We provide an example training setting using open-source data so users can run a small training job and verify the training pipeline. See TRAINING.md for the data schema, dataset preparation details, and full training options.

python download_ckpt.py --variant hidden
python -m pip install --target .deps/pyarrow pyarrow
bash scripts/prepare_smoke_data.sh
torchrun --standalone --nproc_per_node 8 \
  train/train_univideo.py configs/train_multitask_129f_hybrid_smoke.yaml

5. Evaluation

We provide scripts for evaluating UniVideo on the GenEval, ImgEdit, GEdit, and VBench benchmarks. See EVAL.md for details.

Acknowledgement

HunyuanVideo: the base video generation model used in this work. Thanks to the authors for their excellent contribution.
Qwen2.5-VL: the base vision-language model (VLM) used in this work. Thanks to the authors for their excellent contribution.
MetaQueries: we adopt their query implementation. Thanks to the authors for their excellent contribution.

🌟 Citation

If you find UniVideo useful for your research and applications, please cite using this BibTeX:

@article{wei2025univideo,
  title={Univideo: Unified understanding, generation, and editing for videos},
  author={Wei, Cong and Liu, Quande and Ye, Zixuan and Wang, Qiulin and Wang, Xintao and Wan, Pengfei and Gai, Kun and Chen, Wenhu},
  journal={arXiv preprint arXiv:2510.08377},
  year={2025}
}