UniVideo: Unified Understanding, Generation, and Editing for Videos
June 4, 2026
Cong Wei*,1,2 Quande Liu†,2 Zixuan Ye2 Qiulin Wang2 Xintao Wang2
Pengfei Wan2 Kun Gai2 Wenhu Chen†,1
1University of Waterloo
2Kling Team, Kuaishou Technology
*Work done during an internship at Kling Team, Kuaishou Technology
†Corresponding author

🚀 Supported Tasks
UniVideo is flexible in its input and output configurations, supporting a wide range of multimodal tasks:
| Task | Input Type | Output | Task ID | Description |
|---|---|---|---|---|
| Image/Video Understanding | Image🖼️ / Video🎬 + Text📝 | Text📝 | understanding | Multimodal analysis and captioning. |
| Text-to-Image | Text📝 | Image🖼️ | t2i | Generating images from text prompts. |
| Text-to-Video | Text📝 | Video🎬 | t2v | Generating videos from text prompts. |
| Image-to-Video | Image🖼️ + Text📝 | Video🎬 | i2v | Animating a static image into a video. |
| Image Editing | Image🖼️ + Text📝 | Image🖼️ | i2i_edit | Instruction-based image editing. |
| In-context Image Editing | Image🖼️ + Image🖼️ + Text📝 | Image🖼️ | i+i2i_edit | Editing an image based on a reference image. |
| In-context Generation | Image🖼️ × N + Text📝 | Image🖼️ / Video🎬 | multiid | Multi-subject generation. |
| Video Editing | Video🎬 + Text📝 | Video🎬 | v2v_edit | Instruction-based video manipulation and stylization. |
| In-context Video Editing | Image🖼️ + Video🎬 + Text📝 | Video🎬 | i+v2v_edit | Reference-based manipulation: addition, deletion, swapping, and stylization. |
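For scripting around these tasks, the table can be kept as a small lookup structure. The sketch below is only a transcription of the Task ID, Input Type, and Output columns; the names `TASKS` and `tasks_producing` are hypothetical and are not part of the UniVideo code base.

```python
# Task IDs, inputs, and outputs transcribed from the Supported Tasks table.
# TASKS and tasks_producing are illustrative names, not UniVideo APIs.
TASKS = {
    "understanding": ("image or video + text", {"text"}),
    "t2i": ("text", {"image"}),
    "t2v": ("text", {"video"}),
    "i2v": ("image + text", {"video"}),
    "i2i_edit": ("image + text", {"image"}),
    "i+i2i_edit": ("image + image + text", {"image"}),
    "multiid": ("N images + text", {"image", "video"}),
    "v2v_edit": ("video + text", {"video"}),
    "i+v2v_edit": ("image + video + text", {"video"}),
}


def tasks_producing(modality):
    """Return the task IDs, in table order, whose output includes `modality`."""
    return [task_id for task_id, (_, outputs) in TASKS.items() if modality in outputs]


for modality in ("text", "image", "video"):
    print(modality, "<-", ", ".join(tasks_producing(modality)))
```

Note that these task IDs are the ones listed in the table; the `--demo_task` names used by the demo script in the Inference section are a separate set of identifiers.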
🔔 News
- [2026-06-03]: The training script and instructions are now available in TRAINING.md.
- [2026-01-30]: UniVideo was accepted at ICLR 2026 🎉
- [2026-01-07]: Released the code and model.
- [2025-10-09]: Released the arXiv preprint and the project page.
How to use
1. Installation
conda env create -f environment.yml
conda activate univideo
This environment is tested with:
- Python 3.11
- PyTorch 2.4.1 + CUDA 12.1
- diffusers 0.34.0
- transformers 4.51.3
If creating the environment from environment.yml fails, set it up manually instead:
conda create -n univideo python=3.11 -y
conda activate univideo
conda install pytorch==2.4.1 torchvision pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r requirements.txt
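After either install path, it can help to confirm that the active environment matches the tested versions listed above. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the packages are published under the distribution names `torch`, `diffusers`, and `transformers`.

```python
from importlib import metadata

# Versions from the "tested with" list above; keys are pip distribution names.
TESTED = {"torch": "2.4.1", "diffusers": "0.34.0", "transformers": "4.51.3"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches(installed, tested):
    """True if the installed release equals the tested one.

    A local build tag such as the '+cu121' in '2.4.1+cu121' is ignored.
    """
    if installed is None:
        return False
    return installed.split("+")[0] == tested


def report(tested=TESTED):
    """Return (package, tested, installed, status) rows for each package."""
    rows = []
    for package, want in tested.items():
        have = installed_version(package)
        rows.append((package, want, have, "ok" if matches(have, want) else "differs"))
    return rows


if __name__ == "__main__":
    for package, want, have, status in report():
        print(f"{package}: tested {want}, installed {have} -> {status}")
```

A "differs" row does not necessarily mean the code will fail; it only flags a version that was not part of the tested setup.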
2. Download Checkpoint
Download the UniVideo checkpoint to a local path, for example ckpts/:
python download_ckpt.py --variant hidden
We provide two UniVideo checkpoint variants, as described in Section 3.2 of the arXiv preprint:
- Variant 1 (img, video, txt -> mllm -> last layer hidden -> mmdit): Image, video, and text inputs are processed by the MLLM, and the final hidden states are fed into the MMDiT backbone.
- Variant 2 (img, video, txt, queries -> mllm -> txt + queries last layer hidden -> mmdit): Image, video, text, and queries are processed by the MLLM. The final hidden states of the text and queries are used as inputs to the MMDiT.
Download the queries-based checkpoint with:
python download_ckpt.py --variant queries
Or download both variants without deleting either local directory:
python download_ckpt.py --variant all
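The difference between the two variants comes down to which last-layer hidden states of the MLLM are passed on to the MMDiT. The sketch below illustrates one reading of the descriptions above with plain Python lists. The function name, the token-kind labels, and the list representation are all assumptions made for illustration; the actual implementation operates on tensors and may differ in detail.

```python
def select_condition(tokens, variant):
    """Pick the hidden states that condition the MMDiT (illustrative only).

    tokens: list of (kind, hidden_state) pairs from the MLLM's last layer,
            where kind is one of 'image', 'video', 'text', 'query'.
    variant: 'hidden' or 'queries', matching the --variant values above.
    """
    if variant == "hidden":
        # Variant 1: all last-layer hidden states are fed to the MMDiT.
        return [hidden for _, hidden in tokens]
    if variant == "queries":
        # Variant 2: only the text and query positions are kept.
        return [hidden for kind, hidden in tokens if kind in ("text", "query")]
    raise ValueError(f"unknown variant: {variant!r}")


# Placeholder strings stand in for hidden-state vectors.
sequence = [("image", "h_img"), ("text", "h_txt"), ("query", "h_q0"), ("query", "h_q1")]
print(select_condition(sequence[:2], "hidden"))  # no queries in the Variant 1 input
print(select_condition(sequence, "queries"))
```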
3. Inference
We provide demo inference scripts that show how to load and run the UniVideo pipeline by setting pipeline_kwargs for different inputs. Feel free to adapt them to your own inputs and setup.
1. Basic Understanding & Generation
# Image/Video Captioning & Understanding
python univideo_inference.py --demo_task understanding --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# Text-to-Video (T2V)
python univideo_inference.py --demo_task t2v --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# Text-to-Image (T2I)
python univideo_inference.py --demo_task t2i --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# Image-to-Video (I2V)
python univideo_inference.py --demo_task i2v --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
2. Instruction-based Editing
# Image Editing
python univideo_inference.py --demo_task image_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# Video Editing
python univideo_inference.py --demo_task video_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# Video Stylization
python univideo_inference.py --demo_task stylization --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
3. In-Context Tasks
# In-context video generation
python univideo_inference.py --demo_task in_context_video_gen --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# In-context image editing
python univideo_inference.py --demo_task in_context_image_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# In-context video editing
## addition
python univideo_inference.py --demo_task in_context_video_edit_addition --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
## swap
python univideo_inference.py --demo_task in_context_video_edit_swap --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
## style
python univideo_inference.py --demo_task in_context_video_edit_style --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
4. Multi-GPU README Sweep
To run the README demo tasks across multiple local GPUs while keeping each task's default hyperparameters, use:
python scripts/run_readme_inference_sweep.py \
--gpus 0,1,2 \
--max-parallel 3 \
--config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml \
--output-root outputs/readme-inference
The launcher writes one log per task under outputs/readme-inference/logs.
Lower --max-parallel if checkpoint loading saturates local storage.
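To make the interaction between --gpus and --max-parallel concrete, here is a simplified sketch of how a launcher could spread tasks over GPUs. It is not the implementation of scripts/run_readme_inference_sweep.py: the function names are hypothetical, and the wave-based scheduling is an assumption chosen for brevity (a real launcher may start the next task as soon as any GPU frees up).

```python
def plan_sweep(tasks, gpus, max_parallel):
    """Group tasks into waves of concurrent jobs, one GPU per job.

    Each wave holds at most min(max_parallel, len(gpus)) jobs, so lowering
    max_parallel reduces how many checkpoints are loaded at the same time.
    """
    slots = min(max_parallel, len(gpus))
    if slots < 1:
        raise ValueError("need at least one GPU and max_parallel >= 1")
    waves = []
    for start in range(0, len(tasks), slots):
        batch = tasks[start:start + slots]
        waves.append([(task, gpus[i]) for i, task in enumerate(batch)])
    return waves


def job_env(gpu):
    """Environment override that restricts a child process to a single GPU."""
    return {"CUDA_VISIBLE_DEVICES": str(gpu)}


demo_tasks = ["understanding", "t2v", "t2i", "i2v"]
for number, wave in enumerate(plan_sweep(demo_tasks, gpus=[0, 1, 2], max_parallel=3)):
    for task, gpu in wave:
        print(f"wave {number}: {task} on GPU {gpu} with env {job_env(gpu)}")
```

With three GPUs and --max-parallel 3, the four tasks in this example run as one wave of three followed by one wave of one; with --max-parallel 2 they would run as two waves of two.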
UniVideo Variant 2
To use the queries-based variant of UniVideo, pass the queries config to any of the commands above through the --config flag:
--config configs/univideo_qwen2p5vl7b_queries_hunyuanvideo.yaml
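Since the two variants differ only in the config path, switching between them is easy to script. The helper below is a small sketch that assembles the demo command for a given task and variant. It uses only the script name, flags, and config paths shown in this README; the `CONFIGS` mapping and the `inference_command` function are hypothetical conveniences, not part of the repository.

```python
import shlex

# Config paths as they appear in this README, keyed by checkpoint variant.
CONFIGS = {
    "hidden": "configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml",
    "queries": "configs/univideo_qwen2p5vl7b_queries_hunyuanvideo.yaml",
}


def inference_command(demo_task, variant="hidden"):
    """Build the argv list for one demo run (hypothetical helper)."""
    if variant not in CONFIGS:
        raise ValueError(f"unknown variant: {variant!r}")
    return [
        "python", "univideo_inference.py",
        "--demo_task", demo_task,
        "--config", CONFIGS[variant],
    ]


# Print the command line for the queries-based text-to-video demo.
print(shlex.join(inference_command("t2v", "queries")))
```

The returned list can be passed directly to `subprocess.run` if the command should be executed rather than printed.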
4. Training
We provide an example training setting using open-source data so users can run a small training job and verify the training pipeline. See TRAINING.md for the data schema, dataset preparation details, and full training options.
python download_ckpt.py --variant hidden
python -m pip install --target .deps/pyarrow pyarrow
bash scripts/prepare_smoke_data.sh
torchrun --standalone --nproc_per_node 8 \
train/train_univideo.py configs/train_multitask_129f_hybrid_smoke.yaml
5. Evaluation
We provide scripts for evaluating UniVideo on the GenEval, ImgEdit, GEdit, and VBench benchmarks. See EVAL.md for details.
Acknowledgement
- HunyuanVideo: the base video generation model used in this work. Thanks to the authors for their excellent contribution.
- Qwen2.5-VL: the base VLM used in this work. Thanks to the authors for their excellent contribution.
- MetaQueries: we adopt their query implementation. Thanks to the authors for their excellent contribution.
🌟 Citation
If you find UniVideo useful for your research and applications, please cite using this BibTeX:
@article{wei2025univideo,
title={Univideo: Unified understanding, generation, and editing for videos},
author={Wei, Cong and Liu, Quande and Ye, Zixuan and Wang, Qiulin and Wang, Xintao and Wan, Pengfei and Gai, Kun and Chen, Wenhu},
journal={arXiv preprint arXiv:2510.08377},
year={2025}
}











