February 24, 2026
A flexible framework to bridge video understanding, generation and editing
Hao Yang2, Zhiyu Tan1,2†, Jia Gong2, Luozheng Qin2, Hesen Chen1,2, Xiaomeng Yang2, Yuqing Sun2, Yuetan Lin2, Mengping Yang2*, Hao Li1,2*
1Fudan University | 2Shanghai Academy of Artificial Intelligence for Science
*Corresponding Author †Project Lead
🔥 Latest News
- February 12, 2026: 🔥🔥 We release the Technical Report of Omni-Video 2 on arXiv!
- February 12, 2026: 🔥🔥 We are glad to release a lighter model, OmniVideo2-1.3B. It is much smaller and faster while still delivering good quality!
- January 22, 2026: 🔥🔥 The all-new OmniVideo2 is now released!
- August 6, 2025: We are glad to release v0.1's code, which includes support for both inference and fine-tuning!
- August 6, 2025: Our v0.1 model is now uploaded to Hugging Face!
- July 7, 2025: We release the Technical Report of Omni-Video.
- July 7, 2025: We release the project page of Omni-Video.
Introduction
We present a unified video editing and generation framework that pairs a text-to-video DiT backbone with vision-language understanding for precise, controllable edits. A VLM reads the source video and edit instruction to predict a detailed caption of the expected edited result, converting sparse prompts into explicit semantics about content, attributes, and temporal changes. The DiT model then uses mixed cross-attention conditioning, injecting source VAE latents (optionally concatenated with other cues) together with the expanded text semantics, to preserve identity, layout, and motion while enabling flexible control. This yields a single pipeline that supports text-to-video, video-to-video editing, and mixed-condition generation.
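To make the conditioning idea concrete, here is a toy, pure-Python sketch of cross-attention over a mixed context. It assumes one plausible reading of the description above: the expanded text tokens and the source-latent tokens are concatenated along the sequence axis and attended to jointly. The function names, token values, and identity projections are all illustrative assumptions and are not taken from the OmniVideo2 code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head attention with identity projections (a simplification).

    Each query token attends over every context token and returns a
    weighted average of the context tokens.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * tok[j] for w, tok in zip(weights, context)) for j in range(dim)]
        )
    return outputs


# Hypothetical stand-ins: two text tokens (from the VLM-expanded caption)
# and two source-latent tokens, all of width 3.
text_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
source_latent_tokens = [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]

# Assumption: "mixed" conditioning = one context holding both token sets.
mixed_context = text_tokens + source_latent_tokens

noisy_video_tokens = [[0.2, 0.1, 0.9]]
updated = cross_attention(noisy_video_tokens, mixed_context)
```

In this toy setup, each output token is a convex combination of text and source-latent tokens, which is the intuition behind preserving source content while following the expanded caption.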
Framework
Video Editing Demos
Note: Left side shows the source video, right side shows the edited result.
Advanced Video Editing
Complex Edit
Multi-element transformations combining appearance, lighting, and environmental changes.
- Change the man's black jacket to a tattered gray overcoat, replace the wall with faded blue wallpaper
- Change the woman's red shirt to glowing neon cyan, transform window glow to electric blue moonlight
- Change the man's black jacket to a gray coat with glowing thread, replace blue light with warm amber
- Change workout attire to vibrant crimson sports bra and leggings, replace towel with flowing silk scarf
High Motion
Challenging edits on fast-moving subjects with dynamic clothing and dramatic motion.
- Change the woman's black top to a flowing blood-red silk gown that billows with motion
- Change the woman's green jacket to a deep crimson cloak that billows dramatically
- Change the armored suit from red-and-black to matte charcoal gray with cyan circuitry accents
- Change the woman's white shirt to a blood-red silk blouse that clings to her form
Diverse Local Edit
Precise object-level modifications while preserving surrounding context and motion.
- Change the real raccoon to a stuffed raccoon
- Change the firefighter's pizza to a steaming cup of coffee
- Change the light brown fur to deep obsidian-black fur with icy blue ethereal mist
- Change the golden retriever to a black Labrador
Basic Video Editing
Add
Adding objects and accessories to videos.
- Add a scarf around the first fox's neck
- Add a tiny pirate hat on the parrot's head
- Add a red headband to the player's forehead
- Add a tiny crown to the hummingbird's head
Remove
Removing elements from videos while maintaining scene coherence.
- Remove the meditation cushion from the scene
- Remove the two cubs from the scene
- Remove the two lizards from the scene
- Remove the black cat from the scene
Local Change
Local attribute changes on specific objects.
- Change the woman's white dress to a blood-stained black gown
- Change the fox into a badger
- Change the man with thick beard to a woman with short silver hair
- Change the engineer's navy jacket to a bright crimson trench coat
Project Structure
omnivideo2_release/
├── omnivideo/
│   ├── configs/              # Model configurations
│   ├── distributed/          # FSDP and sequence parallel utilities
│   ├── modules/              # Core model components (attention, VAE, T5, etc.)
│   ├── utils/                # Utility functions and solvers
│   ├── vllm_model.py         # Qwen3-VL integration
│   └── x2x_gen_unified.py    # Main generation pipeline
└── tools/
    └── inference/
        ├── generate_omni_v2v.py     # Inference script
        └── inference_omni_v2v.sh    # Shell launcher
Environment Setup
Requirements
- Python >= 3.10
- PyTorch >= 2.8 with CUDA support
- NVIDIA GPU with sufficient VRAM (recommended: 80GB for A14B model)
Installation
- Clone the repository:
git clone https://github.com/your-org/omnivideo2.git
cd omnivideo2
- Create a conda environment:
conda create -n omnivideo2 python=3.10
conda activate omnivideo2
- Install dependencies:
pip install -r requirements.txt
pip install flash-attn --no-build-isolation # Optional but recommended for faster attention
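After installation, a quick sanity check against the requirements listed above can save a failed launch. The following is a minimal sketch; the helper names `version_tuple` and `check_environment` are illustrative and not part of the repository.

```python
import importlib.util
import sys


def version_tuple(version):
    """Turn a version string such as '2.8.0+cu128' into (2, 8).

    Anything after '+' (the local build tag) is ignored.
    """
    core = version.split("+", 1)[0]
    return tuple(int(part) for part in core.split(".")[:2])


def check_environment():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python >= 3.10 is required")
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch

        if version_tuple(torch.__version__) < (2, 8):
            problems.append(f"PyTorch >= 2.8 is required, found {torch.__version__}")
        if not torch.cuda.is_available():
            problems.append("CUDA is not available to PyTorch")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

This sketch does not check available VRAM or whether `flash-attn` was built successfully.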
Model Checkpoints
Download the pretrained checkpoints and organize them as follows:
${CKPT_DIR}/
├── high_noise_model/
│   └── model.pt                      # High-noise timestep model
├── low_noise_model/
│   └── model.pt                      # Low-noise timestep model
├── special_tokens.pkl                # Special token embeddings
├── models_t5_umt5-xxl-enc-bf16.pth   # T5 encoder
└── Wan2.1_VAE.pth                    # VAE model
You will also need the Qwen3-VL model for visual feature extraction:
- Download from: Qwen3-VL-30B-A3B-Instruct
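Before launching inference, it can help to confirm that the checkpoint directory matches the layout above. The sketch below only checks that the listed files exist. It assumes this layout applies to the checkpoint you downloaded, and the helper name `missing_checkpoint_files` is illustrative.

```python
from pathlib import Path

# Relative paths copied from the checkpoint layout shown above.
EXPECTED_FILES = (
    "high_noise_model/model.pt",
    "low_noise_model/model.pt",
    "special_tokens.pkl",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.1_VAE.pth",
)


def missing_checkpoint_files(ckpt_dir):
    """Return the expected relative paths that are not files under ckpt_dir."""
    root = Path(ckpt_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_checkpoint_files("/path/to/your/checkpoints")` returns an empty list when every expected file is in place.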
Inference
Prepare Input Data
Create a JSONL file with your prompts. Each line should be a JSON object:
For Video-to-Video editing:
{"sample_id": "001", "edit_prompt": "Change the dog to a cat", "source_clip_path": "/path/to/source_video.mp4"}
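For larger batches, the file can be generated programmatically. This is a minimal sketch using only Python's standard library. It assumes the three keys in the example record (`sample_id`, `edit_prompt`, `source_clip_path`) are all that a video-to-video record needs, and the helper names are illustrative rather than part of the repository.

```python
import json
from pathlib import Path

# Keys taken from the example record above; treating them as required
# is an assumption of this sketch.
REQUIRED_KEYS = ("sample_id", "edit_prompt", "source_clip_path")


def write_prompts_jsonl(records, out_path):
    """Write one JSON object per line, checking each record's keys first."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for i, rec in enumerate(records):
            missing = [k for k in REQUIRED_KEYS if k not in rec]
            if missing:
                raise ValueError(f"record {i} is missing keys: {missing}")
            # ensure_ascii=False keeps non-ASCII prompts readable in the file.
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return out_path


def read_prompts_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Pass the resulting file path as `DATA_FILE` in the launcher configuration described in the next step.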
Run Inference
- Edit the configuration in tools/inference/inference_omni_e2e.sh:
# Update these paths
CKPT_DIR="/path/to/your/checkpoints"
QWEN3VL_MODEL_PATH="/path/to/Qwen3-VL-30B-A3B-Instruct"
DATA_FILE="/path/to/your/prompts.jsonl"
# Adjust generation parameters as needed
GEN_SIZE="832*480" # Video resolution (width*height)
GEN_FRAME_NUM=41 # Number of frames
GEN_SAMPLE_FPS=8 # Output FPS
GEN_TASK="v2v-A14B" # Task type: v2v-A14B, t2v-A14B, v2v-1.3B, or t2v-1.3B
- Run the inference script:
## for OmniVideo2-A14B
bash tools/inference/inference_omni_e2e.sh
## for OmniVideo2-1.3B
bash tools/inference/inference_omni_v2v_1_3B.sh
Available Tasks
| Task | Description |
|---|---|
| `t2v-A14B` | Text-to-Video generation with OmniVideo2-A14B model |
| `v2v-A14B` | Video-to-Video editing with OmniVideo2-A14B model |
| `t2v-1.3B` | Text-to-Video generation with OmniVideo2-1.3B model |
| `v2v-1.3B` | Video-to-Video editing with OmniVideo2-1.3B model |
Generation Parameters
| Parameter | Default | Description |
|---|---|---|
| `--size` | `832*480` | Output video resolution (`width*height`) |
| `--frame_num` | `41` | Number of frames to generate |
| `--sample_fps` | `8` | Output video FPS |
| `--sample_steps` | `40` | Number of diffusion sampling steps |
| `--sample_guide_scale` | `3.0` | Classifier-free guidance scale |
| `--sample_shift` | `5` | Noise schedule shift parameter |
| `--sample_solver` | `unipc` | Sampling solver (`unipc`, `ddim`, `euler`) |
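As a quick worked example of how these defaults relate, the sketch below parses the `width*height` size string and computes the clip length implied by the frame count and FPS. The helper names are illustrative and not part of the repository.

```python
def parse_size(size):
    """Split a 'width*height' string such as '832*480' into two ints."""
    width, height = (int(part) for part in size.split("*"))
    return width, height


def clip_duration_seconds(frame_num, sample_fps):
    """Duration of the output clip when frame_num frames play at sample_fps."""
    return frame_num / sample_fps


width, height = parse_size("832*480")
duration = clip_duration_seconds(41, 8)
print(width, height, duration)  # 832 480 5.125
```

With the defaults, 41 frames at 8 FPS give a clip of roughly 5.1 seconds.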
Acknowledgements
We sincerely thank the following teams for their outstanding contributions that made this project possible:
- Wan Team: For the foundational video generation architecture, VAE model, and diffusion framework.
- Qwen-VL Team: For the powerful Qwen3-VL vision-language model.
License
Please refer to the LICENSE file for details.
Citation
If you find this work useful, please consider citing:
@article{yang2026omnivideo2,
title={Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing},
author={Yang, Hao and Tan, Zhiyu and Gong, Jia and Qin, Luozheng and Chen, Hesen and Yang, Xiaomeng and Sun, Yuqing and Lin, Yuetan and Yang, Mengping and Li, Hao},
journal={arXiv preprint arXiv:2602.08820},
year={2026}
}
@article{tan2025omni,
title={Omni-Video: Democratizing Unified Video Understanding and Generation},
author={Tan, Zhiyu and Yang, Hao and Qin, Luozheng and Gong, Jia and Yang, Mengping and Li, Hao},
journal={arXiv preprint arXiv:2507.06119},
year={2025}
}