README.md

February 24, 2026

A flexible framework to bridge video understanding, generation and editing

Hao Yang², Zhiyu Tan¹,²†, Jia Gong², Luozheng Qin², Hesen Chen¹,², Xiaomeng Yang², Yuqing Sun², Yuetan Lin², Mengping Yang²*, Hao Li¹,²*

¹Fudan University | ²Shanghai Academy of Artificial Intelligence for Science
*Corresponding Author | †Project Lead

🔥 Latest News

February 12, 2026: 🔥🔥 We release the Technical Report of Omni-Video 2 on arXiv!
February 12, 2026: 🔥🔥 We are glad to release a lighter model, OmniVideo2-1.3B, which is much smaller and faster while still delivering good quality!
January 22, 2026: 🔥🔥 The all-new OmniVideo2 is now released!
August 6, 2025: We are glad to release v0.1's code, which includes support for both inference and fine-tuning!
August 6, 2025: Our v0.1 model is now available on the Hugging Face model hub!
July 7, 2025: We release the technical report of Omni-Video.
July 7, 2025: We release the project page of Omni-Video.

We present a unified video editing and generation framework that pairs a text-to-video DiT backbone with vision-language understanding for precise, controllable edits. A VLM reads the source video and edit instruction to predict a detailed caption of the expected edited result, converting sparse prompts into explicit semantics about content, attributes, and temporal changes. The DiT model then uses mixed cross-attention conditioning, injecting source VAE latents (optionally concatenated with other cues) together with the expanded text semantics, to preserve identity, layout, and motion while enabling flexible control. This yields a single pipeline that supports text-to-video, video-to-video editing, and mixed-condition generation.
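
To make the conditioning idea concrete, here is a toy sketch of cross-attention over a mixed context. It is not the model's implementation: it assumes the text tokens and source-latent tokens are simply concatenated along the token axis, uses a single head with no learned projections, and all names (`text_tokens`, `source_latent_tokens`, `noisy_video_tokens`) are hypothetical.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Normalize scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: list[list[float]], context: list[list[float]]) -> list[list[float]]:
    """Single-head attention without learned projections (illustrative only).

    Each query token returns a softmax-weighted average of the context tokens.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * tok[d] for w, tok in zip(weights, context)) for d in range(dim)]
        )
    return outputs


# Toy 2-dimensional tokens; real models use learned, high-dimensional embeddings.
text_tokens = [[1.0, 0.0], [0.5, 0.5]]          # stands in for the expanded caption
source_latent_tokens = [[0.0, 1.0]]             # stands in for source VAE latents
noisy_video_tokens = [[1.0, 1.0], [0.0, 2.0]]   # tokens being denoised

# Assumption: the two condition streams are concatenated into one context.
mixed_context = text_tokens + source_latent_tokens
conditioned = cross_attention(noisy_video_tokens, mixed_context)
print(len(conditioned), len(conditioned[0]))  # 2 2
```

Because every video token attends over both streams at once, the caption and the source latents can each influence the result, which is the intuition behind preserving layout while applying the edit.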

Framework

Video Editing Demos

Note: The left side shows the source video; the right side shows the edited result.

Advanced Video Editing

Complex Edit

Multi-element transformations combining appearance, lighting, and environmental changes.

Change the man's black jacket to a tattered gray overcoat, replace the wall with faded blue wallpaper	Change the woman's red shirt to glowing neon cyan, transform window glow to electric blue moonlight
Change the man's black jacket to a gray coat with glowing thread, replace blue light with warm amber	Change workout attire to vibrant crimson sports bra and leggings, replace towel with flowing silk scarf

High Motion

Challenging edits on fast-moving subjects with dynamic clothing and dramatic motion.

Change the woman's black top to a flowing blood-red silk gown that billows with motion	Change the woman's green jacket to a deep crimson cloak that billows dramatically
Change the armored suit from red-and-black to matte charcoal gray with cyan circuitry accents	Change the woman's white shirt to a blood-red silk blouse that clings to her form

Diverse Local Edit

Precise object-level modifications while preserving surrounding context and motion.

Change the real raccoon to a stuffed raccoon	Change the firefighter's pizza to a steaming cup of coffee
Change the light brown fur to deep obsidian-black fur with icy blue ethereal mist	Change the golden retriever to a black Labrador

Basic Video Editing

Add

Adding objects and accessories to videos.

Add a scarf around the first fox's neck	Add a tiny pirate hat on the parrot's head
Add a red headband to the player's forehead	Add a tiny crown to the hummingbird's head

Remove

Removing elements from videos while maintaining scene coherence.

Remove the meditation cushion from the scene	Remove the two cubs from the scene
Remove the two lizards from the scene	Remove the black cat from the scene

Local Change

Local attribute changes on specific objects.

Change the woman's white dress to a blood-stained black gown	Change the fox into a badger
Change the man with thick beard to a woman with short silver hair	Change the engineer's navy jacket to a bright crimson trench coat

Project Structure

omnivideo2_release/
├── omnivideo/
│   ├── configs/           # Model configurations
│   ├── distributed/       # FSDP and sequence parallel utilities
│   ├── modules/           # Core model components (attention, VAE, T5, etc.)
│   ├── utils/             # Utility functions and solvers
│   ├── vllm_model.py      # Qwen3-VL integration
│   └── x2x_gen_unified.py # Main generation pipeline
└── tools/
    └── inference/
        ├── generate_omni_v2v.py    # Inference script
        └── inference_omni_v2v.sh   # Shell launcher

Environment Setup

Requirements

Python >= 3.10
PyTorch >= 2.8 with CUDA support
NVIDIA GPU with sufficient VRAM (recommended: 80 GB for the A14B model)
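
The requirements above can be checked before installing anything else. The following is a minimal sketch, not part of the repository: `check_environment` and `parse_major_minor` are hypothetical helpers, and the VRAM figure is read from GPU 0 only.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # from the requirements above
MIN_TORCH = (2, 8)    # from the requirements above


def parse_major_minor(version: str) -> tuple[int, int]:
    """Turn a version string such as '2.8.0+cu128' into (2, 8)."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def check_environment() -> dict:
    """Collect pass/fail flags without raising, so the report can be read as a whole."""
    report = {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "torch_ok": False,
        "cuda_ok": False,
        "vram_gb": None,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed yet

    import torch

    report["torch_ok"] = parse_major_minor(torch.__version__) >= MIN_TORCH
    report["cuda_ok"] = torch.cuda.is_available()
    if report["cuda_ok"]:
        total_bytes = torch.cuda.get_device_properties(0).total_memory
        report["vram_gb"] = round(total_bytes / 1024**3, 1)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```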

Installation

Clone the repository:

git clone https://github.com/your-org/omnivideo2.git
cd omnivideo2

Create a conda environment:

conda create -n omnivideo2 python=3.10
conda activate omnivideo2

Install dependencies:

pip install -r requirements.txt
pip install flash-attn --no-build-isolation  # Optional but recommended for faster attention

Model Checkpoints

Download the pretrained checkpoints and organize them as follows:

${CKPT_DIR}/
├── high_noise_model/
│   └── model.pt              # High-noise timestep model
├── low_noise_model/
│   └── model.pt              # Low-noise timestep model
├── special_tokens.pkl        # Special token embeddings
├── models_t5_umt5-xxl-enc-bf16.pth  # T5 encoder
└── Wan2.1_VAE.pth            # VAE model
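
A quick way to confirm the layout above before launching inference is sketched below. The file list is copied from the tree; `missing_checkpoint_files` is a hypothetical helper, and reading the directory from a `CKPT_DIR` environment variable is an assumption made for this example.

```python
import os
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_FILES = [
    "high_noise_model/model.pt",
    "low_noise_model/model.pt",
    "special_tokens.pkl",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.1_VAE.pth",
]


def missing_checkpoint_files(ckpt_dir: str) -> list[str]:
    """Return the expected files that are absent under ckpt_dir."""
    root = Path(ckpt_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumption: the checkpoint directory is exported as CKPT_DIR.
    ckpt_dir = os.environ.get("CKPT_DIR", ".")
    missing = missing_checkpoint_files(ckpt_dir)
    if missing:
        print(f"Missing under {ckpt_dir}:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Checkpoint layout looks complete.")
```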

You will also need the Qwen3-VL model for visual feature extraction:

Download from: Qwen3-VL-30B-A3B-Instruct

Inference

Prepare Input Data

Create a JSONL file with your prompts. Each line should be a JSON object:

For Video-to-Video editing:

{"sample_id": "001", "edit_prompt": "Change the dog to a cat", "source_clip_path": "/path/to/source_video.mp4"}
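
For larger batches it may be easier to generate and check the JSONL file with a script. The sketch below uses only the three keys shown in the example above; `write_jsonl` and `validate_jsonl` are hypothetical helpers, and skipping blank lines is a choice of this checker, not documented behavior of the inference script.

```python
import json
from pathlib import Path

# Keys from the Video-to-Video example above.
REQUIRED_KEYS = ("sample_id", "edit_prompt", "source_clip_path")


def write_jsonl(records: list[dict], out_path: str) -> None:
    """Write one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def validate_jsonl(path: str) -> list[str]:
    """Return human-readable problems; an empty list means the file looks fine."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are skipped by this checker
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            for key in REQUIRED_KEYS:
                if key not in record:
                    problems.append(f"line {line_no}: missing key '{key}'")
            clip = record.get("source_clip_path")
            if clip and not Path(clip).is_file():
                problems.append(f"line {line_no}: source clip not found: {clip}")
    return problems
```

Calling `validate_jsonl("/path/to/your/prompts.jsonl")` before a long run catches missing keys and broken video paths early.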

Run Inference

Edit the configuration in tools/inference/inference_omni_e2e.sh:

# Update these paths
CKPT_DIR="/path/to/your/checkpoints"
QWEN3VL_MODEL_PATH="/path/to/Qwen3-VL-30B-A3B-Instruct"
DATA_FILE="/path/to/your/prompts.jsonl"

# Adjust generation parameters as needed
GEN_SIZE="832*480"       # Video resolution (width*height)
GEN_FRAME_NUM=41         # Number of frames
GEN_SAMPLE_FPS=8         # Output FPS
GEN_TASK="v2v-A14B"      # Task type: v2v-A14B, t2v-A14B, v2v-1.3B, or t2v-1.3B
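
As a worked example of what these values imply, the sketch below parses the `width*height` string and computes the playback length of a clip. `parse_size` and `clip_duration_seconds` are hypothetical helpers written for this illustration.

```python
def parse_size(size: str) -> tuple[int, int]:
    """Split a 'width*height' string such as '832*480' into integers."""
    width, height = size.split("*")
    return int(width), int(height)


def clip_duration_seconds(frame_num: int, fps: int) -> float:
    """Playback length of frame_num frames shown at fps frames per second."""
    return frame_num / fps


width, height = parse_size("832*480")
print(width, height)                 # 832 480
print(clip_duration_seconds(41, 8))  # 5.125
```

With the values shown above, 41 frames at 8 FPS give a clip of roughly five seconds.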

Run the inference script:

# For OmniVideo2-A14B
bash tools/inference/inference_omni_e2e.sh
# For OmniVideo2-1.3B
bash tools/inference/inference_omni_v2v_1_3B.sh

Available Tasks

Task	Description
`t2v-A14B`	Text-to-Video generation with OmniVideo2-A14B model
`v2v-A14B`	Video-to-Video editing with OmniVideo2-A14B model
`t2v-1.3B`	Text-to-Video generation with OmniVideo2-1.3B model
`v2v-1.3B`	Video-to-Video editing with OmniVideo2-1.3B model
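
Task names combine a mode (`t2v` or `v2v`) with a model size. A small sketch of validating a task name and looking up the matching launcher follows. `split_task` is a hypothetical helper; the launcher paths are the ones listed under Run Inference, and using one launcher per model size for both modes is an assumption.

```python
# Task names from the "Available Tasks" table.
VALID_TASKS = ("t2v-A14B", "v2v-A14B", "t2v-1.3B", "v2v-1.3B")

# Launcher scripts as listed under "Run Inference", keyed by model size.
# Assumption: each launcher serves both modes of its model size.
LAUNCHERS = {
    "A14B": "tools/inference/inference_omni_e2e.sh",
    "1.3B": "tools/inference/inference_omni_v2v_1_3B.sh",
}


def split_task(task: str) -> tuple[str, str]:
    """Split 'v2v-A14B' into ('v2v', 'A14B'); raise ValueError on unknown names."""
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {VALID_TASKS}")
    mode, model = task.split("-", 1)
    return mode, model


mode, model = split_task("v2v-1.3B")
print(mode, model, LAUNCHERS[model])  # v2v 1.3B tools/inference/inference_omni_v2v_1_3B.sh
```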

Generation Parameters

Parameter	Default	Description
`--size`	`832*480`	Output video resolution (width*height)
`--frame_num`	`41`	Number of frames to generate
`--sample_fps`	`8`	Output video FPS
`--sample_steps`	`40`	Number of diffusion sampling steps
`--sample_guide_scale`	`3.0`	Classifier-free guidance scale
`--sample_shift`	`5`	Noise schedule shift parameter
`--sample_solver`	`unipc`	Sampling solver (unipc, ddim, euler)
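
The table above maps naturally onto a command-line parser. The sketch below mirrors the listed flags and defaults with `argparse`; it is an illustration only, and the actual parser in `generate_omni_v2v.py` may define types, choices, or additional arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the Generation Parameters table."""
    parser = argparse.ArgumentParser(description="Generation parameters (sketch)")
    parser.add_argument("--size", default="832*480", help="output resolution, width*height")
    parser.add_argument("--frame_num", type=int, default=41, help="number of frames")
    parser.add_argument("--sample_fps", type=int, default=8, help="output video FPS")
    parser.add_argument("--sample_steps", type=int, default=40, help="diffusion sampling steps")
    parser.add_argument("--sample_guide_scale", type=float, default=3.0,
                        help="classifier-free guidance scale")
    parser.add_argument("--sample_shift", type=float, default=5.0,
                        help="noise schedule shift")
    parser.add_argument("--sample_solver", choices=["unipc", "ddim", "euler"],
                        default="unipc", help="sampling solver")
    return parser


# Override two values and keep the remaining defaults.
args = build_parser().parse_args(["--sample_steps", "30", "--sample_solver", "euler"])
print(args.sample_steps, args.sample_solver, args.size)  # 30 euler 832*480
```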

Acknowledgements

We sincerely thank the following teams for their outstanding contributions that made this project possible:

Wan Team: For the foundational video generation architecture, VAE model, and diffusion framework.
Qwen-VL Team: For the powerful Qwen3-VL vision-language model.

License

Please refer to the LICENSE file for details.

Citation

If you find this work useful, please consider citing:

@article{yang2026omnivideo2,
  title={Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing},
  author={Yang, Hao and Tan, Zhiyu and Gong, Jia and Qin, Luozheng and Chen, Hesen and Yang, Xiaomeng and Sun, Yuqing and Lin, Yuetan and Yang, Mengping and Li, Hao},
  journal={arXiv preprint arXiv:2602.08820},
  year={2026}
}
@article{tan2025omni,
  title={Omni-Video: Democratizing Unified Video Understanding and Generation},
  author={Tan, Zhiyu and Yang, Hao and Qin, Luozheng and Gong, Jia and Yang, Mengping and Li, Hao},
  journal={arXiv preprint arXiv:2507.06119},
  year={2025}
}