Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA (CVPR 2026 Findings)
April 6, 2026
Customizing a dedicated semantic LoRA for each reference video.
Official implementation of the paper "Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA".
Video2LoRA enables semantic video generation by dynamically predicting lightweight LoRA adapters from reference videos using a HyperNetwork, without requiring per-condition fine-tuning.
Highlights
Video2LoRA introduces a new paradigm for semantic-controlled video generation.
Instead of training separate models or LoRA adapters for each semantic condition (e.g., visual effects, camera motion, style), our framework predicts semantic-specific LoRA weights directly from a reference video.
Key features:
- Reference-driven semantic video generation
- Ultra-lightweight LoRA (<50 KB per semantic condition)
- Transformer-based HyperNetwork for LoRA prediction
- Strong zero-shot generalization
- Unified framework across heterogeneous semantic controls
Method Overview

Video2LoRA consists of three key components:
1. LightLoRA Representation
We introduce LightLoRA, a compact LoRA formulation that decomposes the standard LoRA matrices into two groups of factors:
- trainable auxiliary matrices
- matrices predicted by the HyperNetwork
This design significantly reduces parameter size while preserving semantic adaptability.
Each semantic condition requires less than 50 KB of parameters.
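To give a feel for why a factorized LoRA can fit in a budget of this size, the sketch below compares parameter counts for a standard LoRA and for a decomposed variant. Everything specific here is an assumption made for illustration: the factorization shape (`delta_W = B_aux @ B_pred @ A_pred @ A_aux`), the hidden size, the number of adapted layers, the inner dimensions, and 2-byte storage are hypothetical and are not taken from the paper or the repository.

```python
def standard_lora_params(d_in: int, d_out: int, rank: int) -> int:
    # Standard LoRA stores A (rank x d_in) and B (d_out x rank) per adapted layer.
    return rank * d_in + d_out * rank


def predicted_params(inner_a: int, inner_b: int, rank: int) -> int:
    # Assumed factorization: delta_W = B_aux @ B_pred @ A_pred @ A_aux, where
    # B_pred is (inner_b x rank) and A_pred is (rank x inner_a). Only these two
    # small matrices would differ per semantic condition; the auxiliary
    # matrices are shared across conditions.
    return inner_b * rank + rank * inner_a


def size_kib(n_params: int, bytes_per_param: int = 2) -> float:
    # 2 bytes per parameter corresponds to fp16/bf16 storage (an assumption).
    return n_params * bytes_per_param / 1024


# Hypothetical sizes chosen only to illustrate the arithmetic.
D_IN = D_OUT = 3072
NUM_LAYERS = 100
standard = NUM_LAYERS * standard_lora_params(D_IN, D_OUT, rank=4)
light = NUM_LAYERS * predicted_params(inner_a=50, inner_b=100, rank=1)

print(f"standard LoRA:  {size_kib(standard):.1f} KiB")
print(f"predicted part: {size_kib(light):.1f} KiB")
```

With these made-up numbers, the per-condition part shrinks from several MiB to a few tens of KiB, because the large dimensions `d_in` and `d_out` only appear in the shared auxiliary matrices.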
2. HyperNetwork for LoRA Prediction
A Transformer-based HyperNetwork predicts semantic-specific LoRA weights conditioned on a reference video.
Pipeline:
Reference Video
↓
3D VAE Encoder
↓
Spatio-temporal features
↓
Transformer Decoder
↓
Predicted LoRA weights
These predicted LoRA modules are injected into the frozen diffusion backbone.
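As a reminder of what injecting a LoRA module into a frozen layer means, here is a minimal pure-Python sketch of the generic LoRA forward pass, `y = W x + scale * B (A x)`. It illustrates the general technique only; how this repository wires predicted adapters into the backbone is not shown here, and the helper names `matvec` and `lora_forward` are hypothetical.

```python
def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    # W is the frozen base weight and is never modified.
    # A (rank x d_in) and B (d_out x rank) form the low-rank update.
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Toy example: 3x3 identity as the frozen weight and a rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
A = [[1.0, 1.0, 1.0]]           # rank 1 x d_in 3
B = [[0.5], [0.0], [-0.5]]      # d_out 3 x rank 1
x = [1.0, 2.0, 3.0]

y = lora_forward(W, A, B, x)
print(y)
```

Swapping in a different `(A, B)` pair changes the layer's behaviour without touching `W`, which is what lets one frozen backbone serve many semantic conditions.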
3. End-to-End Diffusion Training
Unlike prior methods that require:
- pretrained semantic LoRA weights
- multi-stage training pipelines
Video2LoRA is trained end-to-end using only the standard diffusion objective.
Zero-Shot Semantic Generation
Video2LoRA generalizes well to unseen semantic conditions.
Even when encountering out-of-domain visual effects, the model can generate semantically aligned videos based on reference videos.
Example semantic controls include:
- visual effects (VFX)
- camera motion
- object stylization
- character transformations
- artistic styles
Dataset
Video2LoRA follows the dataset format used in VideoX-Fun, which supports mixed image and video training with text descriptions.
Organize your dataset in the following structure:
project/
│
├── datasets/
│   ├── internal_datasets/
│   │
│   ├── train/
│   │   ├── 00000001.mp4
│   │   ├── 00000002.jpg
│   │   ├── 00000003.mp4
│   │   └── ...
│   │
│   └── json_of_internal_datasets.json
JSON Annotation Format
[
{
"file_path": "train/00000001.mp4",
"text": "A group of young men in suits and sunglasses walking down a city street.",
"type": "video"
},
{
"file_path": "train/00000002.jpg",
"text": "A group of young men in suits and sunglasses walking down a city street.",
"type": "image"
}
]
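If you prefer to generate the annotation file with a script, the following sketch builds and checks entries in the format shown above using only the standard library. The helper names `make_entry` and `validate` are hypothetical and not part of the repository, and the extension-to-type mapping covers only the `.mp4` and `.jpg` cases that appear in the example.

```python
import json
from pathlib import PurePosixPath

# Only .mp4 and .jpg appear in the example above; extend these sets if your
# data uses other extensions (whether the loader accepts them is not stated).
VIDEO_EXTS = {".mp4"}
IMAGE_EXTS = {".jpg"}
REQUIRED_KEYS = {"file_path", "text", "type"}


def make_entry(file_path: str, text: str) -> dict:
    """Build one annotation entry, inferring "type" from the file extension."""
    suffix = PurePosixPath(file_path).suffix.lower()
    if suffix in VIDEO_EXTS:
        media_type = "video"
    elif suffix in IMAGE_EXTS:
        media_type = "image"
    else:
        raise ValueError(f"unsupported extension: {file_path}")
    return {"file_path": file_path, "text": text, "type": media_type}


def validate(entries: list) -> None:
    """Raise ValueError if an entry does not follow the format shown above."""
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
        if entry["type"] not in ("video", "image"):
            raise ValueError(f"entry {i} has unknown type: {entry['type']!r}")


caption = "A group of young men in suits and sunglasses walking down a city street."
entries = [
    make_entry("train/00000001.mp4", caption),
    make_entry("train/00000002.jpg", caption),
]
validate(entries)
annotation_json = json.dumps(entries, indent=2)
print(annotation_json)
```

Writing `annotation_json` to `json_of_internal_datasets.json` would then reproduce the example file above.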
Installation
Clone repository
git clone https://github.com/BerserkerVV/Video2LoRA.git
cd Video2LoRA
Create environment
conda create -n video2lora python=3.10
conda activate video2lora
Install dependencies
pip install -r requirements.txt
Training
Train Video2LoRA:
bash scripts/cogvideoxfun/train_lora.sh
Training setup:
| Item | Value |
|---|---|
| Backbone | CogVideoX-Fun-V1.1-5b-InP |
| GPUs | 8 × NVIDIA A800 |
| Iterations | 20K |
| Frames | 49 |
| FPS | 8 |
| Resolution | 512, 768, 1024, 1280 |
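
For a quick sanity check of the frame settings in the table, the sketch below converts frames and FPS into clip duration and estimates how many latent frames a 3D VAE would produce. The temporal stride of 4 with the first frame encoded on its own is an assumption about the VAE, not a value stated in this README; check the backbone's configuration before relying on it.

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    # Duration of a clip sampled at a fixed frame rate.
    return num_frames / fps


def latent_frame_count(num_frames: int, temporal_stride: int = 4) -> int:
    # Assumption: the first frame is encoded alone and the remaining frames
    # are compressed by `temporal_stride` along the time axis.
    return (num_frames - 1) // temporal_stride + 1


duration = clip_duration_seconds(49, 8)
latent_frames = latent_frame_count(49)
print(f"{duration} s per clip, {latent_frames} latent frames")
```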
Inference
Generate a video using a reference video:
bash examples/cogvideox_fun/run_predict_i2v.sh
Citation
If you find our work useful, please cite:
@misc{wu2026video2loraunifiedsemanticcontrolledvideo,
title={Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA},
author={Zexi Wu and Baolu Li and Jing Dai and Yiming Zhang and Yue Ma and Qinghe Wang and Xu Jia and Hongming Xu},
year={2026},
eprint={2603.08210},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.08210},
}
If you find this project useful, please consider starring the repository to support our work.