May 20, 2026
MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices
Shuai Zhang*, Bao Tang*, Siyuan Yu*, Yueting Zhu, Jingfeng Yao,
Ya Zou, Shanglin Yuan, Li Yu, Wenyu Liu, Xinggang Wang📧
Huazhong University of Science and Technology (HUST)
(* equal contribution, 📧 corresponding author)
📰 News
- [2026.05.20] We open-sourced our distillation code.
- [2025.11.27] We have released our paper on arXiv and our base model code.
Introduction
🎯 Demo
(1) 1280×720×17 Image to Video
(2) 960×960×17 Image to Video
🎯 How to Use
Installation
You can install the required environment using the provided requirements.txt file.
pip install -r requirements.txt
Data Processing
Many open-source video datasets are available, such as OpenVid, VFHQ, and CelebV-Text. Cut each video into clips with a fixed number of frames (for example, 17 or 25), then filter the clips by aesthetic score (computed with DOVER) and optical-flow score (see the OpenSora data-processing pipeline).
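The clip-cutting and filtering steps described above can be sketched as follows. This is a minimal illustration, not the repository's preprocessing code: `clip_windows`, `keep_clip`, and every threshold passed to them are hypothetical names and values, and the scores are assumed to have been computed beforehand (for example with DOVER and an optical-flow estimator).

```python
from typing import List, Optional, Tuple


def clip_windows(total_frames: int, clip_len: int = 17,
                 stride: Optional[int] = None) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges that each span exactly clip_len frames.

    With stride=None the clips do not overlap; trailing frames that cannot
    fill a whole clip are dropped.
    """
    step = clip_len if stride is None else stride
    if clip_len <= 0 or step <= 0:
        raise ValueError("clip_len and stride must be positive")
    last_start = total_frames - clip_len
    return [(s, s + clip_len) for s in range(0, last_start + 1, step)]


def keep_clip(aesthetic: float, flow: float, *, min_aesthetic: float,
              min_flow: float, max_flow: float) -> bool:
    """Keep a clip whose precomputed scores fall inside the chosen bounds.

    The bounds are deliberately required arguments: suitable values are
    dataset-dependent and are not specified in this README.
    """
    return aesthetic >= min_aesthetic and min_flow <= flow <= max_flow


if __name__ == "__main__":
    # A 40-frame video yields two non-overlapping 17-frame clips.
    print(clip_windows(40, clip_len=17))
    # Illustrative bounds only.
    print(keep_clip(5.0, 3.5, min_aesthetic=4.0, min_flow=1.0, max_flow=10.0))
```

Each returned range can then be passed to whatever tool you use to cut the video, and only the clips for which `keep_clip` returns `True` go into the training set.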
You should organize your processed train data into a CSV file, as shown below:
video_path,text,num_frames,height,width,flow
./_JnC_Zj_P7s_22_0to190_extracted.mp4,scenery,17,720,1080,3.529723644
./_JnC_Zj_P7s_22_0to190_extracted.mp4,scenery,17,720,1080,4.014187813
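A CSV in this layout can be produced and checked with Python's standard `csv` module. The sketch below is one possible way to do it; the helper names `write_manifest` and `read_manifest` are hypothetical and are not part of the repository.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO

# Column order taken from the header line shown above.
FIELDS = ["video_path", "text", "num_frames", "height", "width", "flow"]


def write_manifest(rows: Iterable[Dict[str, object]], fh: TextIO) -> None:
    """Write rows to fh with the expected header and column order."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def read_manifest(fh: TextIO) -> List[Dict[str, object]]:
    """Read a manifest back, checking the header and converting numeric fields."""
    reader = csv.DictReader(fh)
    missing = set(FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for r in reader:
        rows.append({
            "video_path": r["video_path"],
            "text": r["text"],
            "num_frames": int(r["num_frames"]),
            "height": int(r["height"]),
            "width": int(r["width"]),
            "flow": float(r["flow"]),
        })
    return rows


if __name__ == "__main__":
    buf = io.StringIO()
    write_manifest([{
        "video_path": "./_JnC_Zj_P7s_22_0to190_extracted.mp4",
        "text": "scenery", "num_frames": 17,
        "height": 720, "width": 1080, "flow": 3.529723644,
    }], buf)
    print(buf.getvalue())
```

Using `csv.DictWriter` rather than manual string joins keeps captions that contain commas or quotes correctly escaped.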
Train
You can use the provided ./train_scripts/train_i2v.sh script for training. The configuration files are located in ./configs/mobilei2v_config/. Before training, download the weights for video-vae and qwen2-0.5B, and update the model paths in the configuration file to point to them.
bash ./train_scripts/train_i2v.sh
Inference
You can use the provided ./test.sh script for inference. Add the reference image, or a video from which the first frame is extracted, to asset/test.txt and pass that file via the --txt_file parameter.
CUDA_VISIBLE_DEVICES=0 python scripts/inference_i2v.py \
--config=./configs/mobilei2v_config/MobileI2V_300M_img512.yaml \
--save_path=humface_1126 \
--model_path=./model/hybrid_371.pth \
--txt_file=asset/test.txt \
--flow_score=2.0
To speed up VAE decoding, we replaced the LTX-Video decoder with the Turbo-VAED decoder.
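If your reference is a video and you want to extract its first frame yourself before running inference, a sketch like the one below may help. It assumes the `opencv-python` package is installed, which this README does not confirm, and `first_frame_path` and `extract_first_frame` are hypothetical helpers rather than repository functions. Whether the inference script performs this step on its own is not stated here.

```python
from pathlib import Path


def first_frame_path(video_path: str, out_dir: str) -> Path:
    """Build the output image path: <out_dir>/<video file stem>.png."""
    return Path(out_dir) / (Path(video_path).stem + ".png")


def extract_first_frame(video_path: str, out_dir: str) -> Path:
    """Decode the first frame of video_path and save it as a PNG."""
    import cv2  # assumption: opencv-python is available in your environment

    out_path = first_frame_path(video_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)

    cap = cv2.VideoCapture(video_path)
    try:
        ok, frame = cap.read()  # ok is False if the file cannot be decoded
    finally:
        cap.release()
    if not ok:
        raise RuntimeError(f"could not read a frame from {video_path}")
    if not cv2.imwrite(str(out_path), frame):
        raise RuntimeError(f"could not write {out_path}")
    return out_path
```

The saved image can then be used as the reference image for the inference command above.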
Metrics
Refer to the FVD evaluation script in vidm.
python scripts/evaluate_FVD.py -dir1 path/gts -dir2 path/videos -b 1 -r 32 -n 128 -ns 16 -i3d ./i3d_torchscript.pt
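For orientation, FVD is the Fréchet distance between Gaussians fitted to I3D features of real and generated videos. The NumPy sketch below shows only that distance computation on precomputed feature arrays. It is an illustration under that assumption, not the vidm script, and `frechet_distance` is a hypothetical name.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_videos, feature_dim).
    d^2 = ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^(1/2))
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of C_a C_b, which are real and non-negative for
    # covariance matrices; small negative values are numerical noise.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(64, 4))
    # Identical feature sets give a distance of (numerically) zero.
    print(frechet_distance(real, real))
```

In practice the features come from the I3D network referenced by the `-i3d` argument, and the reported score depends on how many videos and frames are sampled.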
Distillation training
The training data is consistent with that used for base model training. Please refer to our distillation code, which is available in the distillation branch of this repository.
🎯 Mobile Demo
We designed the mobile UI and deployed the model, as shown in the video below:
❤️ Acknowledgements
The MobileI2V code is mainly built on SANA and LTX-Video. The data processing workflow is based on OpenSora. Thanks to the authors of these great works.
Citation
If you find MobileI2V useful, please consider giving us a star and citing it as follows:
@misc{MobileI2V,
title={MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices},
author={Shuai Zhang and Bao Tang and Siyuan Yu and Yueting Zhu and Jingfeng Yao and Ya Zou and Shanglin Yuan and Li Yu and Wenyu Liu and Xinggang Wang},
year={2025},
eprint={2511.21475},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.21475},
}