EfficientMT
October 3, 2025
Official implementation of the paper "EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models" [ICCV 2025].
EfficientMT is an efficient end-to-end model for video motion transfer. It adapts a pretrained T2V model into a motion transfer framework by training on a small amount of synthetic data. Flexible subject and camera motion transfer can then be performed in a few seconds without additional optimization steps.
EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models
Yufei Cai, Hu Han, Yuxiang Wei, Shiguang Shan, Xilin Chen
https://github.com/user-attachments/assets/15540f91-998c-4320-a1bb-973a4c7f29e5
Gallery
Online viewing may cause motion desync. We recommend downloading the page to view locally for a better preview.
Subject Motion
[Video table: subject motion transfer results]
Camera Motion
[Video table: camera motion transfer results]
Quick Start
Environment Setup
Clone this repo and install the dependencies listed in requirements.txt.
git clone https://github.com/PrototypeNx/EfficientMT.git
cd EfficientMT
conda create -n efficientmt python=3.10
conda activate efficientmt
pip install -r requirements.txt
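After installation, it can be useful to confirm that every package named in requirements.txt is present in the active environment. The following is a minimal sketch rather than part of the repository: the helper names requirement_names and missing_packages are hypothetical, and the parser handles only plain "name==version" style lines, skipping option lines and URL requirements.

```python
import re
from importlib import metadata
from pathlib import Path


def requirement_names(text):
    """Extract package names from simple requirements.txt content."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        # Skip blanks, pip options such as "-r other.txt", and URL requirements.
        if not line or line.startswith("-") or "://" in line:
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    req_file = Path("requirements.txt")
    if req_file.exists():
        missing = missing_packages(requirement_names(req_file.read_text()))
        print("Missing packages:", ", ".join(missing) if missing else "none")
```

Run it from the repository root inside the efficientmt environment. It only checks that each package is installed, not that the installed version satisfies the pinned constraint.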
Model Download
Download the pretrained stable-diffusion-v1-5 adopted by AnimateDiff and put it in models/StableDiffusion/
git lfs install
git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 models/StableDiffusion/
This project uses the original stable-diffusion-v1-5 weights, i.e. unet/diffusion_pytorch_model.bin (3.44GB). Alternatively, you can manually download each file and put it in models/StableDiffusion/stable-diffusion-v1-5 (3.97GB in total).
Download the weights uploaded to HuggingFace or ModelScope and put them in models/.
Prepare the motion module v3_sd15_mm.ckpt adopted from AnimateDiff in models/Motion_Module.
Prepare the visual enhancement LoRA module realisticVisionV60B1_v51VAE.safetensors adopted from RealisticVisionV6.0 in models/DreamBooth_LoRA.
Prepare the pretrained integrated temporal attention weights checkpoint.ckpt in models/Integrated_Attention.
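Before running inference, the layout described above can be checked with a short script. This is a sketch under assumptions: the helper missing_files is hypothetical, and the Stable Diffusion path is assembled from the directory and file names mentioned in this section, so adjust it if your checkout places the files elsewhere.

```python
from pathlib import Path

# Paths are taken from the instructions above; the first entry combines the
# stated directory and file name and is therefore an assumption.
EXPECTED_FILES = [
    "models/StableDiffusion/stable-diffusion-v1-5/unet/diffusion_pytorch_model.bin",
    "models/Motion_Module/v3_sd15_mm.ckpt",
    "models/DreamBooth_LoRA/realisticVisionV60B1_v51VAE.safetensors",
    "models/Integrated_Attention/checkpoint.ckpt",
]


def missing_files(root="."):
    """Return the expected model files that are absent under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing model files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected model files found.")
```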
Inference
Run the following script to generate the motion transfer results:
python inference.py --config configs/prompts/inference.yaml \
    --ckpt_path models/Integrated_Attention/checkpoint.ckpt \
    --ref_video_path assets/references/sample_white_tiger.mp4 \
    --prompt "cat walking on the beach."
Modify the ref_video_path and prompt to select the reference video and customize your motion transfer prompt.
Alternatively, run the provided script:
bash infer.sh
The motion transfer results are saved in outputs/<ref_video_name>/.
The Gallery results can be reproduced using the same prompts and the corresponding reference videos in assets/references.
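To run several reference video and prompt pairs in one go, the inference command can be wrapped in a small driver. The sketch below uses only the flags shown in the command above. The helpers build_command and run_all and the JOBS list are hypothetical, and the single job listed is the example from this README.

```python
import shlex
import subprocess
import sys

CONFIG = "configs/prompts/inference.yaml"
CKPT = "models/Integrated_Attention/checkpoint.ckpt"

# (reference video, prompt) pairs; extend this list with your own jobs.
JOBS = [
    ("assets/references/sample_white_tiger.mp4", "cat walking on the beach."),
]


def build_command(ref_video, prompt, config=CONFIG, ckpt=CKPT):
    """Assemble the inference.py invocation as an argument list."""
    return [
        sys.executable, "inference.py",
        "--config", config,
        "--ckpt_path", ckpt,
        "--ref_video_path", str(ref_video),
        "--prompt", prompt,
    ]


def run_all(jobs, dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for ref_video, prompt in jobs:
        cmd = build_command(ref_video, prompt)
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop on the first failure


if __name__ == "__main__":
    run_all(JOBS, dry_run=True)
```

With dry_run=True the script only prints the commands, which makes it easy to review them before switching to dry_run=False from the repository root.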
Technical Explanation
Overview of EfficientMT. (a) We reuse the backbone of the T2V model to extract reference features, which are then injected into the temporal attention layers of the upsampling stage through a temporal integration mechanism. (b) The scaler predicts a fine-grained scale map for the reference features, filtering out irrelevant information. (c) The temporal integration concatenates the features along the temporal axis: the query is projected from the original features, while the key and value are obtained from the integrated features.
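The temporal integration in (c) can be illustrated with a toy, framework-free sketch for a single spatial location. It is not the repository's implementation and rests on several assumptions: the learned query, key, and value projections are replaced by the identity, the scale map is applied by elementwise multiplication with one scalar per reference frame, and the function name temporal_integration_attention is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_integration_attention(gen_feats, ref_feats, scale_map):
    """Toy temporal attention with reference features integrated in time.

    gen_feats: list of F frames, each a list of C channels (original features).
    ref_feats: list of F_ref frames, each a list of C channels (reference).
    scale_map: list of F_ref scalars standing in for the scaler's output.
    """
    # Assumption: the scale map reweights reference features by multiplication.
    scaled_ref = [[s * v for v in frame] for s, frame in zip(scale_map, ref_feats)]
    # Concatenate original and reference features along the temporal axis.
    integrated = gen_feats + scaled_ref
    dim = len(gen_feats[0])
    outputs = []
    # Queries come from the original features only; keys and values come
    # from the integrated sequence (identity projections in this sketch).
    for query in gen_feats:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in integrated
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, integrated))
            for c in range(dim)
        ])
    return outputs


if __name__ == "__main__":
    gen = [[1.0, 0.0], [0.0, 1.0]]
    ref = [[1.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
    out = temporal_integration_attention(gen, ref, [1.0, 0.5, 0.0])
    print(len(out), "output frames,", len(out[0]), "channels each")
```

Because only the original frames act as queries, the output keeps the original number of frames even though attention spans both the original and the reference frames.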
Contact Us
BibTeX
@article{cai2025efficientmt,
title={EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models},
author={Yufei Cai and Hu Han and Yuxiang Wei and Shiguang Shan and Xilin Chen},
journal={arXiv preprint arXiv:2503.19369},
year={2025}
}