SAVE: Protagonist Diversification with Structure Agnostic Video Editing (ECCV 2024)
November 22, 2024
This repository contains the official implementation of SAVE: Protagonist Diversification with Structure Agnostic Video Editing.
Teaser
🐱 A cat is roaring → 🐶 A dog is <S_mot> / 🐯 A tiger is <S_mot>


A man is skiing → 🐻 A bear is <S_mot> / 🐭 Mickey-Mouse is <S_mot>


SAVE reframes video editing as a motion inversion problem: it searches the textual embedding space for a motion word <S_mot> that represents the motion in a source video. Editing then amounts to isolating the motion of a single source video with <S_mot> and changing the protagonist accordingly.
Setup
Requirements
pip install -r requirements.txt
Weights
We use Stable Diffusion v1-4 as our base text-to-image model and fine-tune it on a reference video for text-to-video generation. Example video weights are available on Google Drive.
Training
To fine-tune the text-to-image diffusion models on a custom video, run this command:
python run_train.py --config configs/<video-name>-train.yaml
Configuration file <video-name>-train.yaml contains the following arguments:
- output_dir - Directory to save the weights.
- placeholder_tokens - Pseudo words separated by |, e.g., <s1>|<s2>.
- initializer_tokens - Initialization words separated by |, e.g., cat|roaring.
- sentence_component - Use <o> for appearance words and <v> for motion words, e.g., <o>|<v>.
- num_s1_train_epochs - Number of epochs for appearance pre-registration.
- exp_localization_weight - Weight for the cross-attention loss (recommended range is 1e-4 to 5e-4).
- train_data: video_path - Path to the source video.
- train_data: prompt - Source prompt that includes the pseudo words in placeholder_tokens, e.g., a <s1> cat is <s2>.
- n_sample_frames - Number of frames.
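To make the relationship between the `|`-separated fields concrete, here is a minimal Python sketch that collects the training arguments in a dictionary and checks that they line up. The helper names (`split_tokens`, `check_train_config`), the paths, and the numeric values are illustrative assumptions, not defaults from this repository. The rule that all three `|`-separated fields have the same number of entries is inferred from the examples above, and the dictionary layout simply mirrors the argument list, so the real YAML file may be organized differently.

```python
def split_tokens(field: str) -> list[str]:
    """Split a '|'-separated config field into its entries."""
    return [part.strip() for part in field.split("|")]


def check_train_config(config: dict) -> list[tuple[str, str, str]]:
    """Pair each pseudo word with its initializer word and component tag.

    Hypothetical helper: raises ValueError if the '|'-separated fields
    disagree in length (an assumed rule), if a component tag is neither
    <o> nor <v>, or if the source prompt omits one of the pseudo words.
    """
    placeholders = split_tokens(config["placeholder_tokens"])
    initializers = split_tokens(config["initializer_tokens"])
    components = split_tokens(config["sentence_component"])

    if not (len(placeholders) == len(initializers) == len(components)):
        raise ValueError("'|'-separated fields must have the same number of entries")

    for tag in components:
        if tag not in ("<o>", "<v>"):
            raise ValueError(f"unknown sentence component {tag}")

    prompt = config["train_data"]["prompt"]
    for token in placeholders:
        if token not in prompt:
            raise ValueError(f"prompt is missing pseudo word {token}")

    return list(zip(placeholders, initializers, components))


# Placeholder values for a cat video; paths and numbers are assumptions.
train_config = {
    "output_dir": "outputs/cat",
    "placeholder_tokens": "<s1>|<s2>",
    "initializer_tokens": "cat|roaring",
    "sentence_component": "<o>|<v>",
    "num_s1_train_epochs": 100,        # arbitrary illustrative value
    "exp_localization_weight": 1e-4,   # inside the recommended range
    "train_data": {
        "video_path": "data/cat.mp4",
        "prompt": "a <s1> cat is <s2>",
    },
    "n_sample_frames": 8,              # arbitrary illustrative value
}

pairs = check_train_config(train_config)
for placeholder, initializer, component in pairs:
    kind = "appearance" if component == "<o>" else "motion"
    print(f"{placeholder}: initialized from '{initializer}' as {kind} word")
```

In this reading, `<s1>` is the appearance word initialized from `cat`, and `<s2>` is the motion word initialized from `roaring`.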
Video Editing
Once the updated weights are prepared, run this command:
python run_inference.py --config configs/<video-name>-inference.yaml
Configuration file <video-name>-inference.yaml contains the following arguments:
- pretrained_model_path - Directory to the saved weights.
- image_path - Path to the source video.
- placeholder_tokens - Pseudo words separated by |, e.g., <s1>|<s2>.
- sentence_component - Use <o> for appearance words and <v> for motion words, e.g., <o>|<v>.
- prompt - Source prompt that includes the pseudo words in placeholder_tokens, e.g., a <s1> cat is <s2>.
- prompts - List of source and editing prompts, e.g., [a <s1> cat is <s2>, a dog is <s2>].
- blend_word - List of protagonists in the source and edited videos, e.g., [cat, dog].
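In the example above, the editing prompt keeps the motion word `<s2>`, drops the appearance word `<s1>`, and swaps the protagonist `cat` for `dog`. The following Python sketch derives an editing prompt in that way and collects the inference arguments in a dictionary. The helper `make_edit_prompt` and the paths are hypothetical and not part of this repository; the sketch only illustrates how `prompt`, `prompts`, and `blend_word` relate to each other.

```python
def make_edit_prompt(source_prompt: str, appearance_token: str,
                     source_word: str, target_word: str) -> str:
    """Swap the protagonist and drop its appearance pseudo word.

    Hypothetical helper: the motion pseudo word is left untouched so the
    edited video reuses the motion learned from the source video.
    """
    edited = []
    for word in source_prompt.split():
        if word == appearance_token:
            continue  # the new protagonist has no learned appearance word
        edited.append(target_word if word == source_word else word)
    return " ".join(edited)


source_prompt = "a <s1> cat is <s2>"
edit_prompt = make_edit_prompt(source_prompt, "<s1>", "cat", "dog")

# Placeholder paths; the keys follow the argument list above.
inference_config = {
    "pretrained_model_path": "outputs/cat",
    "image_path": "data/cat.mp4",
    "placeholder_tokens": "<s1>|<s2>",
    "sentence_component": "<o>|<v>",
    "prompt": source_prompt,
    "prompts": [source_prompt, edit_prompt],
    "blend_word": ["cat", "dog"],
}

# Each blend word should appear in the prompt at the same list position.
for word, prompt in zip(inference_config["blend_word"], inference_config["prompts"]):
    if word not in prompt.split():
        raise ValueError(f"'{word}' does not appear in prompt '{prompt}'")

print(inference_config["prompts"])
```

Swapping `dog` for another protagonist such as `tiger` gives the other edit shown in the teaser.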
Citation
@inproceedings{song2025save,
title={Save: Protagonist diversification with structure agnostic video editing},
author={Song, Yeji and Shin, Wonsik and Lee, Junsoo and Kim, Jeesoo and Kwak, Nojun},
booktitle={European Conference on Computer Vision},
pages={41--57},
year={2025},
organization={Springer}
}
Acknowledgements
This code builds upon diffusers, Tune-A-Video and Video-P2P. Thank you for open-sourcing!