SmoothVideo
November 25, 2023
This repository is the official implementation of Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning.
Setup
This implementation is based on Tune-A-Video.
Requirements
pip install -r requirements.txt
Installing xformers is highly recommended for better memory efficiency and speed on GPUs.
To enable xformers, set enable_xformers_memory_efficient_attention=True (default).
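Since xformers is an optional dependency, it can help to check for it before turning the option on. The sketch below is a minimal illustration, not part of this repository: the helper names `xformers_available` and `maybe_enable_xformers` are hypothetical, and `pipe` is assumed to be a pipeline object exposing `enable_xformers_memory_efficient_attention()`, as used in the inference snippet later in this README.

```python
import importlib.util


def xformers_available() -> bool:
    """Return True if the xformers package can be found in this environment."""
    return importlib.util.find_spec("xformers") is not None


def maybe_enable_xformers(pipe) -> bool:
    """Enable memory-efficient attention on `pipe` only when xformers is installed.

    Returns True if the option was enabled, False if xformers is missing.
    """
    if not xformers_available():
        return False
    pipe.enable_xformers_memory_efficient_attention()
    return True
```

With this guard, the same script can run on machines with or without xformers, at the cost of higher memory use when it is missing.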
Weights
[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from any text input. The pre-trained Stable Diffusion models can be downloaded from Hugging Face (e.g., Stable Diffusion v1-5).
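After downloading, a quick sanity check of the checkpoint folder can catch an incomplete download before training starts. The sketch below assumes the weights are stored in the usual diffusers directory layout (subfolders `scheduler`, `text_encoder`, `tokenizer`, `unet`, `vae`). The helper `missing_components` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Assumption: diffusers-style checkpoint layout, one subfolder per component.
EXPECTED_SUBFOLDERS = ("scheduler", "text_encoder", "tokenizer", "unet", "vae")


def missing_components(checkpoint_dir):
    """Return the expected component subfolders that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]
```

For example, `missing_components("./checkpoints/stable-diffusion-v1-5")` returns an empty list when every expected subfolder is present.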
Usage
Training
To fine-tune the text-to-image diffusion models for text-to-video generation, run this command for the baseline model:
accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml"
Run this command for the baseline model with the proposed smooth loss:
accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml" --smooth_loss
Run this command for the baseline model with the proposed simple smooth loss:
accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml" --smooth_loss --simple_manner
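To run the three variants from a script (for example, to compare them back to back), the commands above can be assembled programmatically. This is a minimal sketch: the variant names `baseline`, `smooth`, and `simple_smooth` are labels made up for illustration, while the flags themselves are exactly those shown above.

```python
import shlex

# Hypothetical variant labels mapped to the flags listed in this README.
VARIANT_FLAGS = {
    "baseline": [],
    "smooth": ["--smooth_loss"],
    "simple_smooth": ["--smooth_loss", "--simple_manner"],
}


def build_train_command(variant, config="configs/man-skiing.yaml"):
    """Return the argv list for one training variant."""
    return [
        "accelerate", "launch", "train_tuneavideo.py",
        f"--config={config}",
        *VARIANT_FLAGS[variant],
    ]


if __name__ == "__main__":
    for name in VARIANT_FLAGS:
        # Print each command; pass the list to subprocess.run(..., check=True) to execute it.
        print(shlex.join(build_train_command(name)))
```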
Note: Tuning a 24-frame video usually takes 300 to 500 steps, or about 10 to 15 minutes on one A100 GPU.
Reduce n_sample_frames if your GPU memory is limited.
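As an illustration of lowering the frame count, the sketch below edits a config that has already been loaded into a nested dictionary. It assumes the key layout of the upstream Tune-A-Video configs (`train_data.n_sample_frames`, plus an optional `validation_data.video_length` that is kept in sync). Check the YAML file in `configs/` before relying on these names. The helper `with_fewer_frames` is hypothetical.

```python
import copy


def with_fewer_frames(config, n_frames):
    """Return a copy of `config` with the number of sampled frames set to n_frames."""
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    # Assumed key: train_data.n_sample_frames (as in upstream Tune-A-Video configs).
    updated["train_data"]["n_sample_frames"] = n_frames
    # Assumed key: validation_data.video_length, kept equal to the training frame count.
    if "video_length" in updated.get("validation_data", {}):
        updated["validation_data"]["video_length"] = n_frames
    return updated
```

If the frame count is reduced for training, the `video_length` argument passed at inference time presumably needs to match the saved inverted latent as well.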
Inference
Once the training is done, run inference:
import torch

from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.util import save_videos_grid

# Paths to the pre-trained Stable Diffusion weights and the fine-tuned output directory.
pretrained_model_path = "./checkpoints/stable-diffusion-v1-5"
my_model_path = "./outputs/man-skiing"

# Load the fine-tuned UNet and plug it into the pipeline, both in half precision on the GPU.
unet = UNet3DConditionModel.from_pretrained(my_model_path, subfolder='unet', torch_dtype=torch.float16).to('cuda')
pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention()  # requires xformers to be installed
pipe.enable_vae_slicing()

prompt = "spider man is skiing"
# Start sampling from the DDIM-inverted latent stored in the output directory.
ddim_inv_latent = torch.load(f"{my_model_path}/inv_latents/ddim_latent-500.pt").to(torch.float16)
video = pipe(prompt, latents=ddim_inv_latent, video_length=24, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos

# Save the result as a GIF named after the prompt.
save_videos_grid(video, f"./{prompt}.gif")
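The snippet above uses the raw prompt as the output file name, which can be awkward for prompts containing commas or other punctuation. A minimal sketch of one way to derive a safer name follows. The helper `prompt_to_filename` is hypothetical and not part of this repository.

```python
import re


def prompt_to_filename(prompt, extension="gif"):
    """Turn a text prompt into a lowercase, dash-separated file name."""
    # Replace every run of non-alphanumeric characters with a single dash.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", prompt).strip("-").lower()
    # Fall back to a generic name if nothing usable is left.
    return f"{slug or 'video'}.{extension}"
```

For example, `prompt_to_filename("Spider man is skiing on the beach, cartoon style")` gives `spider-man-is-skiing-on-the-beach-cartoon-style.gif`, which could be passed to `save_videos_grid` in place of `f"./{prompt}.gif"`.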
We provide comparisons with different baselines, as follows:
Results
Tune-A-Video
Comparisons to Tune-A-Video.
| Input video | Tune-A-Video | | | |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| Input video | Tune-A-Video + smooth loss | | | |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| A jeep car is moving on the road | A jeep car is moving on the beach | A jeep car is moving on the snow | A jeep car is moving on the road, cartoon style | A sports car is moving on the road |
| Input video | Tune-A-Video | | | |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| Input video | Tune-A-Video + smooth loss | | | |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| A rabbit is eating a watermelon | A tiger is eating a watermelon | A rabbit is eating an orange | A rabbit is eating a pizza | A puppy is eating an orange |
| Input video | Tune-A-Video | | | |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| Input video | Tune-A-Video + smooth loss | | | |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| A man is skiing | Mickey mouse is skiing on the snow | Spider man is skiing on the beach, cartoon style | Wonder woman, wearing a cowboy hat, is skiing | A man, wearing pink clothes, is skiing at sunset |
Make-A-Protagonist
Comparisons to Make-A-Protagonist.
| Input video | Make-A-Protagonist | Make-A-Protagonist + smooth loss |
| ![]() | ![]() | ![]() |
| A jeep driving down a mountain road | A jeep driving down a mountain road in the rain | |
| ![]() | ![]() | ![]() |
| A man is playing basketball | A man is playing a basketball on the beach, anime style | |
| ![]() | ![]() | ![]() |
| A man walking down the street at night | A panda walking down the snowy street | |
| ![]() | ![]() | ![]() |
| A man walking down the street | Elon musk walking down the street | |
ControlVideo
Comparisons to ControlVideo.
| Input video | Condition | ControlVideo | ControlVideo + smooth loss |
| ![]() | ![]() | ![]() | ![]() |
| A person is dancing | Pose condition | Michael Jackson is dancing | |
| ![]() | ![]() | ![]() | ![]() |
| A person is dancing | Pose condition | A person is dancing, Makoto Shinkai style | |
| ![]() | ![]() | ![]() | ![]() |
| A building | Canny edge condition | A wooden building, at night | |
| ![]() | ![]() | ![]() | ![]() |
| A girl | Hed edge condition | A girl, Krenz Cushart style | |
| ![]() | ![]() | ![]() | ![]() |
| A girl | Hed edge condition | A girl with rich makeup | |
| ![]() | ![]() | ![]() | ![]() |
| Ink diffuses in water | Depth condition | Gentle green ink diffuses in water, beautiful light | |
Video2Video-zero
Comparisons to training-free methods.
| Input video | Instruct Video2Video-zero | Instruct Video2Video-zero + noise constraint | Video InstructPix2Pix | Video InstructPix2Pix + noise constraint |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| Instruct: Make it animation | | | | |



































































