VideoTuna

🤗🤗🤗 VideoTuna is a useful codebase for text-to-video applications.
🌟 VideoTuna is the first repo that integrates multiple AI video generation models including text-to-video (T2V), image-to-video (I2V), text-to-image (T2I), and video-to-video (V2V) generation for model inference and finetuning (to the best of our knowledge).
🌟 VideoTuna is the first repo that provides comprehensive pipelines in video generation, from fine-tuning to pre-training, continuous training, and post-training (alignment) (to the best of our knowledge).

🔆 Features

🌟 All-in-one framework: Run inference on and fine-tune various up-to-date pre-trained video generation models.
🌟 Continuous training: Keep improving your model with new data.
🌟 Fine-tuning: Adapt pre-trained models to specific domains.
🌟 Human preference alignment: Leverage RLHF to align models with human preferences.
🌟 Post-processing: Enhance and rectify videos with a video-to-video enhancement model.

🔆 Updates

  • [2025-04-22] 🐟 Supported inference for Wan2.1 and Step Video, and fine-tuning for HunyuanVideo T2V, with a unified codebase architecture.
  • [2025-02-03] 🐟 Supported automatic code formatting via PR#27. Thanks @samidarko!
  • [2025-02-01] 🐟 Migrated to Poetry for streamlined dependency and script management (PR#25). Thanks @samidarko!
  • [2025-01-20] 🐟 Supported fine-tuning for Flux-T2I.
  • [2025-01-01] 🐟 Released training for VideoVAE+ in the VideoVAEPlus repo.
  • [2025-01-01] 🐟 Supported inference for Hunyuan Video and Mochi.
  • [2024-12-24] 🐟 Released VideoVAE+: a SOTA video VAE model, now available in this repo! It achieves better video reconstruction than NVIDIA's Cosmos-Tokenizer.
  • [2024-12-01] 🐟 Supported inference for CogVideoX-1.5-T2V&I2V and Video-to-Video Enhancement from ModelScope.
  • [2024-12-01] 🐟 Supported fine-tuning for CogVideoX.
  • [2024-11-01] 🐟 🎉 Released VideoTuna v0.1.0!
    Initial support includes inference for VideoCrafter1-T2V&I2V, VideoCrafter2-T2V, DynamiCrafter-I2V, OpenSora-T2V, CogVideoX-1-2B-T2V, CogVideoX-1-T2V, Flux-T2I, and training/fine-tuning of VideoCrafter, DynamiCrafter, and Open-Sora.

🔆 Get started

1. Prepare environment

(1) If you use Linux and Conda (Recommended)

conda create -n videotuna python=3.10 -y
conda activate videotuna
pip install poetry
poetry install
  • ↑ It takes around 3 minutes.
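
A quick smoke test of the environment (a hedged example: it assumes only that PyTorch is among the installed dependencies, which the models below require):

# hypothetical check, not an official repo command
poetry run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"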

Optional: Flash-attn installation

The Hunyuan model uses flash-attn to reduce memory usage and speed up inference. If it is not installed, the model runs in normal mode. Install flash-attn via:

poetry run install-flash-attn
  • ↑ It takes 1 minute.
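
To confirm the install worked (a hedged check; flash_attn is the import name of the flash-attn package):

# hypothetical check, not an official repo command
poetry run python -c "import flash_attn; print(flash_attn.__version__)"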

Optional: Video-to-video enhancement

poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
  • If this command ↑ gets stuck, kill it and re-run; that should resolve the issue.

(2) If you use Linux and Poetry (without Conda):

Install Poetry: https://python-poetry.org/docs/#installation
Then:

poetry config virtualenvs.in-project true # optional but recommended: creates the virtual env in the project root
poetry config virtualenvs.create true # ensures Poetry creates a virtual env if none exists (this is the default)
poetry env use python3.10 # creates the virtual env; verify with `ls -l .venv`
poetry env activate # optional: Poetry commands (e.g. `poetry install`, `poetry run <command>`) load the virtual env automatically
poetry install

Optional: Flash-attn installation

The Hunyuan model uses flash-attn to reduce memory usage and speed up inference. If it is not installed, the model runs in normal mode. Install flash-attn via:

poetry run install-flash-attn

Optional: Video-to-video enhancement

poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
  • If this command ↑ gets stuck, kill it and re-run; that should resolve the issue.

(3) If you use macOS

On macOS with an Apple Silicon chip, use Docker Compose, because some dependencies do not support arm64 (e.g. bitsandbytes, decord, xformers).

First build:

docker compose build videotuna

To preserve the project's file permissions, set these environment variables:

export HOST_UID=$(id -u)
export HOST_GID=$(id -g)

Install dependencies:

docker compose run --remove-orphans videotuna poetry env use /usr/local/bin/python
docker compose run --remove-orphans videotuna poetry run python -m pip install --upgrade pip setuptools wheel
docker compose run --remove-orphans videotuna poetry install
docker compose run --remove-orphans videotuna poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Note: installing swissarmytransformer might hang. Just try again and it should work.

Add a dependency:

docker compose run --remove-orphans videotuna poetry add wheel

Check dependencies:

docker compose run --remove-orphans videotuna poetry run pip freeze

Run Poetry commands:

docker compose run --remove-orphans videotuna poetry run format

Start a terminal:

docker compose run -it --remove-orphans videotuna bash

2. Prepare checkpoints
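
Which checkpoints you need depends on the models you plan to run. The snippet below is a purely hypothetical illustration, not the repo's official download flow: it assumes weights are hosted on Hugging Face and stored under a local checkpoints/ directory, using tencent/HunyuanVideo as the example model id. Check the repository's checkpoint docs for the exact models and directory layout each pipeline expects.

# hypothetical example; paths and layout are assumptions
pip install "huggingface_hub[cli]"
huggingface-cli download tencent/HunyuanVideo --local-dir checkpoints/hunyuanvideo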

3. Run inference with state-of-the-art T2V/I2V/T2I models

Run the following commands to perform inference. T2V/T2I commands read the prompts in inputs/t2v/prompts.txt; I2V commands read the images and prompts in inputs/i2v/576x1024. See the sketch of the prompt file below.
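
A minimal sketch of inputs/t2v/prompts.txt, assuming the common one-prompt-per-line format (the prompts themselves are made up for illustration):

A corgi running on the beach at sunset
A timelapse of storm clouds rolling over a mountain lake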

T2V

| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|------|-------|---------|------------------|------------|----------------|-----------------|
| T2V | HunyuanVideo | `poetry run inference-hunyuan-t2v` | 129 | 720x1280 | 32min | 60 |
| T2V | WanVideo | `poetry run inference-wanvideo-t2v-720p` | 81 | 720x1280 | 32min | 70 |
| T2V | StepVideo | `poetry run inference-stepvideo-t2v-544x992` | 51 | 544x992 | 8min | 61 |
| T2V | Mochi | `poetry run inference-mochi` | 84 | 480x848 | 2min | 26 |
| T2V | CogVideoX-5b | `poetry run inference-cogvideo-t2v-diffusers` | 49 | 480x720 | 2min | 3 |
| T2V | CogVideoX-2b | `poetry run inference-cogvideo-t2v-diffusers` | 49 | 480x720 | 2min | 3 |
| T2V | Open Sora V1.0 | `poetry run inference-opensora-v10-16x256x256` | 16 | 256x256 | 11s | 24 |
| T2V | VideoCrafter-V2-320x512 | `poetry run inference-vc2-t2v-320x512` | 16 | 320x512 | 26s | 11 |
| T2V | VideoCrafter-V1-576x1024 | `poetry run inference-vc1-t2v-576x1024` | 16 | 576x1024 | 2min | 15 |

I2V

| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|------|-------|---------|------------------|------------|----------------|-----------------|
| I2V | WanVideo | `poetry run inference-wanvideo-i2v-720p` | 81 | 720x1280 | 28min | 77 |
| I2V | HunyuanVideo | `poetry run inference-hunyuan-i2v-720p` | 129 | 720x1280 | 29min | 43 |
| I2V | CogVideoX-5b-I2V | `poetry run inference-cogvideox-15-5b-i2v` | 49 | 480x720 | 5min | 5 |
| I2V | DynamiCrafter | `poetry run inference-dc-i2v-576x1024` | 16 | 576x1024 | 2min | 53 |
| I2V | VideoCrafter-V1 | `poetry run inference-vc1-i2v-320x512` | 16 | 320x512 | 26s | 11 |

T2I

| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|------|-------|---------|------------------|------------|----------------|-----------------|
| T2I | Flux-dev | `poetry run inference-flux-dev` | 1 | 768x1360 | 4s | 37 |
| T2I | Flux-dev | `poetry run inference-flux-dev --enable_vae_tiling --enable_sequential_cpu_offload` | 1 | 768x1360 | 4.2min | 2 |
| T2I | Flux-schnell | `poetry run inference-flux-schnell` | 1 | 768x1360 | 1s | 37 |
| T2I | Flux-schnell | `poetry run inference-flux-schnell --enable_vae_tiling --enable_sequential_cpu_offload` | 1 | 768x1360 | 24s | 2 |
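
Note: in the Flux rows above, --enable_vae_tiling and --enable_sequential_cpu_offload trade speed for memory, cutting GPU usage from 37 GB to 2 GB while raising Flux-dev inference time from 4s to 4.2min.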

4. Fine-tune T2V/I2V/T2I models

(1) Prepare dataset

Please follow docs/datasets.md to try the provided toy dataset or build your own dataset.
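
As a purely hypothetical sketch of what a clip-plus-caption dataset often looks like (docs/datasets.md defines the actual format this repo expects; the paths below are made up):

data/toy_dataset/
├── videos/
│   ├── clip_000.mp4    # hypothetical clip files
│   └── clip_001.mp4
└── captions.csv        # e.g. one row per clip: video_path, caption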

(2) Fine-tune

All training commands were tested on H800 80G GPUs.

T2V

| Task | Model | Mode | Command | More Details | #GPUs |
|------|-------|------|---------|--------------|-------|
| T2V | Wan Video | LoRA Fine-tune | `poetry run train-wan2-1-t2v-lora` | docs/finetune_wan.md | 1 |
| T2V | Wan Video | Full Fine-tune | `poetry run train-wan2-1-t2v-fullft` | docs/finetune_wan.md | 1 |
| T2V | Hunyuan Video | LoRA Fine-tune | `poetry run train-hunyuan-t2v-lora` | docs/finetune_hunyuanvideo.md | 2 |
| T2V | CogvideoX | LoRA Fine-tune | `poetry run train-cogvideox-t2v-lora` | docs/finetune_cogvideox.md | 1 |
| T2V | CogvideoX | Full Fine-tune | `poetry run train-cogvideox-t2v-fullft` | docs/finetune_cogvideox.md | 4 |
| T2V | Open-Sora v1.0 | Full Fine-tune | `poetry run train-opensorav10` | - | 1 |
| T2V | VideoCrafter | LoRA Fine-tune | `poetry run train-videocrafter-lora` | docs/finetune_videocrafter.md | 1 |
| T2V | VideoCrafter | Full Fine-tune | `poetry run train-videocrafter-v2` | docs/finetune_videocrafter.md | 1 |
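
LoRA fine-tuning trains only low-rank adapter weights rather than the full model, which is why it typically fits on fewer GPUs than full fine-tuning in these tables (e.g. 1 GPU for CogvideoX LoRA vs 4 for full fine-tuning).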

I2V

| Task | Model | Mode | Command | More Details | #GPUs |
|------|-------|------|---------|--------------|-------|
| I2V | Wan Video | LoRA Fine-tune | `poetry run train-wan2-1-i2v-lora` | docs/finetune_wan.md | 1 |
| I2V | Wan Video | Full Fine-tune | `poetry run train-wan2-1-i2v-fullft` | docs/finetune_wan.md | 1 |
| I2V | CogvideoX | LoRA Fine-tune | `poetry run train-cogvideox-i2v-lora` | docs/finetune_cogvideox.md | 1 |
| I2V | CogvideoX | Full Fine-tune | `poetry run train-cogvideox-i2v-fullft` | docs/finetune_cogvideox.md | 4 |

T2I

| Task | Model | Mode | Command | More Details | #GPUs |
|------|-------|------|---------|--------------|-------|
| T2I | Flux | LoRA Fine-tune | `poetry run train-flux-lora` | docs/finetune_flux.md | 1 |

5. Evaluation

We support VBench evaluation for measuring T2V generation performance. Please check eval/README.md for details.

Contribute

Git hooks

Git hooks are handled with the pre-commit library.

Hooks installation

Run the following commands to install the hooks. They will check formatting, linting, and types on every commit.

poetry run pre-commit install
poetry run pre-commit install --hook-type commit-msg

Running the hooks without committing

poetry run pre-commit run --all-files

Acknowledgement

We thank the following repos for sharing their awesome models and code!

  • Wan2.1: Wan: Open and Advanced Large-Scale Video Generative Models.
  • HunyuanVideo: A Systematic Framework For Large Video Generation Model.
  • Step-Video: A text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames.
  • Mochi: A new SOTA in open-source video generation models.
  • VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models.
  • VideoCrafter1: Open Diffusion Models for High-Quality Video Generation.
  • DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors.
  • Open-Sora: Democratizing Efficient Video Production for All.
  • CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer.
  • VADER: Video Diffusion Alignment via Reward Gradients.
  • VBench: Comprehensive Benchmark Suite for Video Generative Models.
  • Flux: Text-to-image models from Black Forest Labs.
  • SimpleTuner: A fine-tuning kit for text-to-image generation.

Some Resources

🍻 Contributors

📋 License

Please follow the CC-BY-NC-ND license. If you want a license authorization, please contact the project leads Yingqing He (yhebm@connect.ust.hk) and Yazhou Xing (yxingag@connect.ust.hk).

😊 Citation

@software{videotuna,
  author = {Yingqing He and Yazhou Xing and Zhefan Rao and Haoyu Wu and Zhaoyang Liu and Jingye Chen and Pengjun Fang and Jiajun Li and Liya Ji and Runtao Liu and Xiaowei Chi and Yang Fei and Guocheng Shao and Yue Ma and Qifeng Chen},
  title = {VideoTuna: A Powerful Toolkit for Video Generation with Model Fine-Tuning and Post-Training},
  month = {Nov},
  year = {2024},
  url = {https://github.com/VideoVerses/VideoTuna}
}

Star History
