VideoTuna

🤗🤗🤗 VideoTuna is a useful codebase for text-to-video applications.
🌟 VideoTuna is the first repo that integrates multiple AI video generation models including text-to-video (T2V), image-to-video (I2V), text-to-image (T2I), and video-to-video (V2V) generation for model inference and finetuning (to the best of our knowledge).
🌟 VideoTuna is the first repo that provides comprehensive pipelines in video generation, from fine-tuning to pre-training, continuous training, and post-training (alignment) (to the best of our knowledge).

🔆 Features

🌟 All-in-one framework: Run inference on and fine-tune various up-to-date pre-trained video generation models.
🌟 Continuous training: Keep improving your model with new data.
🌟 Fine-tuning: Adapt pre-trained models to specific domains.
🌟 Human preference alignment: Leverage RLHF to align models with human preferences.
🌟 Post-processing: Enhance and rectify videos with a video-to-video enhancement model.

🔆 Updates

  • [2025-04-22] 🐟 Supported inference for Wan2.1 and Step Video, and fine-tuning for HunyuanVideo T2V, with a unified codebase architecture.
  • [2025-02-03] 🐟 Supported automatic code formatting via PR#27. Thanks @samidarko!
  • [2025-02-01] 🐟 Migrated to Poetry for streamlined dependency and script management (PR#25). Thanks @samidarko!
  • [2025-01-20] 🐟 Supported fine-tuning for Flux-T2I.
  • [2025-01-01] 🐟 Released training for VideoVAE+ in the VideoVAEPlus repo.
  • [2025-01-01] 🐟 Supported inference for Hunyuan Video and Mochi.
  • [2024-12-24] 🐟 Released VideoVAE+: a SOTA video VAE model, now available in this repo! It achieves better video reconstruction than NVIDIA's Cosmos-Tokenizer.
  • [2024-12-01] 🐟 Supported inference for CogVideoX-1.5-T2V&I2V and Video-to-Video Enhancement from ModelScope.
  • [2024-12-01] 🐟 Supported fine-tuning for CogVideoX.
  • [2024-11-01] 🐟 🎉 Released VideoTuna v0.1.0!
    Initial support includes inference for VideoCrafter1-T2V&I2V, VideoCrafter2-T2V, DynamiCrafter-I2V, OpenSora-T2V, CogVideoX-1-2B-T2V, CogVideoX-1-T2V, Flux-T2I, and training/fine-tuning of VideoCrafter, DynamiCrafter, and Open-Sora.

🔆 Get started

1. Prepare environment

(1) If you use Linux and Conda (Recommended)

conda create -n videotuna python=3.10 -y
conda activate videotuna
pip install poetry
poetry install
  • ↑ It takes around 3 minutes.
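
A quick smoke test of the environment (a hedged example: it assumes only that PyTorch is among the installed dependencies, which the models below require):

# hypothetical check, not an official repo command
poetry run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"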

Optional: Flash-attn installation

The Hunyuan model uses flash-attn to reduce memory usage and speed up inference. If it is not installed, the model runs in normal mode. Install flash-attn via:

poetry run install-flash-attn
  • ↑ It takes 1 minute.
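
To confirm the install worked (a hedged check; flash_attn is the import name of the flash-attn package):

# hypothetical check, not an official repo command
poetry run python -c "import flash_attn; print(flash_attn.__version__)"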

Optional: Video-to-video enhancement

poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
  • If this command ↑ gets stuck, kill it and re-run; that should resolve the issue.

(2) If you use Linux and Poetry (without Conda):

Install Poetry: https://python-poetry.org/docs/#installation
Then:

poetry config virtualenvs.in-project true # optional but recommended: creates the virtual env in the project root
poetry config virtualenvs.create true # ensures Poetry creates a virtual env if none exists (this is the default)
poetry env use python3.10 # creates the virtual env; verify with `ls -l .venv`
poetry env activate # optional: Poetry commands (e.g. `poetry install`, `poetry run <command>`) load the virtual env automatically
poetry install

Optional: Flash-attn installation

The Hunyuan model uses flash-attn to reduce memory usage and speed up inference. If it is not installed, the model runs in normal mode. Install flash-attn via:

poetry run install-flash-attn

Optional: Video-to-video enhancement

poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
  • If this command ↑ gets stuck, kill it and re-run; that should resolve the issue.

(3) If you use macOS

On macOS with an Apple Silicon chip, use Docker Compose, because some dependencies do not support arm64 (e.g. bitsandbytes, decord, xformers).

First build:

docker compose build videotuna

To preserve the project's file permissions, set these environment variables:

export HOST_UID=$(id -u)
export HOST_GID=$(id -g)

Install dependencies:

docker compose run --remove-orphans videotuna poetry env use /usr/local/bin/python
docker compose run --remove-orphans videotuna poetry run python -m pip install --upgrade pip setuptools wheel
docker compose run --remove-orphans videotuna poetry install
docker compose run --remove-orphans videotuna poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Note: installing swissarmytransformer might hang. Just try again and it should work.

Add a dependency:

docker compose run --remove-orphans videotuna poetry add wheel

Check dependencies:

docker compose run --remove-orphans videotuna poetry run pip freeze

Run Poetry commands:

docker compose run --remove-orphans videotuna poetry run format

Start a terminal:

docker compose run -it --remove-orphans videotuna bash

2. Prepare checkpoints
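
Which checkpoints you need depends on the models you plan to run. The snippet below is a purely hypothetical illustration, not the repo's official download flow: it assumes weights are hosted on Hugging Face and stored under a local checkpoints/ directory, using tencent/HunyuanVideo as the example model id. Check the repository's checkpoint docs for the exact models and directory layout each pipeline expects.

# hypothetical example; paths and layout are assumptions
pip install "huggingface_hub[cli]"
huggingface-cli download tencent/HunyuanVideo --local-dir checkpoints/hunyuanvideo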

3. Run inference with state-of-the-art T2V/I2V/T2I models

Run the following commands to perform inference. T2V/T2I commands read the prompts in inputs/t2v/prompts.txt; I2V commands read the images and prompts in inputs/i2v/576x1024. See the sketch of the prompt file below.
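
A minimal sketch of inputs/t2v/prompts.txt, assuming the common one-prompt-per-line format (the prompts themselves are made up for illustration):

A corgi running on the beach at sunset
A timelapse of storm clouds rolling over a mountain lake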

T2V

| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|------|-------|---------|------------------|------------|----------------|-----------------|
| T2V | HunyuanVideo | `poetry run inference-hunyuan-t2v` | 129 | 720x1280 | 32min | 60 |
| T2V | WanVideo | `poetry run inference-wanvideo-t2v-720p` | 81 | 720x1280 | 32min | 70 |
| T2V | StepVideo | `poetry run inference-stepvideo-t2v-544x992` | 51 | 544x992 | 8min | 61 |
| T2V | Mochi | `poetry run inference-mochi` | 84 | 480x848 | 2min | 26 |
| T2V | CogVideoX-5b | `poetry run inference-cogvideo-t2v-diffusers` | 49 | 480x720 | 2min | 3 |
| T2V | CogVideoX-2b | `poetry run inference-cogvideo-t2v-diffusers` | 49 | 480x720 | 2min | 3 |
| T2V | Open Sora V1.0 | `poetry run inference-opensora-v10-16x256x256` | 16 | 256x256 | 11s | 24 |
| T2V | VideoCrafter-V2-320x512 | `poetry run inference-vc2-t2v-320x512` | 16 | 320x512 | 26s | 11 |
| T2V | VideoCrafter-V1-576x1024 | `poetry run inference-vc1-t2v-576x1024` | 16 | 576x1024 | 2min | 15 |

I2V

| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|------|-------|---------|------------------|------------|----------------|-----------------|
| I2V | WanVideo | `poetry run inference-wanvideo-i2v-720p` | 81 | 720x1280 | 28min | 77 |
| I2V | HunyuanVideo | `poetry run inference-hunyuan-i2v-720p` | 129 | 720x1280 | 29min | 43 |
| I2V | CogVideoX-5b-I2V | `poetry run inference-cogvideox-15-5b-i2v` | 49 | 480x720 | 5min | 5 |
| I2V | DynamiCrafter | `poetry run inference-dc-i2v-576x1024` | 16 | 576x1024 | 2min | 53 |
| I2V | VideoCrafter-V1 | `poetry run inference-vc1-i2v-320x512` | 16 | 320x512 | 26s | 11 |

T2I

| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|------|-------|---------|------------------|------------|----------------|-----------------|
| T2I | Flux-dev | `poetry run inference-flux-dev` | 1 | 768x1360 | 4s | 37 |
| T2I | Flux-dev | `poetry run inference-flux-dev --enable_vae_tiling --enable_sequential_cpu_offload` | 1 | 768x1360 | 4.2min | 2 |
| T2I | Flux-schnell | `poetry run inference-flux-schnell` | 1 | 768x1360 | 1s | 37 |
| T2I | Flux-schnell | `poetry run inference-flux-schnell --enable_vae_tiling --enable_sequential_cpu_offload` | 1 | 768x1360 | 24s | 2 |
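
Note: in the Flux rows above, --enable_vae_tiling and --enable_sequential_cpu_offload trade speed for memory, cutting GPU usage from 37 GB to 2 GB while raising Flux-dev inference time from 4s to 4.2min.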

4. Fine-tune T2V/I2V/T2I models

(1) Prepare dataset

Please follow docs/datasets.md to try the provided toy dataset or build your own dataset.
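
As a purely hypothetical sketch of what a clip-plus-caption dataset often looks like (docs/datasets.md defines the actual format this repo expects; the paths below are made up):

data/toy_dataset/
├── videos/
│   ├── clip_000.mp4    # hypothetical clip files
│   └── clip_001.mp4
└── captions.csv        # e.g. one row per clip: video_path, caption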

(2) Fine-tune

All training commands were tested on H800 80G GPUs.

T2V

| Task | Model | Mode | Command | More Details | #GPUs |
|------|-------|------|---------|--------------|-------|
| T2V | Wan Video | LoRA Fine-tune | `poetry run train-wan2-1-t2v-lora` | docs/finetune_wan.md | 1 |
| T2V | Wan Video | Full Fine-tune | `poetry run train-wan2-1-t2v-fullft` | docs/finetune_wan.md | 1 |
| T2V | Hunyuan Video | LoRA Fine-tune | `poetry run train-hunyuan-t2v-lora` | docs/finetune_hunyuanvideo.md | 2 |
| T2V | CogvideoX | LoRA Fine-tune | `poetry run train-cogvideox-t2v-lora` | docs/finetune_cogvideox.md | 1 |
| T2V | CogvideoX | Full Fine-tune | `poetry run train-cogvideox-t2v-fullft` | docs/finetune_cogvideox.md | 4 |
| T2V | Open-Sora v1.0 | Full Fine-tune | `poetry run train-opensorav10` | - | 1 |
| T2V | VideoCrafter | LoRA Fine-tune | `poetry run train-videocrafter-lora` | docs/finetune_videocrafter.md | 1 |
| T2V | VideoCrafter | Full Fine-tune | `poetry run train-videocrafter-v2` | docs/finetune_videocrafter.md | 1 |
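
LoRA fine-tuning trains only low-rank adapter weights rather than the full model, which is why it typically fits on fewer GPUs than full fine-tuning in these tables (e.g. 1 GPU for CogvideoX LoRA vs 4 for full fine-tuning).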

I2V

| Task | Model | Mode | Command | More Details | #GPUs |
|------|-------|------|---------|--------------|-------|
| I2V | Wan Video | LoRA Fine-tune | `poetry run train-wan2-1-i2v-lora` | docs/finetune_wan.md | 1 |
| I2V | Wan Video | Full Fine-tune | `poetry run train-wan2-1-i2v-fullft` | docs/finetune_wan.md | 1 |
| I2V | CogvideoX | LoRA Fine-tune | `poetry run train-cogvideox-i2v-lora` | docs/finetune_cogvideox.md | 1 |
| I2V | CogvideoX | Full Fine-tune | `poetry run train-cogvideox-i2v-fullft` | docs/finetune_cogvideox.md | 4 |

T2I

| Task | Model | Mode | Command | More Details | #GPUs |
|------|-------|------|---------|--------------|-------|
| T2I | Flux | LoRA Fine-tune | `poetry run train-flux-lora` | docs/finetune_flux.md | 1 |

5. Evaluation

We support VBench evaluation for measuring T2V generation performance. Please check eval/README.md for details.

Contribute

Git hooks

Git hooks are handled with the pre-commit library.

Hooks installation

Run the following commands to install the hooks. They will check formatting, linting, and types on every commit.

poetry run pre-commit install
poetry run pre-commit install --hook-type commit-msg

Running the hooks without committing

poetry run pre-commit run --all-files

Acknowledgement

We thank the following repos for sharing their awesome models and code!

  • Wan2.1: Wan: Open and Advanced Large-Scale Video Generative Models.
  • HunyuanVideo: A Systematic Framework For Large Video Generation Model.
  • Step-Video: A text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames.
  • Mochi: A new SOTA in open-source video generation models.
  • VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models.
  • VideoCrafter1: Open Diffusion Models for High-Quality Video Generation.
  • DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors.
  • Open-Sora: Democratizing Efficient Video Production for All.
  • CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer.
  • VADER: Video Diffusion Alignment via Reward Gradients.
  • VBench: Comprehensive Benchmark Suite for Video Generative Models.
  • Flux: Text-to-image models from Black Forest Labs.
  • SimpleTuner: A fine-tuning kit for text-to-image generation.

Some Resources

🍻 Contributors

📋 License

Please follow the CC-BY-NC-ND license. If you want a license authorization, please contact the project leads Yingqing He (yhebm@connect.ust.hk) and Yazhou Xing (yxingag@connect.ust.hk).

😊 Citation

@software{videotuna,
  author = {Yingqing He and Yazhou Xing and Zhefan Rao and Haoyu Wu and Zhaoyang Liu and Jingye Chen and Pengjun Fang and Jiajun Li and Liya Ji and Runtao Liu and Xiaowei Chi and Yang Fei and Guocheng Shao and Yue Ma and Qifeng Chen},
  title = {VideoTuna: A Powerful Toolkit for Video Generation with Model Fine-Tuning and Post-Training},
  month = {Nov},
  year = {2024},
  url = {https://github.com/VideoVerses/VideoTuna}
}

Star History
