TA2V: Text-Audio Guided Video Generation
December 1, 2025
This is the official implementation of our Text&Audio-guided Video Maker (TAgVM) model for the TA2V task. We focus on music performance video generation: given a text prompt and an audio signal as input, the model synthesizes the motions and gestures of players moving in time with the melody and rhythm.
Examples
Music Performance Videos
Landscape Videos
Failure
Setup
- Create the virtual environment
conda create -n tav python==3.9
conda activate tav
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
pip install pytorch-lightning==1.5.4 einops ftfy h5py imageio regex scikit-image scikit-video tqdm lpips blobfile mpi4py opencv-python-headless kornia termcolor pytorch-ignite visdom piq joblib av==10.0.0 matplotlib ffmpeg==4.2.2 pillow==9.5.0
pip install git+https://github.com/openai/CLIP.git wav2clip transformers
- Create a `saved_ckpts` folder and download the pretrained checkpoints into it.
Datasets
We build two three-modality datasets, URMP-VAT and Landscape-VAT. Each training or testing split contains four folders (mp4, stft_pickle, txt, wav).
Download the processed datasets into the `datasets` folder.
URMP-VAT (Landscape-VAT)
|---train
|   |---mp4
|   |---stft_pickle
|   |---txt
|   |---wav
|---test
|   |---mp4
|   |---stft_pickle
|   |---txt
|   |---wav
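Before launching a run, it can help to confirm that a downloaded dataset matches the tree above. The following is a minimal sketch, not part of the repository: the helper name `missing_dataset_folders` is hypothetical, and the folder names are copied from the tree.

```python
from pathlib import Path

# Split and modality folder names copied from the dataset tree above.
SPLITS = ("train", "test")
MODALITIES = ("mp4", "stft_pickle", "txt", "wav")


def missing_dataset_folders(root):
    """Return the expected sub-folders (as 'split/modality') missing under `root`."""
    root = Path(root)
    return [
        (Path(split) / modality).as_posix()
        for split in SPLITS
        for modality in MODALITIES
        if not (root / split / modality).is_dir()
    ]


if __name__ == "__main__":
    # "datasets/post_URMP" is the data path used by the commands in this README.
    missing = missing_dataset_folders("datasets/post_URMP")
    print("layout OK" if not missing else f"missing folders: {missing}")
```

An empty list means all eight folders are present. The check looks only at folder names, not at the files inside them.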
Download pre-trained checkpoints
| Dataset | VQGAN | GPT | Diffusion |
|---|---|---|---|
| URMP-VAT | URMP-VAT_video_VQGAN.ckpt | URMP-VAT_GPT.ckpt | URMP-VAT_diffusion.pt |
| Landscape-VAT | Landscape-VAT_video_VQGAN.ckpt | Landscape-VAT_GPT.ckpt | Landscape-VAT_diffusion.pt |
We use the AudioCLIP model to encode audio and images; its checkpoints can be downloaded from the AudioCLIP project page.
Sampling Procedure
Sample Short Music Performance Videos
- `gpt_text_ckpt`: path to the GPT checkpoint
- `vqgan_ckpt`: path to the video VQGAN checkpoint
- `data_path`: path to the dataset; change it to `post_landscape` for the Landscape-VAT dataset
- `load_vid_len`: set to `90` for URMP-VAT (fps=30) and to `30` for Landscape-VAT (fps=10)
- `text_emb_model`: model used to encode text, choices: `bert`, `clip`
- `audio_emb_model`: model used to encode audio, choices: `audioclip`, `wav2clip`
- `text_stft_cond`: load text-audio-video data
- `n_sample`: number of videos to sample
- `run`: index of the run
- `resolution`: resolution used when training the video VQGAN
- `model_output_size`: resolution used when training the diffusion model
- `audio_guidance_lambda`: coefficient controlling audio guidance
- `direction_lambda`: coefficient controlling the consistency of semantic change between audio and video
- `text_guidance_lambda`: coefficient controlling text guidance
- `diffusion_ckpt`: path to the diffusion model checkpoint
python scripts/sample_tav.py --gpt_text_ckpt saved_ckpts/best_checkpoint-val_text_loss=2.74.ckpt --text_stft_cond \
--vqgan_ckpt saved_ckpts/epoch=6-step=35999-train_recon_loss=0.15.ckpt --text_emb_model bert \
--data_path datasets/post_URMP/ --top_k 2048 --top_p 0.80 --n_sample 50 --run 17 --dataset URMP --load_vid_len 90 \
--audio_emb_model audioclip --resolution 96 --batch_size 1 --model_output_size 128 --noise_schedule cosine \
--iterations_num 1 --audio_guidance_lambda 10000 --direction_lambda 5000 --text_guidance_lambda 10000 \
--diffusion_ckpt saved_ckpts/model300000.pt
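The two `load_vid_len` settings listed for sampling look different but cover the same clip duration. The short sketch below works through the arithmetic, assuming `load_vid_len` counts frames; the names `SAMPLING_PRESETS` and `clip_seconds` are illustrative and do not come from the repository.

```python
# (load_vid_len, fps) pairs quoted in the sampling parameter list.
SAMPLING_PRESETS = {
    "URMP-VAT": {"load_vid_len": 90, "fps": 30},
    "Landscape-VAT": {"load_vid_len": 30, "fps": 10},
}


def clip_seconds(load_vid_len, fps):
    """Duration in seconds of a clip with `load_vid_len` frames at `fps`."""
    return load_vid_len / fps


for name, preset in SAMPLING_PRESETS.items():
    print(f"{name}: {clip_seconds(**preset):.1f} s")
# prints:
# URMP-VAT: 3.0 s
# Landscape-VAT: 3.0 s
```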
Calculate Evaluation Metrics
- `exp_tag`: name of the result folder, which is under the `results` folder
- `audio_folder`: audio folder name, default: `audio`
- `video_folder`: video folder name, choices: `fake_stage1`, `fake_stage2`, `real`
- `txt_folder`: text folder name, default: `txt`
- CLIP audio score
python tools/clip_score/clip_audio.py --exp_tag 1_tav_URMP --audio_folder audio --video_folder fake_stage2 --audio_emb_model audioclip
- CLIP text score
python tools/clip_score/clip_text.py --exp_tag 1_tav_URMP --txt_folder txt --video_folder fake_stage2 --batch_size 5
- `real_folder`: ground-truth video folder name, default: `real`
- `fake_folder`: generated video folder name, choices: `fake_stage1`, `fake_stage2`
- `mode`: mode used to calculate FVD and FID scores, choices: `full`, `size`
- FVD
python tools/tf_fvd/fvd.py --exp_tag 1_tav_URMP --real_folder real --fake_folder fake_stage2 --mode full
- FID
python tools/tf_fvd/fid.py --exp_tag 1_tav_URMP --real_folder real --fake_folder fake_stage2 --mode full
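To score both generation stages without retyping each command, the four metric invocations can be assembled programmatically. This is only a sketch: `metric_commands` is a hypothetical helper, while the script paths and flags are copied from the commands in this section.

```python
import shlex


def metric_commands(exp_tag, fake_folder="fake_stage2"):
    """Build the four evaluation commands of this section as argument lists."""
    return [
        ["python", "tools/clip_score/clip_audio.py", "--exp_tag", exp_tag,
         "--audio_folder", "audio", "--video_folder", fake_folder,
         "--audio_emb_model", "audioclip"],
        ["python", "tools/clip_score/clip_text.py", "--exp_tag", exp_tag,
         "--txt_folder", "txt", "--video_folder", fake_folder,
         "--batch_size", "5"],
        ["python", "tools/tf_fvd/fvd.py", "--exp_tag", exp_tag,
         "--real_folder", "real", "--fake_folder", fake_folder,
         "--mode", "full"],
        ["python", "tools/tf_fvd/fid.py", "--exp_tag", exp_tag,
         "--real_folder", "real", "--fake_folder", fake_folder,
         "--mode", "full"],
    ]


if __name__ == "__main__":
    # Print the commands for both stages; nothing is executed here.
    for stage in ("fake_stage1", "fake_stage2"):
        for cmd in metric_commands("1_tav_URMP", stage):
            print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` from the repository root to run the metric.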
Training Procedure
You can also train the models on custom datasets. Below are the commands to train the VQGAN, Transformer and Diffusion models.
video VQGAN
- `embedding_dim`: dimension of codebook embeddings, default: `256`
- `n_codes`: size of codebook, default: `16384`
- `n_hiddens`: hidden channels base, default: `32`
- `downsample`: ratio of downsampling, default: `4 8 8`, `4` for the temporal dimension and `8 8` for the spatial dimensions
- `lr`: learning rate
- `data_path`: path to dataset
- `default_root_dir`: path to save checkpoints
- `resolution`: video resolution to train, we set `96` for URMP-VAT, `64` for Landscape-VAT
- `sequence_length`: length of videos, default: `16`
python scripts/train_vqgan.py --embedding_dim 256 --n_codes 16384 --n_hiddens 32 --downsample 4 8 8 --no_random_restart \
--gpus 1 --batch_size 8 --num_workers 16 --accumulate_grad_batches 6 --progress_bar_refresh_rate 100 --max_steps 100000 \
--gradient_clip_val 1.0 --lr 6.0e-5 --data_path datasets/post_URMP/ --default_root_dir path/to/save \
--resolution 96 --sequence_length 16 --discriminator_iter_start 10000 --norm_type batch --perceptual_weight 4 \
--image_gan_weight 1 --video_gan_weight 1 --gan_feat_weight 4
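The `downsample` ratios determine how many codebook tokens represent one clip. The sketch below works through that arithmetic for the two resolutions mentioned above. It assumes each dimension is divided exactly by its ratio and that each latent cell maps to one codebook index; `latent_grid` and `num_tokens` are illustrative names, not repository functions.

```python
def latent_grid(sequence_length, resolution, downsample=(4, 8, 8)):
    """Return the (frames, height, width) of the latent grid after downsampling."""
    dims = (sequence_length, resolution, resolution)
    for size, factor in zip(dims, downsample):
        if size % factor != 0:
            raise ValueError(f"{size} is not divisible by downsampling ratio {factor}")
    return tuple(size // factor for size, factor in zip(dims, downsample))


def num_tokens(grid):
    """Number of latent cells, i.e. codebook indices under the stated assumption."""
    frames, height, width = grid
    return frames * height * width


for name, resolution in (("URMP-VAT", 96), ("Landscape-VAT", 64)):
    grid = latent_grid(16, resolution)
    print(name, grid, num_tokens(grid))
# prints:
# URMP-VAT (4, 12, 12) 576
# Landscape-VAT (4, 8, 8) 256
```

Under these assumptions, a resolution that is not a multiple of 8 cannot be encoded with the default ratios, which the `ValueError` makes explicit.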
Transformer
- `first_stage_key`: load first stage data from a batch, default: `video`
- `cond1_stage_key`: load condition stage data from a batch, default: `text`
- `vqvae`: path to load pretrained VQGAN checkpoints
- `n_layer`: the number of layers in transformer
- `n_head`: the number of heads in transformer
- `n_embd`: dimension of embeddings in transformer
- `text_seq_len`: maximum length of text tokens
- `embd_pdrop`: dropout ratio in embedding step
- `resid_pdrop`: dropout ratio in transformer blocks
- `attn_pdrop`: dropout ratio in attention blocks
python scripts/train_text_transformer.py --num_workers 4 --val_check_interval 0.5 --progress_bar_refresh_rate 100 \
--gpus 1 --sync_batchnorm --batch_size 8 --first_stage_key video --cond1_stage_key text --text_stft_cond --text_emb_model bert \
--vqvae path/to/video/vqgan --data_path datasets/post_URMP/ --load_vid_len 30 --default_root_dir path/to/save \
--base_lr 4.5e-05 --first_stage_vocab_size 16384 --block_size 1024 --n_layer 12 --n_head 8 --n_embd 512 --resolution 96 \
--sequence_length 16 --text_seq_len 12 --max_steps 500000 --embd_pdrop 0.2 --resid_pdrop 0.2 --attn_pdrop 0.2
Diffusion
- `save_dir`: path to save checkpoints
- `diffusion_steps`: the number of denoising steps
- `noise_schedule`: choices: `cosine`, `linear`
- `num_channels`: latent channels base
- `num_res_blocks`: the number of resnet blocks in diffusion
- `class_cond`: whether to use class conditioning
- `image_size`: resolution of videos/images
python scripts/diffusion_video_train_3d.py --num_workers 8 --gpus 1 --batch_size 1 --text_stft_cond --data_path datasets/post_URMP/ \
--load_vid_len 30 --save_dir path/to/save --resolution 128 --sequence_length 16 --diffusion_steps 4000 --noise_schedule cosine \
--lr 5e-5 --num_channels 128 --num_res_blocks 3 --class_cond False --log_interval 50 --save_interval 5000 --image_size 128 --learn_sigma True
We use 3D diffusion here, setting dims=3 in U-Net for convenience.
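For readers unfamiliar with the two `noise_schedule` choices, the sketch below shows the linear and cosine schedules as commonly formulated in the improved-DDPM line of work. It is an illustration under that assumption and has not been checked against this repository's code; the function names and the constants `s=0.008` and `max_beta=0.999` follow the common formulation rather than anything stated in this README.

```python
import math


def linear_betas(num_steps):
    """Linearly spaced betas, rescaled so the schedule matches 1000-step defaults."""
    scale = 1000 / num_steps
    start, end = scale * 1e-4, scale * 2e-2
    if num_steps == 1:
        return [start]
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]


def cosine_betas(num_steps, max_beta=0.999, s=0.008):
    """Betas derived from a squared-cosine cumulative signal level."""
    def alpha_bar(t):
        return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2

    return [
        min(1 - alpha_bar((i + 1) / num_steps) / alpha_bar(i / num_steps), max_beta)
        for i in range(num_steps)
    ]


def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal left at each step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1 - beta
        out.append(prod)
    return out


if __name__ == "__main__":
    steps = 4000  # value of --diffusion_steps in the command above
    mid = steps // 2 - 1
    print("linear, signal left at midpoint:", cumulative_alphas(linear_betas(steps))[mid])
    print("cosine, signal left at midpoint:", cumulative_alphas(cosine_betas(steps))[mid])
```

In this formulation the cosine schedule retains more signal through the middle of the process than the linear one.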
Acknowledgements
Our code is based on TATS and blended-diffusion.
Citation
If you find our work useful, please consider citing our paper.
@article{zhao2024ta2v,
title={Ta2v: Text-audio guided video generation},
author={Zhao, Minglu and Wang, Wenmin and Chen, Tongbao and Zhang, Rui and Li, Ruochen},
journal={IEEE Transactions on Multimedia},
volume={26},
pages={7250--7264},
year={2024},
publisher={IEEE}
}