TA2V: Text-Audio Guided Video Generation

December 1, 2025

This is the official implementation of our Text&Audio-guided Video Maker (TAgVM) model for the TA2V task. Our focus is music performance video generation: given a text prompt and an audio signal as input, the model synthesizes the players' motions and gestures in time with the melody and rhythm.

our TAgVM model

Examples

Music Performance Videos

generation_stage2_5_db_39_Jerusalem_26_28 1_vn_44_K515_15_5 2_tpt_42_Arioso_79_12 1_fl_40_Miserere_13_47

Landscape Videos

fire_crackling_136_6_34 fire_crackling_141_2_1 splashing_water_143_6_17 squishing_water_136_8_38

Failure

underwater_bubbling_119_7_21 raining_145_3_37

Setup

  1. Create the virtual environment
conda create -n tav python==3.9
conda activate tav
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
pip install pytorch-lightning==1.5.4 einops ftfy h5py imageio regex scikit-image scikit-video tqdm lpips blobfile mpi4py opencv-python-headless kornia termcolor pytorch-ignite visdom piq joblib av==10.0.0 matplotlib ffmpeg==4.2.2 pillow==9.5.0
pip install git+https://github.com/openai/CLIP.git wav2clip transformers
  2. Create a saved_ckpts folder and download the pretrained checkpoints into it.

Datasets

We created two three-modality datasets, URMP-VAT and Landscape-VAT. The training and testing splits of each dataset contain four folders: mp4, stft_pickle, txt, and an audio folder (shown as wav in the layout below).

Download the processed datasets into the datasets folder.

URMP-VAT (Landscape-VAT)
  |---train
    |---mp4
    |---stft_pickle
    |---txt
    |---wav
  |---test
    |---mp4
    |---stft_pickle
    |---txt
    |---wav
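
Before training or sampling, it can help to confirm that a downloaded dataset matches this layout. The following is a minimal sketch, not part of the repository: the helper name `missing_folders` is hypothetical, and the folder names are copied from the tree above.

```python
from pathlib import Path

# Folder names copied from the layout above; adjust them if your copy differs.
SPLITS = ("train", "test")
MODALITIES = ("mp4", "stft_pickle", "txt", "wav")


def missing_folders(dataset_root):
    """Return the expected sub-folders that are absent under dataset_root.

    Hypothetical helper for illustration; it is not part of the TA2V code.
    """
    root = Path(dataset_root)
    return [
        f"{split}/{modality}"
        for split in SPLITS
        for modality in MODALITIES
        if not (root / split / modality).is_dir()
    ]


if __name__ == "__main__":
    # datasets/post_URMP is the data_path used by the commands in this README.
    for name in missing_folders("datasets/post_URMP"):
        print("missing:", name)
```

An empty result means all eight folders are present.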

Download pre-trained checkpoints

Dataset        VQGAN                           GPT                     Diffusion
URMP-VAT       URMP-VAT_video_VQGAN.ckpt       URMP-VAT_GPT.ckpt       URMP-VAT_diffusion.pt
Landscape-VAT  Landscape-VAT_video_VQGAN.ckpt  Landscape-VAT_GPT.ckpt  Landscape-VAT_diffusion.pt

We use the AudioCLIP model to encode audio and images; its checkpoints can be downloaded from the AudioCLIP open project page.

Sampling Procedure

Sample Short Music Performance Videos

  • gpt_text_ckpt: path to GPT checkpoint
  • vqgan_ckpt: path to video VQGAN checkpoint
  • data_path: path to the dataset; change it to post_landscape for the Landscape-VAT dataset
  • load_vid_len: for URMP-VAT, it is set to 90 (fps=30); for Landscape-VAT, it is set to 30 (fps=10)
  • text_emb_model: model to encode text, choices: bert, clip
  • audio_emb_model: model to encode audio, choices: audioclip, wav2clip
  • text_stft_cond: load text-audio-video data
  • n_sample: number of videos to sample
  • run: index for each run
  • resolution: resolution used in training video VQGAN procedure
  • model_output_size: the resolution when training the diffusion model
  • audio_guidance_lambda: coefficient to control audio guidance
  • direction_lambda: coefficient to control semantic change consistency of audio and video
  • text_guidance_lambda: coefficient to control text guidance
  • diffusion_ckpt: path to diffusion model
python scripts/sample_tav.py --gpt_text_ckpt saved_ckpts/best_checkpoint-val_text_loss=2.74.ckpt --text_stft_cond \
--vqgan_ckpt saved_ckpts/epoch=6-step=35999-train_recon_loss=0.15.ckpt --text_emb_model bert \
--data_path datasets/post_URMP/ --top_k 2048 --top_p 0.80 --n_sample 50 --run 17 --dataset URMP --load_vid_len 90 \
--audio_emb_model audioclip --resolution 96 --batch_size 1 --model_output_size 128 --noise_schedule cosine \
--iterations_num 1 --audio_guidance_lambda 10000 --direction_lambda 5000 --text_guidance_lambda 10000 \
--diffusion_ckpt saved_ckpts/model300000.pt
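
The two load_vid_len settings listed above correspond to the same clip duration. A short worked example, assuming load_vid_len counts frames and fps is the frame rate of the source video (the function name `clip_seconds` is hypothetical):

```python
def clip_seconds(load_vid_len, fps):
    """Duration in seconds of a clip of load_vid_len frames at fps frames per second.

    Assumption: load_vid_len is a frame count, as the option list suggests.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return load_vid_len / fps


# Values quoted in the option list above.
print(clip_seconds(90, 30))  # URMP-VAT: prints 3.0
print(clip_seconds(30, 10))  # Landscape-VAT: prints 3.0
```

Under that assumption, both datasets are loaded as 3-second clips.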

Calculate Evaluation Metrics

  • exp_tag: name of result folder, which is under results folder
  • audio_folder: audio folder name, default: audio
  • video_folder: video folder name, choices: fake_stage1, fake_stage2, real
  • txt_folder: text folder name, default: txt
  • CLIP audio score
python tools/clip_score/clip_audio.py --exp_tag 1_tav_URMP --audio_folder audio --video_folder fake_stage2 --audio_emb_model audioclip
  • CLIP text score
python tools/clip_score/clip_text.py --exp_tag 1_tav_URMP --txt_folder txt --video_folder fake_stage2 --batch_size 5
  • real_folder: ground-truth video folder name, default: real
  • fake_folder: generated video folder name, choices: fake_stage1, fake_stage2
  • mode: mode to calculate FVD, FID scores, choices: full, size
  • FVD
python tools/tf_fvd/fvd.py --exp_tag 1_tav_URMP --real_folder real --fake_folder fake_stage2 --mode full
  • FID
python tools/tf_fvd/fid.py --exp_tag 1_tav_URMP --real_folder real --fake_folder fake_stage2 --mode full
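
For intuition about the CLIP audio and CLIP text scores, the sketch below shows one common way such a score is formed: the mean cosine similarity between a condition embedding (audio or text) and per-frame video embeddings. This is an assumption about the general technique, not a description of what clip_audio.py or clip_text.py compute; the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_clip_score(condition_embedding, frame_embeddings):
    """Average similarity between one condition embedding and each frame embedding.

    Assumption: embeddings come from a shared space (e.g. a CLIP-style encoder).
    """
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    scores = [cosine_similarity(condition_embedding, f) for f in frame_embeddings]
    return sum(scores) / len(scores)


# Toy 2-D embeddings: one frame aligned with the condition, one orthogonal to it.
print(mean_clip_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # prints 0.5
```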

Training Procedure

You can also train the models on custom datasets. The commands below train the video VQGAN, Transformer, and Diffusion models.

video VQGAN

  • embedding_dim: dimension of codebook embeddings, default: 256
  • n_codes: size of codebook, default: 16384
  • n_hiddens: hidden channels base, default: 32
  • downsample: ratio of downsampling, default: 4 8 8, 4 for temporal dimension and 8 8 for spatial dimension
  • lr: learning rate
  • data_path: path to dataset
  • default_root_dir: path to save checkpoints
  • resolution: video resolution for training; we use 96 for URMP-VAT and 64 for Landscape-VAT
  • sequence_length: length of videos, default: 16
python scripts/train_vqgan.py --embedding_dim 256 --n_codes 16384 --n_hiddens 32 --downsample 4 8 8 --no_random_restart \
--gpus 1 --batch_size 8 --num_workers 16 --accumulate_grad_batches 6 --progress_bar_refresh_rate 100 --max_steps 100000 \
--gradient_clip_val 1.0 --lr 6.0e-5 --data_path datasets/post_URMP/ --default_root_dir path/to/save \
--resolution 96 --sequence_length 16 --discriminator_iter_start 10000 --norm_type batch --perceptual_weight 4 \
--image_gan_weight 1 --video_gan_weight 1  --gan_feat_weight 4
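
To see what the downsample setting implies, the sketch below computes the size of the latent grid the VQGAN would produce for a clip. It assumes each video dimension is divided by its downsample ratio and must be divisible by it; the helper names are hypothetical and not taken from the repository.

```python
import math


def latent_grid(sequence_length, resolution, downsample=(4, 8, 8)):
    """Latent grid (frames, height, width) after temporal and spatial downsampling.

    Assumption: each dimension is divided exactly by its downsample ratio.
    """
    t, h, w = downsample
    if sequence_length % t or resolution % h or resolution % w:
        raise ValueError("video dimensions must be divisible by the downsample ratios")
    return (sequence_length // t, resolution // h, resolution // w)


def num_tokens(grid):
    """Number of codebook indices needed to represent one clip."""
    return math.prod(grid)


# Settings listed above: sequence_length 16, resolution 96 (URMP-VAT) or 64 (Landscape-VAT).
for name, resolution in (("URMP-VAT", 96), ("Landscape-VAT", 64)):
    grid = latent_grid(16, resolution)
    print(name, grid, num_tokens(grid))
```

Under these assumptions a 16-frame URMP-VAT clip maps to a 4 x 12 x 12 grid (576 tokens) and a Landscape-VAT clip to 4 x 8 x 8 (256 tokens).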

Transformer

  • first_stage_key: load first stage data from a batch, default: video
  • cond1_stage_key: load condition stage data from a batch, default: text
  • vqvae: path to load pretrained VQGAN checkpoints
  • n_layer: the number of layers in transformer
  • n_head: the number of heads in transformer
  • n_embd: dimension of embeddings in transformer
  • text_seq_len: maximum length of text tokens
  • embd_pdrop: dropout ratio in embedding step
  • resid_pdrop: dropout ratio in transformer blocks
  • attn_pdrop: dropout ratio in attention blocks
python scripts/train_text_transformer.py --num_workers 4 --val_check_interval 0.5 --progress_bar_refresh_rate 100 \
--gpus 1 --sync_batchnorm --batch_size 8 --first_stage_key video --cond1_stage_key text --text_stft_cond --text_emb_model bert \
--vqvae path/to/video/vqgan --data_path datasets/post_URMP/ --load_vid_len 30 --default_root_dir path/to/save \
--base_lr 4.5e-05 --first_stage_vocab_size 16384 --block_size 1024 --n_layer 12 --n_head 8 --n_embd 512 --resolution 96 \
--sequence_length 16 --text_seq_len 12 --max_steps 500000 --embd_pdrop 0.2 --resid_pdrop 0.2 --attn_pdrop 0.2

Diffusion

  • save_dir: path to save checkpoints
  • diffusion_steps: the number of steps to denoise
  • noise_schedule: choices: cosine, linear
  • num_channels: latent channels base
  • num_res_blocks: the number of resnet blocks in diffusion
  • class_cond: whether to use class conditioning
  • image_size: resolution of videos/images
python scripts/diffusion_video_train_3d.py --num_workers 8 --gpus 1 --batch_size 1 --text_stft_cond --data_path datasets/post_URMP/ \
--load_vid_len 30 --save_dir path/to/save --resolution 128 --sequence_length 16 --diffusion_steps 4000 --noise_schedule cosine \
--lr 5e-5 --num_channels 128 --num_res_blocks 3 --class_cond False  --log_interval 50 --save_interval 5000 --image_size 128 --learn_sigma True

We use 3D diffusion here, setting dims=3 in U-Net for convenience.
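
The two noise_schedule choices can be illustrated with the formulas commonly used in improved-DDPM style code. This is a sketch under that assumption; the training script may compute its schedules differently, and the function names and default constants here are not taken from the repository.

```python
import math


def linear_betas(num_steps):
    """Linear schedule: betas rise evenly, with endpoints rescaled from the 1000-step setting."""
    scale = 1000 / num_steps
    start, end = scale * 0.0001, scale * 0.02
    if num_steps == 1:
        return [start]
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]


def cosine_betas(num_steps, s=0.008, max_beta=0.999):
    """Cosine schedule: betas derived from a squared-cosine cumulative signal level."""

    def alpha_bar(t):
        # Fraction of the original signal remaining at normalized time t in [0, 1].
        return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2

    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        # Clip each beta so the final steps do not destroy the signal entirely.
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


# diffusion_steps 4000 matches the training command above.
cosine = cosine_betas(4000)
linear = linear_betas(4000)
print(len(cosine), cosine[0], cosine[-1])
print(len(linear), linear[0], linear[-1])
```

The cosine schedule adds noise more gradually at the start of the forward process than the linear one.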

Acknowledgements

Our code is based on TATS and blended-diffusion.

Citation

If you find our work useful, please consider citing our paper.

@article{zhao2024ta2v,
  title={Ta2v: Text-audio guided video generation},
  author={Zhao, Minglu and Wang, Wenmin and Chen, Tongbao and Zhang, Rui and Li, Ruochen},
  journal={IEEE Transactions on Multimedia},
  volume={26},
  pages={7250--7264},
  year={2024},
  publisher={IEEE}
}