TA2V: Text-Audio Guided Video Generation

December 1, 2025

This is the official implementation of our proposed Text&Audio-guided Video Maker (TAgVM) model for the TA2V task. Because we focus on music performance video generation, the model takes both a text prompt and an audio signal as input and synthesizes the motion or gestures of the players, moving with the corresponding melody and rhythm.

Examples

Music Performance Videos

generation_stage2_5_db_39_Jerusalem_26_28 1_vn_44_K515_15_5 2_tpt_42_Arioso_79_12 1_fl_40_Miserere_13_47

Landscape Videos

fire_crackling_136_6_34 fire_crackling_141_2_1 splashing_water_143_6_17 squishing_water_136_8_38

Failure

underwater_bubbling_119_7_21 raining_145_3_37

Setup

Create the virtual environment

conda create -n tav python==3.9
conda activate tav
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
pip install pytorch-lightning==1.5.4 einops ftfy h5py imageio regex scikit-image scikit-video tqdm lpips blobfile mpi4py opencv-python-headless kornia termcolor pytorch-ignite visdom piq joblib av==10.0.0 matplotlib ffmpeg==4.2.2 pillow==9.5.0
pip install git+https://github.com/openai/CLIP.git wav2clip transformers

Create a saved_ckpts folder and download the pretrained checkpoints into it.

Datasets

We created two three-modality datasets, URMP-VAT and Landscape-VAT. Each training and testing split contains four folders: mp4, stft_pickle, txt, and wav (audio).

Download the processed datasets into the datasets folder.

URMP-VAT (Landscape-VAT)
  |---train
    |---mp4
    |---stft_pickle
    |---txt
    |---wav
  |---test
    |---mp4
    |---stft_pickle
    |---txt
    |---wav
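
Before training or sampling, it can help to confirm that a downloaded dataset matches the layout above. The following is a minimal sketch, not part of the repository: the function name `missing_folders` is hypothetical, and the folder names are taken directly from the tree.

```python
from pathlib import Path

# Folder names copied from the layout above.
SPLITS = ("train", "test")
MODALITIES = ("mp4", "stft_pickle", "txt", "wav")


def missing_folders(dataset_root):
    """Return the expected sub-folders (as 'split/modality') missing under dataset_root."""
    root = Path(dataset_root)
    return [
        (Path(split) / modality).as_posix()
        for split in SPLITS
        for modality in MODALITIES
        if not (root / split / modality).is_dir()
    ]


if __name__ == "__main__":
    # datasets/post_URMP is the data path used by the commands in this README.
    for rel in missing_folders("datasets/post_URMP"):
        print("missing:", rel)
```

An empty result means all eight folders are present; the check does not inspect the files inside them.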

Download pre-trained checkpoints

Dataset	VQGAN	GPT	Diffusion
URMP-VAT	URMP-VAT_video_VQGAN.ckpt	URMP-VAT_GPT.ckpt	URMP-VAT_diffusion.pt
Landscape-VAT	Landscape-VAT_video_VQGAN.ckpt	Landscape-VAT_GPT.ckpt	Landscape-VAT_diffusion.pt

We use the AudioCLIP model to encode audio and images; its checkpoints can be downloaded from the AudioCLIP project page.

Sampling Procedure

Sample Short Music Performance Videos

gpt_text_ckpt: path to GPT checkpoint
vqgan_ckpt: path to video VQGAN checkpoint
data_path: path to the dataset; change it to post_landscape for the Landscape-VAT dataset
load_vid_len: for URMP-VAT, it is set to 90 (fps=30); for Landscape-VAT, it is set to 30 (fps=10)
text_emb_model: model to encode text, choices: bert, clip
audio_emb_model: model to encode audio, choices: audioclip, wav2clip
text_stft_cond: load text-audio-video data
n_sample: the number of videos to sample
run: index for each run
resolution: resolution used when training the video VQGAN
model_output_size: the resolution when training the diffusion model
audio_guidance_lambda: coefficient to control audio guidance
direction_lambda: coefficient to control semantic change consistency of audio and video
text_guidance_lambda: coefficient to control text guidance
diffusion_ckpt: path to diffusion model

python scripts/sample_tav.py --gpt_text_ckpt saved_ckpts/best_checkpoint-val_text_loss=2.74.ckpt --text_stft_cond \
--vqgan_ckpt saved_ckpts/epoch=6-step=35999-train_recon_loss=0.15.ckpt --text_emb_model bert \
--data_path datasets/post_URMP/ --top_k 2048 --top_p 0.80 --n_sample 50 --run 17 --dataset URMP --load_vid_len 90 \
--audio_emb_model audioclip --resolution 96 --batch_size 1 --model_output_size 128 --noise_schedule cosine \
--iterations_num 1 --audio_guidance_lambda 10000 --direction_lambda 5000 --text_guidance_lambda 10000 \
--diffusion_ckpt saved_ckpts/model300000.pt
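
The command above also passes `--top_k 2048` and `--top_p 0.80`, which the parameter list does not describe. Assuming they follow the usual top-k and nucleus (top-p) conventions for sampling tokens from an autoregressive model, the filtering can be sketched as below. This is an illustration in plain Python, not the code in scripts/sample_tav.py; the function names are hypothetical, and real implementations differ in details such as whether the cumulative mass is measured before or after the top-k cut.

```python
import math
import random


def filter_top_k_top_p(logits, top_k, top_p):
    """Return {token_index: probability} after top-k, then nucleus (top-p) filtering."""
    # Softmax over all logits, shifted by the max for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, top_k=2048, top_p=0.80, rng=random):
    """Draw one token index from the filtered distribution."""
    dist = filter_top_k_top_p(logits, top_k, top_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

Under this reading, lowering `top_p` or `top_k` restricts sampling to fewer, more likely tokens, trading diversity for fidelity.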

Calculate Evaluation Metrics

exp_tag: name of the result folder, located under the results folder
audio_folder: audio folder name, default: audio
video_folder: video folder name, choices: fake_stage1, fake_stage2, real
txt_folder: text folder name, default: txt

CLIP audio score

python tools/clip_score/clip_audio.py --exp_tag 1_tav_URMP --audio_folder audio --video_folder fake_stage2 --audio_emb_model audioclip

CLIP text score

python tools/clip_score/clip_text.py --exp_tag 1_tav_URMP --txt_folder txt --video_folder fake_stage2 --batch_size 5

real_folder: ground-truth video folder name, default: real
fake_folder: generated video folder name, choices: fake_stage1, fake_stage2
mode: mode for calculating the FVD and FID scores, choices: full, size

FVD

python tools/tf_fvd/fvd.py --exp_tag 1_tav_URMP --real_folder real --fake_folder fake_stage2 --mode full

FID

python tools/tf_fvd/fid.py --exp_tag 1_tav_URMP --real_folder real --fake_folder fake_stage2 --mode full
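
As a rough intuition for the two CLIP scores, CLIP-style metrics are commonly built on the cosine similarity between an embedding of the generated frames and an embedding of the condition (audio or text). The sketch below illustrates that idea only; it is an assumption about the general approach, not the code in tools/clip_score, and the function names and the per-frame averaging are hypothetical. The actual scripts may scale or aggregate the values differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_similarity(frame_embeddings, condition_embedding):
    """Average cosine similarity between each frame embedding and one condition embedding."""
    scores = [cosine_similarity(f, condition_embedding) for f in frame_embeddings]
    return sum(scores) / len(scores)
```

In practice the embeddings would come from the chosen encoder (for example AudioCLIP or wav2clip for audio); here they are plain lists of floats.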

Training Procedure

You can also train the models on custom datasets. Below are the commands to train the VQGAN, Transformer, and Diffusion models.

video VQGAN

embedding_dim: dimension of codebook embeddings, default: 256
n_codes: size of codebook, default: 16384
n_hiddens: hidden channels base, default: 32
downsample: downsampling ratios, default: 4 8 8 (4 for the temporal dimension, 8 8 for the two spatial dimensions)
lr: learning rate
data_path: path to dataset
default_root_dir: path to save checkpoints
resolution: video resolution used for training; we use 96 for URMP-VAT and 64 for Landscape-VAT
sequence_length: length of videos, default: 16

python scripts/train_vqgan.py --embedding_dim 256 --n_codes 16384 --n_hiddens 32 --downsample 4 8 8 --no_random_restart \
--gpus 1 --batch_size 8 --num_workers 16 --accumulate_grad_batches 6 --progress_bar_refresh_rate 100 --max_steps 100000 \
--gradient_clip_val 1.0 --lr 6.0e-5 --data_path datasets/post_URMP/ --default_root_dir path/to/save \
--resolution 96 --sequence_length 16 --discriminator_iter_start 10000 --norm_type batch --perceptual_weight 4 \
--image_gan_weight 1 --video_gan_weight 1  --gan_feat_weight 4
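
To see what the `downsample`, `sequence_length`, and `resolution` settings imply together, the size of the discrete token grid can be worked out by hand. The sketch below assumes the encoder shrinks each dimension by exactly its downsample factor, which is the usual reading of such ratios but is not confirmed here; `latent_grid` is a hypothetical helper.

```python
def latent_grid(sequence_length, resolution, downsample=(4, 8, 8)):
    """Return ((T, H, W), token_count) for a square video clip after downsampling."""
    dt, dh, dw = downsample
    if sequence_length % dt or resolution % dh or resolution % dw:
        raise ValueError("each dimension must be divisible by its downsample factor")
    grid = (sequence_length // dt, resolution // dh, resolution // dw)
    return grid, grid[0] * grid[1] * grid[2]


print(latent_grid(16, 96))  # URMP-VAT settings -> ((4, 12, 12), 576)
print(latent_grid(16, 64))  # Landscape-VAT settings -> ((4, 8, 8), 256)
```

Under this assumption, a 16-frame clip at resolution 96 becomes 576 codebook indices, and at resolution 64 it becomes 256.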

Transformer

first_stage_key: load first stage data from a batch, default: video
cond1_stage_key: load condition stage data from a batch, default: text
vqvae: path to load pretrained VQGAN checkpoints
n_layer: the number of layers in transformer
n_head: the number of heads in transformer
n_embd: dimension of embeddings in transformer
text_seq_len: maximum length of text tokens
embd_pdrop: dropout ratio in embedding step
resid_pdrop: dropout ratio in transformer blocks
attn_pdrop: dropout ratio in attention blocks

python scripts/train_text_transformer.py --num_workers 4 --val_check_interval 0.5 --progress_bar_refresh_rate 100 \
--gpus 1 --sync_batchnorm --batch_size 8 --first_stage_key video --cond1_stage_key text --text_stft_cond --text_emb_model bert \
--vqvae path/to/video/vqgan --data_path datasets/post_URMP/ --load_vid_len 30 --default_root_dir path/to/save \
--base_lr 4.5e-05 --first_stage_vocab_size 16384 --block_size 1024 --n_layer 12 --n_head 8 --n_embd 512 --resolution 96 \
--sequence_length 16 --text_seq_len 12 --max_steps 500000 --embd_pdrop 0.2 --resid_pdrop 0.2 --attn_pdrop 0.2

Diffusion

save_dir: path to save checkpoints
diffusion_steps: the number of steps to denoise
noise_schedule: choices: cosine, linear
num_channels: latent channels base
num_res_blocks: the number of resnet blocks in diffusion
class_cond: whether to use class conditioning
image_size: resolution of videos/images

python scripts/diffusion_video_train_3d.py --num_workers 8 --gpus 1 --batch_size 1 --text_stft_cond --data_path datasets/post_URMP/ \
--load_vid_len 30 --save_dir path/to/save --resolution 128 --sequence_length 16 --diffusion_steps 4000 --noise_schedule cosine \
--lr 5e-5 --num_channels 128 --num_res_blocks 3 --class_cond False  --log_interval 50 --save_interval 5000 --image_size 128 --learn_sigma True

We use 3D diffusion here, setting dims=3 in U-Net for convenience.
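
For readers unfamiliar with the two `noise_schedule` choices, the sketch below shows the cosine and linear beta schedules as they are commonly defined in the improved-DDPM line of work. It is written in plain Python for illustration and assumes this repository follows the same definitions; the function names are hypothetical and are not the ones used in scripts/diffusion_video_train_3d.py.

```python
import math


def cosine_betas(num_steps, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine cumulative signal level alpha_bar(t)."""
    def alpha_bar(t):
        return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2

    betas = []
    for i in range(num_steps):
        t1 = i / num_steps
        t2 = (i + 1) / num_steps
        # Clip at max_beta to avoid a singularity at the final step.
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


def linear_betas(num_steps):
    """Linearly spaced betas, rescaled so the endpoints match a 1000-step schedule."""
    scale = 1000 / num_steps
    start, end = scale * 1e-4, scale * 0.02
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]


betas = cosine_betas(4000)  # matches --diffusion_steps 4000 in the command above
```

Compared with the linear schedule, the cosine schedule adds noise more gradually in the early steps, which is why it is often preferred at lower resolutions.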

Acknowledgements

Our code is based on TATS and blended-diffusion.

Citation

If you find our work useful, please consider citing our paper.

@article{zhao2024ta2v,
  title={Ta2v: Text-audio guided video generation},
  author={Zhao, Minglu and Wang, Wenmin and Chen, Tongbao and Zhang, Rui and Li, Ruochen},
  journal={IEEE Transactions on Multimedia},
  volume={26},
  pages={7250--7264},
  year={2024},
  publisher={IEEE}
}