TexTalker: Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture (CVPR 2025)

January 12, 2026

arXiv Project Page


Abstract

Significant progress has been made in speech-driven 3D face animation, but most works focus on learning the motion of the mesh/geometry and ignore the impact of dynamic texture. In this work, we reveal that dynamic texture plays a key role in rendering high-fidelity talking avatars, and introduce a high-resolution 4D dataset, TexTalk4D, consisting of 100 minutes of audio-synced scan-level meshes with detailed 8K dynamic textures from 100 subjects. Based on the dataset, we explore the inherent correlation between motion and texture, and propose a diffusion-based framework, TexTalker, to simultaneously generate facial motions and dynamic textures from speech. Furthermore, we propose a novel pivot-based style injection strategy to capture the complexity of different texture and motion styles, which allows disentangled control. TexTalker, as the first method to generate audio-synced facial motion with dynamic texture, not only outperforms prior art in synthesising facial motions, but also produces realistic textures that are consistent with the underlying facial movements.


TexTalk4D Dataset

Google Drive: Download

  • TexTalkData.zip: Contains 70 seen IDs. The latter half of the speech from ID063 to ID072 is excluded from the training set and is used for calculating quantitative metrics, i.e., TexTalk4D-Test-A in the paper.

  • TexTalkTest.zip: Contains 18 unseen IDs used for qualitative evaluation.

  • TexTalkDataV2.zip: Contains 8 unused IDs. The original dataset lacks subjects with forehead wrinkles, so we have added 8 IDs who exhibit this feature to support future research.

    Due to storage and ethical constraints, the download link only provides textures with a resolution of 512. Please contact us with your identification details if you need higher-resolution data. We do release the neutral 8K texture, for example at /TexTalkDataset/ID001/Models/000001/face.png. To approximate the high-resolution dynamic textures, you can calculate the wrinkle map from the 512-resolution maps, resize it to a higher resolution, and combine it with the high-resolution neutral texture.

    Some subjects requested to withdraw their data, resulting in a discrepancy between the currently available dataset and the one described in the publication.
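The wrinkle-map approximation described in the note above could be sketched as follows. This is only an illustration under stated assumptions: the README does not define the wrinkle map, so it is assumed here to be the per-pixel difference between a 512-resolution dynamic frame and the 512-resolution neutral texture. The function name `approximate_highres_texture` is hypothetical, and nearest-neighbour upsampling is used only to keep the sketch dependency-free.

```python
import numpy as np


def approximate_highres_texture(dynamic_lo, neutral_lo, neutral_hi):
    """Approximate a high-resolution dynamic texture.

    ASSUMPTION: the wrinkle map is the per-pixel difference between the
    low-resolution dynamic frame and the low-resolution neutral texture.
    All arrays are float images in [0, 1] with shape (H, W, 3), and the
    high resolution is an integer multiple of the low resolution.
    """
    scale = neutral_hi.shape[0] // dynamic_lo.shape[0]
    # Wrinkle map: what the expression adds on top of the neutral face.
    wrinkle_lo = dynamic_lo - neutral_lo
    # Nearest-neighbour upsampling along height and width; a smoother
    # filter (e.g. bicubic) would likely look better in practice.
    wrinkle_hi = wrinkle_lo.repeat(scale, axis=0).repeat(scale, axis=1)
    # Add the upsampled wrinkles to the high-resolution neutral texture.
    return np.clip(neutral_hi + wrinkle_hi, 0.0, 1.0)
```

In practice you would load the PNG textures with an image library of your choice, convert them to floats in [0, 1], and write the result back out at the target resolution.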


Install

conda create -n textalker python=3.9 -y
conda activate textalker
# pytorch
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116

pip install -r requirements.txt
# pytorch3d
pip install fvcore iopath
pip install --no-index --no-cache-dir pytorch3d -f https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/py39_cu116_pyt1131/download.html
# basicsr
python basicsr/setup.py develop
# face3d
git clone https://github.com/YadiraF/face3d
cd face3d/mesh/cython
python setup.py build_ext -i
# return to the directory you cloned from
cd ../../..
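After installation, a quick check that the main packages can be found may save a failed training run later. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the package list simply mirrors the install commands above (face3d is omitted because it is cloned and built in place rather than installed).

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # Top-level import names of the packages installed above.
    required = ["torch", "torchvision", "torchaudio", "pytorch3d", "basicsr"]
    missing = missing_packages(required)
    print("missing:", ", ".join(missing) if missing else "none")
```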

Train

Preprocess your dataset

Note: The organization of the dataset directory should be the same as the data we provide.

  1. Generate the offset map of the mesh sequence.
python ./scripts/gen_obj_diff.py --input_dir "path to your training set root"
  2. Train the motion VAE. In options/VQGAN_Motion.yml, set dataroot_gt: "/data/your_usrname/TexTalkData".
CUDA_VISIBLE_DEVICES=0,1,2,3 \
./scripts/dist_train.sh 4 options/VQGAN_Motion.yml
  3. Train the texture VAE. In options/VQGAN_Texture.yml, set dataroot_gt: "/data/your_usrname/TexTalkData".
CUDA_VISIBLE_DEVICES=0,1,2,3 \
./scripts/dist_train.sh 4 options/VQGAN_Texture.yml
  4. Generate motion latent features using the motion VAE.
python ./scripts/gen_latent_gt_mesh.py --input_dir "path to your training set root" \
--model_path "path to your motion vae checkpoint.pth"
  5. Generate texture latent features using the texture VAE.
python ./scripts/gen_latent_gt_tex.py --input_dir "path to your training set root" \
--model_path "path to your texture vae checkpoint.pth"
  6. Generate the pivot features.
python ./scripts/gen_latent_ave.py --input_dir "path to your training set root"
  7. Train the TexTalker model.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
./scripts/dist_train.sh 4 options/TexTalker.yml
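Because the steps above must run in order, it can help to print them as one checklist before launching anything. The sketch below only assembles the command strings from the script names and flags listed above; the helper name `build_training_commands` and all paths are hypothetical, and `CUDA_VISIBLE_DEVICES` is left to your shell environment. The VAE checkpoint paths only exist once the two VAE training steps have finished.

```python
import shlex


def build_training_commands(data_root, motion_ckpt, texture_ckpt, gpus=4):
    """Return the seven preprocessing and training commands, in order.

    All three path arguments are placeholders supplied by the caller.
    """
    root = shlex.quote(data_root)
    train = f"./scripts/dist_train.sh {gpus}"
    return [
        # 1. offset maps of the mesh sequences
        f"python ./scripts/gen_obj_diff.py --input_dir {root}",
        # 2-3. motion and texture VAEs
        f"{train} options/VQGAN_Motion.yml",
        f"{train} options/VQGAN_Texture.yml",
        # 4-5. latent features from the trained VAEs
        f"python ./scripts/gen_latent_gt_mesh.py --input_dir {root} "
        f"--model_path {shlex.quote(motion_ckpt)}",
        f"python ./scripts/gen_latent_gt_tex.py --input_dir {root} "
        f"--model_path {shlex.quote(texture_ckpt)}",
        # 6. pivot features
        f"python ./scripts/gen_latent_ave.py --input_dir {root}",
        # 7. the TexTalker model itself
        f"{train} options/TexTalker.yml",
    ]


if __name__ == "__main__":
    # Hypothetical paths; replace them with your own.
    commands = build_training_commands(
        "/data/TexTalkData", "motion_vae.pth", "texture_vae.pth")
    for i, cmd in enumerate(commands, 1):
        print(f"{i}. {cmd}")
```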

Inference

Quick start

  1. Download the checkpoints and put them in ./checkpoints. Google Drive: Download
  2. Run the inference. It requires the template OBJ, texture, motion pivot, and texture pivot.
python inference.py --tex_decoder_path "checkpoints/tex_vae.pth" \
--motion_decoder_path "checkpoints/motion_vae.pth" \
--model_path "checkpoints/textalker.pth" \
--input "example/test_id" \
--audio_path "example/Records/enhanced_vocal.wav" \
--output "results"

You can use your own template mesh and generate the pivots by following Steps 4-6 of the training stage.
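To drive the same avatar with several recordings, the quick-start command can be assembled once per audio file. The sketch below builds the argument list from the flags shown above; the helper name `inference_argv` and the `my_audio` directory are hypothetical, and the checkpoint file names are taken from the quick-start example.

```python
from pathlib import Path


def inference_argv(audio_path, subject_dir, output_dir, ckpt_dir="checkpoints"):
    """Build the inference.py argument list for one audio file."""
    ckpt = Path(ckpt_dir)
    return [
        "python", "inference.py",
        "--tex_decoder_path", (ckpt / "tex_vae.pth").as_posix(),
        "--motion_decoder_path", (ckpt / "motion_vae.pth").as_posix(),
        "--model_path", (ckpt / "textalker.pth").as_posix(),
        "--input", str(subject_dir),
        "--audio_path", str(audio_path),
        "--output", str(output_dir),
    ]


if __name__ == "__main__":
    # Hypothetical folder of recordings; prints one command per .wav file.
    for wav in sorted(Path("my_audio").glob("*.wav")):
        print(" ".join(inference_argv(wav, "example/test_id", "results")))
```

Each returned list can be passed to `subprocess.run(argv, check=True)` if you prefer to launch the runs from Python rather than copy the printed commands.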

Acknowledgement

This work builds on many excellent research works and open-source projects. Many thanks to all the authors.


Citation

If our work is useful for your research, please consider citing:

@inproceedings{li2025towards,
  title={Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture},
  author={Li, Xuanchen and Wang, Jianyu and Cheng, Yuhao and Zeng, Yikun and Ren, Xingyu and Zhu, Wenhan and Zhao, Weiming and Yan, Yichao},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={204--214},
  year={2025}
}