DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model
February 7, 2026
Paper | Webpage | Online Gradio Web Demo | Wandb Training Logs
We are gradually releasing the code for this project.
TODO List
- Offline video generation
- Gradio Demo
- Online video generation
- Training code
Online Demo
You can try out our model directly via the Online Gradio Web Demo.
Setup
Minimum Requirement: GPU with 10GB VRAM.
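One way to check the VRAM requirement before installing anything is to ask nvidia-smi for each GPU's total memory. The sketch below is not part of the repository: the helper names are hypothetical, and reading "10GB" as 10 x 1024 MiB is an assumption.

```python
import subprocess

# Assumption: the 10GB minimum is read as 10 * 1024 MiB, the unit nvidia-smi reports.
MIN_VRAM_MIB = 10 * 1024


def parse_total_memory(smi_output):
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def gpus_meeting_minimum(memory_mib, minimum=MIN_VRAM_MIB):
    """Return the indices of GPUs whose total memory is at least `minimum` MiB."""
    return [index for index, mib in enumerate(memory_mib) if mib >= minimum]


def query_total_memory():
    """Ask nvidia-smi for the total memory of every visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_total_memory(result.stdout)


if __name__ == "__main__":
    try:
        usable = gpus_meeting_minimum(query_total_memory())
        print("GPUs meeting the minimum:", usable if usable else "none")
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError) as err:
        # nvidia-smi is missing, failed, or printed a non-numeric value.
        print("Could not query nvidia-smi:", err)
```

An index printed here can be passed to CUDA_VISIBLE_DEVICES when launching the demo.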
Environment
Create a Python environment using conda:
conda create -n dystream_py11 python=3.11
conda activate dystream_py11
pip install -r requirements.txt
Download Checkpoints
Download the required checkpoints and tools:
git clone https://huggingface.co/robinwitch/DyStream
cd DyStream
mv tools ../
mv checkpoints ../
cd ..
rm -rf DyStream
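After these commands, tools and checkpoints should sit in the directory you started from. A minimal sketch for confirming that is shown below; missing_assets is a hypothetical helper, not something shipped with the project.

```python
from pathlib import Path


def missing_assets(repo_root="."):
    """Return the expected asset directories that are absent under repo_root."""
    root = Path(repo_root)
    # "tools" and "checkpoints" are the two directories moved by the commands above.
    return [name for name in ("tools", "checkpoints") if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_assets()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("tools/ and checkpoints/ are in place.")
```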
Quick Start
Launch the Gradio Web Demo
CUDA_VISIBLE_DEVICES=0 python -u app.py
Alternatively, start the demo with the provided shell script:
bash run.sh
Batch Inference with Custom Data
Configuration
Batch inference is configured in data_json/sample_files.json, which you can use as a reference and edit for your own data. We provide examples for two scenarios:
- Speaker audio only
- Speaker and listener audio tracks
Scenario 1: Speaker Audio Only
Example configuration:
{
"origin_video_path": null,
"resampled_video_path": "img_files/11.png",
"audio_path": "wav_files/11.wav",
"audio_self_path": "wav_files/11.wav",
"audio_other_path": null,
"motion_self_path": "img_files/11.npz",
"motion_other_path": null,
"mode": "test_wild",
"dataset_type": "dyadic",
"video_id": "single_speaker_11_11"
}
To use your own image and audio:
- Modify the following fields: resampled_video_path, audio_path, audio_self_path, motion_self_path, and video_id
- Required files:
  - resampled_video_path and audio_self_path must exist
  - audio_path should be identical to audio_self_path in this scenario
  - motion_self_path can be set by changing the file extension of resampled_video_path to .npz. This file will be automatically generated during runtime if it doesn't exist
  - video_id can be any identifier for organizing your outputs
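The field rules for this scenario can be sketched as a small helper that fills in an entry from an image path, an audio path, and an identifier. single_speaker_entry is a hypothetical name used only for illustration; the fixed values (mode, dataset_type, and the null fields) are copied from the example configuration above.

```python
import json
from pathlib import Path


def single_speaker_entry(image_path, audio_path, video_id):
    """Build a Scenario 1 entry (hypothetical helper, not part of the repository)."""
    # motion_self_path mirrors the image path with the extension swapped to .npz.
    motion_path = Path(image_path).with_suffix(".npz").as_posix()
    return {
        "origin_video_path": None,
        "resampled_video_path": image_path,
        "audio_path": audio_path,  # identical to audio_self_path in this scenario
        "audio_self_path": audio_path,
        "audio_other_path": None,
        "motion_self_path": motion_path,
        "motion_other_path": None,
        "mode": "test_wild",
        "dataset_type": "dyadic",
        "video_id": video_id,
    }


entry = single_speaker_entry("img_files/11.png", "wav_files/11.wav", "single_speaker_11_11")
# json.dumps writes Python None as JSON null, matching the example configuration.
print(json.dumps(entry, indent=2))
```

With the arguments shown, the result has the same keys and values as the example configuration for this scenario.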
Scenario 2: Speaker and Listener Audio
Example configuration:
{
"origin_video_path": null,
"resampled_video_path": "img_files/3.png",
"audio_path": "wav_files/_sgIH81kj78-Scene-005+audio_full.wav",
"audio_self_path": "wav_files/_sgIH81kj78-Scene-005+audio_v3_1.wav",
"audio_other_path": "wav_files/_sgIH81kj78-Scene-005+audio_v3_0.wav",
"motion_self_path": "img_files/3.npz",
"motion_other_path": null,
"mode": "test_wild",
"dataset_type": "dyadic",
"video_id": "_sgIH81kj78-Scene-005+audio_v3_2"
}
To use your own image and audio:
- Modify the following fields: resampled_video_path, audio_path, audio_self_path, audio_other_path, motion_self_path, and video_id
- Required files:
  - resampled_video_path, audio_self_path, and audio_other_path must exist
  - audio_self_path: speaker audio track
  - audio_other_path: listener audio track
  - audio_path: combined audio containing both speaker and listener tracks. This is only used for final video rendering and audio merging, not for inference
  - motion_self_path can be set by changing the file extension of resampled_video_path to .npz. This file will be automatically generated during runtime if it doesn't exist
  - video_id can be any identifier for organizing your outputs
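Before running batch inference it can help to confirm that the three required files for this scenario are present. The following is a minimal sketch under that reading of the list above; missing_required_files and REQUIRED_FIELDS are hypothetical names, and paths are resolved relative to a root directory you supply.

```python
from pathlib import Path

# Fields that must point at existing files in Scenario 2, per the list above.
REQUIRED_FIELDS = ("resampled_video_path", "audio_self_path", "audio_other_path")


def missing_required_files(entry, root="."):
    """Return (field, path) pairs whose value is unset or whose file is absent under root."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if value is None or not (Path(root) / value).is_file():
            problems.append((field, value))
    return problems


# The Scenario 2 example entry, reduced to the fields this check reads.
entry = {
    "resampled_video_path": "img_files/3.png",
    "audio_self_path": "wav_files/_sgIH81kj78-Scene-005+audio_v3_1.wav",
    "audio_other_path": "wav_files/_sgIH81kj78-Scene-005+audio_v3_0.wav",
}

for field, value in missing_required_files(entry):
    print(f"{field}: {value!r} not found")
```

motion_self_path is left out of the check on purpose, since that file is generated at runtime when it does not exist.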
Citation
If you find this work useful, please consider citing:
@article{chen2025dystream,
title={DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model},
author={Bohong Chen and Haiyang Liu},
journal={ArXiv},
year={2025},
volume={abs/2512.24408},
}