DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model

February 7, 2026 · View on GitHub

Paper | Webpage | Online Gradio Web Demo | Wandb Training Logs

We are gradually releasing the code for this project.

TODO List

Offline video generation
Gradio Demo
Online video generation
Training code

Online Demo

You can try out our model directly via the Online Gradio Web Demo.

Setup

Minimum Requirement: GPU with 10GB VRAM.

Environment

Create a Python environment using conda:

conda create -n dystream_py11 python=3.11
conda activate dystream_py11
pip install -r requirements.txt

Download Checkpoints

Download the required checkpoints and tools:

git clone https://huggingface.co/robinwitch/DyStream
cd DyStream
mv tools ../
mv checkpoints ../
cd ..
rm -rf DyStream

Quick Start

Launch the Gradio Web Demo

CUDA_VISIBLE_DEVICES=0 python -u app.py

Alternatively, you can start the demo using the provided shell script: Run the demo with a single command:

bash run.sh

Batch Inference with Custom Data

Configuration

Configuration files can be referenced and changed in data_json/sample_files.json. We provide examples for two scenarios:

Speaker audio only
Speaker and listener audio tracks

Scenario 1: Speaker Audio Only

Example configuration:

{
    "origin_video_path": null,
    "resampled_video_path": "img_files/11.png",
    "audio_path": "wav_files/11.wav",
    "audio_self_path": "wav_files/11.wav",
    "audio_other_path": null,
    "motion_self_path": "img_files/11.npz",
    "motion_other_path": null,
    "mode": "test_wild",
    "dataset_type": "dyadic",
    "video_id": "single_speaker_11_11"
}

To use your own image and audio:

Modify the following fields: resampled_video_path, audio_path, audio_self_path, motion_self_path, and video_id
Required files: resampled_video_path and audio_self_path must exist
audio_path should be identical to audio_self_path in this scenario
motion_self_path can be set by changing the file extension of resampled_video_path to .npz. This file will be automatically generated during runtime if it doesn't exist
video_id can be any identifier for organizing your outputs

Scenario 2: Speaker and Listener Audio

Example configuration:

{
    "origin_video_path": null,
    "resampled_video_path": "img_files/3.png",
    "audio_path": "wav_files/_sgIH81kj78-Scene-005+audio_full.wav",
    "audio_self_path": "wav_files/_sgIH81kj78-Scene-005+audio_v3_1.wav",
    "audio_other_path": "wav_files/_sgIH81kj78-Scene-005+audio_v3_0.wav",
    "motion_self_path": "img_files/3.npz",
    "motion_other_path": null,
    "mode": "test_wild",
    "dataset_type": "dyadic",
    "video_id": "_sgIH81kj78-Scene-005+audio_v3_2"
}

To use your own image and audio:

Modify the following fields: resampled_video_path, audio_path, audio_self_path, audio_other_path, motion_self_path, and video_id
Required files: resampled_video_path, audio_self_path, and audio_other_path must exist
audio_self_path: speaker audio track
audio_other_path: listener audio track
audio_path: combined audio containing both speaker and listener tracks. This is only used for final video rendering and audio merging, not for inference
motion_self_path can be set by changing the file extension of resampled_video_path to .npz. This file will be automatically generated during runtime if it doesn't exist
video_id can be any identifier for organizing your outputs

Citation

If you find this work useful, please consider citing:

@article{chen2025dystream,
  title={DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model},
  author={Bohong Chen and Haiyang Liu},
  journal={ArXiv},
  year={2025},
  volume={abs/2512.24408},
}