README.md

January 14, 2026 Ā· View on GitHub

MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning


šŸ“– Paper Ā· ⭐ GitHub Ā· šŸ“Š Dataset Ā· šŸ¤— Checkpoints

Key Features:

  • MMDuet2 is a Video MLLM for proactive interaction, which means that it can not only reply right after the user's turn, but also at any approprite and timely moment during the video playback.

  • With only a 3B model, MMDuet2 is lightweight and fast for real-time interaction.

  • Responses are neither too sparse nor too dense and repetitive, which was a common issue in previous works.

  • Example Videos:

https://github.com/user-attachments/assets/bb0ef0d5-99ee-4c4d-8236-87b894381ffb

https://github.com/user-attachments/assets/46a5de08-6c56-4595-a763-bd4ed5d2c02f

Quick Start: A Real-World Demo with your own laptop camera!

Here we assume you have a GPU server as backend, and a laptop with camera as frontend:

  • On the GPU server, create conda environment and start the backend server:
cd demo/
conda create -n mmduet2-infer python=3.10
conda activate mmduet2-infer
pip install -r requirements.txt
python api_server.py
  • Download demo/frontend.py to laptop and start the frontend:
pip install requests, opencv-python
python frondend.py --server_url http://xxx.xxx.xxx.xxx:8000   # (your server ip)

After starting the frontend, you can type in the terminal to input your text, and type "RESET" to remove all previous frames and messages.

Training and Inference

  • For SFT, follow the instructions in train/README.md

  • For RL, follow the instructions in rl/README.md

  • For proactive inference and evaluation, follow the instructions in proactive_eval/README.md

  • When inference on offline video understanding (Video-MME, LongVideoBench, etc.), MMDuet2 is identical to Qwen2.5-VL-Instruct. You can use frameworks including lmms-eval just like working on Qwen2.5-VL.

Star History

Star History Chart

Acknowledgement

We thank the following projects for their open-source contributions: