README.md
January 14, 2026 Ā· View on GitHub
MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning
š Paper Ā· ā GitHub Ā· š Dataset Ā· š¤ Checkpoints
Key Features:
-
MMDuet2 is a Video MLLM for proactive interaction, which means that it can not only reply right after the user's turn, but also at any approprite and timely moment during the video playback.
-
With only a 3B model, MMDuet2 is lightweight and fast for real-time interaction.
-
Responses are neither too sparse nor too dense and repetitive, which was a common issue in previous works.
-
Example Videos:
https://github.com/user-attachments/assets/bb0ef0d5-99ee-4c4d-8236-87b894381ffb
https://github.com/user-attachments/assets/46a5de08-6c56-4595-a763-bd4ed5d2c02f
Quick Start: A Real-World Demo with your own laptop camera!
Here we assume you have a GPU server as backend, and a laptop with camera as frontend:
- On the GPU server, create conda environment and start the backend server:
cd demo/
conda create -n mmduet2-infer python=3.10
conda activate mmduet2-infer
pip install -r requirements.txt
python api_server.py
- Download
demo/frontend.pyto laptop and start the frontend:
pip install requests, opencv-python
python frondend.py --server_url http://xxx.xxx.xxx.xxx:8000 # (your server ip)
After starting the frontend, you can type in the terminal to input your text, and type "RESET" to remove all previous frames and messages.
Training and Inference
-
For SFT, follow the instructions in train/README.md
-
For RL, follow the instructions in rl/README.md
-
For proactive inference and evaluation, follow the instructions in proactive_eval/README.md
-
When inference on offline video understanding (Video-MME, LongVideoBench, etc.), MMDuet2 is identical to Qwen2.5-VL-Instruct. You can use frameworks including lmms-eval just like working on Qwen2.5-VL.
Star History
Acknowledgement
We thank the following projects for their open-source contributions: