README.md

January 14, 2026 · View on GitHub

MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning

📖 Paper · ⭐ GitHub · 📊 Dataset · 🤗 Checkpoints

Key Features:

MMDuet2 is a Video MLLM for proactive interaction, which means that it can not only reply right after the user's turn, but also at any approprite and timely moment during the video playback.
With only a 3B model, MMDuet2 is lightweight and fast for real-time interaction.
Responses are neither too sparse nor too dense and repetitive, which was a common issue in previous works.
Example Videos:

Here we assume you have a GPU server as backend, and a laptop with camera as frontend:

cd demo/
conda create -n mmduet2-infer python=3.10
conda activate mmduet2-infer
pip install -r requirements.txt
python api_server.py

pip install requests, opencv-python
python frondend.py --server_url http://xxx.xxx.xxx.xxx:8000   # (your server ip)

After starting the frontend, you can type in the terminal to input your text, and type "RESET" to remove all previous frames and messages.

For SFT, follow the instructions in train/README.md
For RL, follow the instructions in rl/README.md
For proactive inference and evaluation, follow the instructions in proactive_eval/README.md
When inference on offline video understanding (Video-MME, LongVideoBench, etc.), MMDuet2 is identical to Qwen2.5-VL-Instruct. You can use frameworks including lmms-eval just like working on Qwen2.5-VL.

We thank the following projects for their open-source contributions: