March 1, 2026

Speak While Watching: Unleashing TRUE Real-Time Video Understanding Capability of Multimodal Large Language Models

Junyan Lin¹,², Junlong Tong²,³, Hao Wu², Jialiang Zhang²,⁴, Jinming Liu²,³, Xin Jin², Xiaoyu Shen²

¹Department of Computing, The Hong Kong Polytechnic University

²Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT

³Shanghai Jiao Tong University

⁴Ocean University of China

This repository is the unified implementation for **Speak While Watching: Unleashing TRUE Real-Time Video Understanding Capability of Multimodal Large Language Models**.

The project trains and evaluates on the PE-Video dataset, using its human_caption field as ground truth.

The figure compares three paradigms for video description:

- Offline: the model first watches the whole video and then describes it, which can cause temporal misalignment.
- Interleaved streaming: perception and generation alternate frame by frame, improving responsiveness but still constrained by positional continuity.
- Parallel streaming (ours): positional continuity between input and output is relaxed, enabling perception and generation to run in parallel for true real-time understanding.
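To make the contrast concrete, the toy sketch below lists the order in which frames (F) and caption chunks (T) would be handled under each paradigm. It rests on simplifying assumptions that are not taken from this repository: one caption chunk per frame, and, in the parallel case, a one-step lag between reading a frame and emitting its text. The function names are hypothetical, and the sketch does not reproduce the position-id design used here.

```python
def offline(n):
    # Watch every frame first, then generate all text.
    frames = [f"F{i}" for i in range(1, n + 1)]
    texts = [f"T{i}" for i in range(1, n + 1)]
    return frames + texts


def interleaved(n):
    # One serial sequence: read a frame, write its text, repeat.
    steps = []
    for i in range(1, n + 1):
        steps += [f"F{i}", f"T{i}"]
    return steps


def parallel(n):
    # Two streams advance together. At step k the model reads frame k
    # while emitting text for frame k-1 (assumed one-step lag).
    steps = []
    for k in range(1, n + 2):
        frame = f"F{k}" if k <= n else None
        text = f"T{k - 1}" if k >= 2 else None
        steps.append((frame, text))
    return steps
```

Under these assumptions, the offline and interleaved schedules both take 2n serial steps for n frames, while the parallel schedule takes n + 1 steps because reading and writing overlap.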

What Is Kept

- One training script: code/qwen-vl-finetune/scripts/sft.sh
- One evaluation launcher: eval.sh
- One dataset preparation script: mani_data.py

1) Environment Setup

From project root:

# 1) Create and activate conda env
conda create -n qwen_stream python=3.10 -y
conda activate qwen_stream

# 2) Upgrade pip
pip install --upgrade pip

# 3) Install project dependencies
pip install -r requirements.txt

# 4) Install local qwen-vl-utils package used by training/eval
pip install -e code/qwen-vl-utils

Optional (speed-up): install flash-attn if your CUDA/toolchain is compatible.
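After installation, a quick check like the following can confirm that the environment looks as expected. This is a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes the import names `qwen_vl_utils` and `flash_attn` for the two packages.

```python
import importlib.util
import sys


def has_module(name: str) -> bool:
    """Return True if a top-level module `name` is importable here."""
    return importlib.util.find_spec(name) is not None


def environment_report() -> dict:
    """Summarize the pieces of the setup described above."""
    return {
        # The conda env above is created with Python 3.10.
        "python_3_10": sys.version_info[:2] == (3, 10),
        # Installed in step 4 from code/qwen-vl-utils.
        "qwen_vl_utils": has_module("qwen_vl_utils"),
        # Optional speed-up; False just means it is not installed.
        "flash_attn": has_module("flash_attn"),
    }
```

Calling `environment_report()` inside the activated `qwen_stream` environment returns a dictionary of booleans; a `False` for `flash_attn` is not an error, since that package is optional.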

2) Prepare PE-Video Dataset

From project root:

python mani_data.py

What mani_data.py does:

- checks local dataset/PE-Video/train and dataset/PE-Video/test
- if missing, downloads facebook/PE-Video from Hugging Face
- filters with:
  - 5 <= video_duration_in_s <= 30
  - 3 <= token_len(human_caption) / duration <= 5
- writes:
  - dataset/train_3_5.jsonl
  - dataset/test_3_5.jsonl

Tokenizer for counting tokens defaults to:

Qwen/Qwen2.5-VL-7B-Instruct
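The two filter conditions can be sketched as follows. This is an illustration rather than the code in mani_data.py: the function names are hypothetical, the record layout (a dict holding `video_duration_in_s` and `human_caption`) is assumed, and a whitespace split stands in for the Qwen tokenizer so that the example runs without downloading a model.

```python
from typing import Callable, Iterable, List


def whitespace_token_count(text: str) -> int:
    # Stand-in for the real tokenizer (assumption for this sketch).
    return len(text.split())


def keep_sample(
    duration_s: float,
    caption: str,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> bool:
    """Apply both filters: duration in [5, 30] s, 3 to 5 tokens per second."""
    if not 5 <= duration_s <= 30:
        return False
    tokens_per_second = count_tokens(caption) / duration_s
    return 3 <= tokens_per_second <= 5


def filter_records(
    records: Iterable[dict],
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[dict]:
    """Keep only records that pass keep_sample (assumed record layout)."""
    return [
        r
        for r in records
        if keep_sample(r["video_duration_in_s"], r["human_caption"], count_tokens)
    ]
```

To approximate the real behavior, `count_tokens` could be replaced by a function that tokenizes with a `transformers.AutoTokenizer` loaded from Qwen/Qwen2.5-VL-7B-Instruct. How mani_data.py counts tokens in detail (for example, whether special tokens are included) is not stated here, so counts may differ slightly.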

3) Training

From project root:

cd code/qwen-vl-finetune
export QWEN2_5_VL_VARIANT=origin   # origin / batch / group / gap / overlap / interleave
export QWEN2_5_VL_MODEL_PATH=Qwen/Qwen2.5-VL-7B-Instruct
export QWEN2_5_VL_TRAIN_DATA=../dataset/train_3_5.jsonl
bash scripts/sft.sh

Output checkpoint:

code/qwen-vl-finetune/output/qwen2_5vl-pe-${QWEN2_5_VL_VARIANT}
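Because the checkpoint directory is derived from the variant name, a small helper can catch typos before a long training run. The sketch below is an assumption-laden convenience, not something shipped in the repository: `VARIANTS` simply mirrors the six names listed in the comment above, and `checkpoint_dir` is a hypothetical function.

```python
from pathlib import Path

# The six variant names listed next to QWEN2_5_VL_VARIANT above.
VARIANTS = ("origin", "batch", "group", "gap", "overlap", "interleave")


def checkpoint_dir(
    variant: str, root: str = "code/qwen-vl-finetune/output"
) -> Path:
    """Return the expected output directory for a given variant."""
    if variant not in VARIANTS:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {', '.join(VARIANTS)}"
        )
    return Path(root) / f"qwen2_5vl-pe-{variant}"
```

For example, `checkpoint_dir("gap")` yields code/qwen-vl-finetune/output/qwen2_5vl-pe-gap, relative to the project root.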

4) Evaluation

From project root:

export QWEN2_5_VL_VARIANT=origin
export QWEN2_5_VL_EVAL_MODEL_PATH=code/qwen-vl-finetune/output/qwen2_5vl-pe-origin
export QWEN2_5_VL_EVAL_DATA_DIR=dataset/test_3_5.jsonl
export QWEN2_5_VL_EVAL_IMG_ROOT=dataset/PE-Video/videos/test
bash eval.sh

Output file defaults to:

code/evaluation/output/${QWEN2_5_VL_VARIANT}_infer.json
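When evaluating several variants, the four environment variables can be derived from the variant name instead of being exported by hand. The following is a hedged sketch of such a wrapper; `eval_env`, `infer_output_path`, and `run_eval` are hypothetical names, and the paths simply repeat the defaults shown above.

```python
import os
import subprocess


def eval_env(variant: str) -> dict:
    """Build the environment variables used by eval.sh for one variant."""
    return {
        "QWEN2_5_VL_VARIANT": variant,
        "QWEN2_5_VL_EVAL_MODEL_PATH": (
            f"code/qwen-vl-finetune/output/qwen2_5vl-pe-{variant}"
        ),
        "QWEN2_5_VL_EVAL_DATA_DIR": "dataset/test_3_5.jsonl",
        "QWEN2_5_VL_EVAL_IMG_ROOT": "dataset/PE-Video/videos/test",
    }


def infer_output_path(variant: str) -> str:
    """Default location of the inference results for one variant."""
    return f"code/evaluation/output/{variant}_infer.json"


def run_eval(variant: str) -> str:
    """Run eval.sh from the project root and return the output path."""
    env = {**os.environ, **eval_env(variant)}
    subprocess.run(["bash", "eval.sh"], env=env, check=True)
    return infer_output_path(variant)
```

`run_eval("origin")` is meant to be called from the project root, matching the manual commands above; it raises `subprocess.CalledProcessError` if eval.sh exits with a non-zero status.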

Note on Parallelism

This repository does not implement true multi-GPU parallel execution of simultaneous perception and generation. The current implementation focuses on redesigning the position ids and adapting the model to the redesigned position ids. We plan to release a true multi-GPU parallel implementation for end-to-end real-time deployment in future work.
