
April 13, 2026

Tango

Tango: Taming Visual Signals for Efficient Video Large Language Models

TL;DR: We identify critical gaps in video token pruning and advance two predominant paradigms: saliency-based token selection (🔑 signal: attention weight) and diversity-oriented token merging (🔑 signal: cosine similarity).

✨ Highlights

Main contributions: We analyze the characteristics of two pivotal visual signals, attention weight and cosine similarity, and rethink how to use them more effectively (see our paper for more details and findings).

Figure (Tango teaser): (Left) Motivation of our method. (Right) Method overview.

Attention Weight: The distribution is multi-modal and long-tailed, which a vanilla Top-k strategy fails to capture accurately.
➡️ Our approach: Expand the candidate set (cover the tail) and perform intra-cluster selection (cover diverse modes).
Cosine Similarity: Direct similarity-based clustering often creates fragmented clusters, leading to noisy representations after average pooling.
➡️ Our approach: Inject a spatio-temporal locality prior (for smoothness) using our proposed ST-RoPE.
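The contrast between vanilla Top-k and the "expand, then select within clusters" idea can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the Tango implementation: the function names, the `expand` factor, the tiny k-means helper, and the round-robin selection rule are all hypothetical choices made for illustration.

```python
import numpy as np

def topk_select(attn, budget):
    """Vanilla Top-k: keep the `budget` tokens with the highest attention."""
    return np.sort(np.argsort(-attn, kind="stable")[:budget])

def kmeans_labels(x, k, iters=10):
    """Tiny k-means (hypothetical helper); returns one cluster label per row."""
    centers = x[np.linspace(0, len(x) - 1, k).astype(int)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean(0)
    return labels

def expanded_cluster_select(feats, attn, budget, expand=3, n_clusters=4):
    """Assumed variant: enlarge the candidate pool, then pick per cluster."""
    # 1) Expanded candidate set: top (expand * budget) tokens, so that
    #    tokens in the long tail are still considered.
    n_cand = min(len(attn), expand * budget)
    budget = min(budget, n_cand)
    cand = np.argsort(-attn, kind="stable")[:n_cand]
    # 2) Group candidates by feature similarity (one cluster per "mode").
    labels = kmeans_labels(feats[cand], min(n_clusters, n_cand))
    groups = [list(cand[labels == c]) for c in range(labels.max() + 1)]
    # 3) Round-robin over clusters, always taking the most-attended
    #    remaining token of each cluster (cand is sorted by attention).
    keep = []
    while len(keep) < budget:
        for g in groups:
            if g and len(keep) < budget:
                keep.append(g.pop(0))
    return np.sort(np.array(keep))

# Toy data: two feature modes; the second mode has lower attention overall.
feats = np.array([[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 6)
attn = np.array([0.90, 0.88, 0.86, 0.84, 0.82, 0.80,
                 0.50, 0.48, 0.46, 0.44, 0.42, 0.40])

topk_idx = topk_select(attn, budget=4)
tango_like_idx = expanded_cluster_select(feats, attn, budget=4,
                                         expand=3, n_clusters=2)
print("Top-k keeps:", topk_idx)
print("Expanded + intra-cluster keeps:", tango_like_idx)
```

In this toy setting, Top-k spends its whole budget on the first mode, while the cluster-aware variant keeps tokens from both modes. The real method may differ in how candidates are expanded and how clusters are formed.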

🛠️ Quick Setup

Create a conda virtual environment and install the required packages.

conda create -n Tango python=3.10
conda activate Tango
pip install -r requirements.txt

Install Flash Attention 2.

pip install -U flash-attn --no-build-isolation

Install evaluation frameworks.

# For main performance evaluation
pip install -e ./VLMEvalKit
# For efficiency analysis
pip install -e ./lmms-eval
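After installation, a quick sanity check can confirm that the packages are visible to the active environment. The sketch below assumes the import names `flash_attn`, `vlmeval`, and `lmms_eval`; these are assumptions about how the installed packages expose themselves, and the helper `missing_modules` is hypothetical.

```python
from importlib.util import find_spec

def missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]

# Assumed top-level import names for the packages installed above.
required = ["flash_attn", "vlmeval", "lmms_eval"]
missing = missing_modules(required)
if missing:
    print("Not found in this environment:", ", ".join(missing))
else:
    print("All required modules were found.")
```

The check only locates the modules and does not import them, so it will not catch a Flash Attention build that is present but incompatible with the installed CUDA or PyTorch version.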

💡 Evaluation

Performance Evaluation

We adopt the VLMEvalKit framework for performance evaluation, with retention ratios in {0.1, 0.15, 0.2}.
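To get a feel for what these retention ratios mean in terms of token budget, the sketch below converts a ratio into a number of retained visual tokens. The token count (32 frames with 196 tokens each) and the floor-rounding rule are assumptions for illustration only; the actual frame sampling and rounding used by the evaluation scripts may differ.

```python
import math

def tokens_to_keep(num_visual_tokens, retention_ratio):
    """Number of visual tokens kept at a given ratio (floor, at least 1)."""
    if not 0.0 < retention_ratio <= 1.0:
        raise ValueError("retention_ratio must be in (0, 1]")
    return max(1, math.floor(num_visual_tokens * retention_ratio))

# Hypothetical input size: 32 frames x 196 tokens per frame.
num_tokens = 32 * 196
for ratio in (0.1, 0.15, 0.2):
    print(f"ratio={ratio}: keep {tokens_to_keep(num_tokens, ratio)} of {num_tokens}")
```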

We currently support LLaVA-OneVision-7B, LLaVA-Video-7B, and Qwen2.5-VL-7B models.

cd VLMEvalKit/
# Evaluate on LLaVA-OneVision-7B
bash run_eval_ov.sh
# Evaluate on LLaVA-Video-7B (w/ intra-LLM pruning)
bash run_eval_video.sh
# Evaluate on Qwen2.5-VL-7B
bash run_eval_qwen.sh

Efficiency Profiling

We adopt the lmms-eval framework for efficiency profiling. The following sample script runs the evaluation at a retention ratio of 0.1.

WRAPPER=tango accelerate launch --num_processes=8 \
  -m lmms_eval \
  --model llava_onevision \
  --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,attn_implementation=flash_attention_2 \
  --tasks videomme \
  --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_onevision \
  --output_path ./logs/

For reference, our results with 8 NVIDIA A800 GPUs are:

Metric	Value
Total_runtime (s)	1315.02
Total_GPU_runtime (s)	198.31
Peak_mem (GB)	18.61
Avg_ViT_Time (ms)	335.57
Avg_Other_Time (ms)	146.59
Avg_LLM_Prefill_Time (ms)	80.64
Avg_Total_TTFT (ms)	562.80
Avg_Decoding_Throughput (token/s)	83.66
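The three per-stage times in the table add up to the reported time to first token (TTFT). The short sketch below, which simply reuses the numbers above, checks that sum and computes each stage's share; the variable names are illustrative.

```python
# Per-stage averages from the table above, in milliseconds.
ttft_parts_ms = {
    "Avg_ViT_Time": 335.57,
    "Avg_Other_Time": 146.59,
    "Avg_LLM_Prefill_Time": 80.64,
}
reported_ttft_ms = 562.80

total_ms = sum(ttft_parts_ms.values())
shares = {name: value / total_ms for name, value in ttft_parts_ms.items()}

print(f"Sum of stages: {total_ms:.2f} ms (reported: {reported_ttft_ms:.2f} ms)")
for name, share in shares.items():
    print(f"{name}: {share:.1%} of TTFT")
```

With these numbers, the vision encoder accounts for roughly 60% of TTFT, while LLM prefill contributes about 14%.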

Related projects:

Sparrow: An efficient training scheme for video LLMs.
Awesome-MLLM: A project keeping track of new papers and the latest developments in the field of MLLMs.

🌻 Acknowledgement

Inspiring works with open-source implementations: VisionZip, FastVID, HoliTom.
Our efficiency profiling implementation builds upon VidCom².

🖋️ Citation

If you find our project useful, please consider citing our paper:

@article{yin2026tango,
  title={Tango: Taming Visual Signals for Efficient Video Large Language Models},
  author={Yin, Shukang and Zhao, Sirui and Wang, Hanchao and Jia, Baozhi and Wang, Xianquan and Fu, Chaoyou and Chen, Enhong},
  journal={arXiv preprint arXiv:2604.09547},
  year={2026}
}