
June 10, 2026

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

Zijie Xin¹ Jie Yang²,📧 Ruixiang Zhao¹ Tianyi Wang² Fengyun Rao² Jing Lyu² Xirong Li¹,📧

📧 Corresponding authors

¹ Renmin University of China ² WeChat Vision, Tencent Inc.

📢 News

[2026/06/07] 🚀 Released SEATS code for Qwen2.5-Omni-7B, with LMMs-Eval adaptation and baselines.
[2026/05/19] 📄 Paper released on arXiv and project page is online.

SEATS is a training-free, stage-adaptive token selection method for efficient omni-modal LLM inference. A layer-wise analysis of token dependency shows that reliance on visual and audio tokens follows a block-wise pattern and weakens with depth. Building on this observation, SEATS removes spatiotemporal redundancy before the LLM, progressively prunes tokens inside the LLM, and fully removes non-textual tokens in late layers.
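As a rough illustration of how the three stages shrink the non-textual token count layer by layer, consider the sketch below. Every argument (`pre_llm_ratio`, `block_ratios`, `layers_per_block`, `late_block_start`) is a hypothetical knob chosen for this example and does not correspond to the settings in `seats/config.yaml`.

```python
def tokens_per_layer(num_nontext, pre_llm_ratio, block_ratios,
                     layers_per_block, late_block_start, num_layers):
    """Return the number of non-textual tokens visible at each LLM layer.

    All arguments are hypothetical and only illustrate the three stages.
    """
    # Stage I: pre-LLM selection trims redundancy before the first layer.
    kept = round(num_nontext * pre_llm_ratio)
    counts = []
    for layer in range(num_layers):
        if layer >= late_block_start:
            # Stage III: late layers see text tokens only.
            counts.append(0)
            continue
        # Stage II: each block keeps a fraction of the Stage I survivors.
        block = min(layer // layers_per_block, len(block_ratios) - 1)
        counts.append(round(kept * block_ratios[block]))
    return counts


# 1000 audio-visual tokens, 16 layers, blocks of 4 layers, removal from layer 12.
schedule = tokens_per_layer(1000, 0.5, [1.0, 0.5, 0.25], 4, 12, 16)
```

With these made-up values, the first block sees 500 tokens, the next two see 250 and 125, and the final four layers see none.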

✨ Key Highlights

💡 New Insight: Reveals a block-wise dependence pattern in omni-modal LLMs, where reliance on visual and audio tokens weakens with layer depth.
⚡ Strong Efficiency: 9.3x FLOPs reduction and 4.8x prefill speedup at 10% token retention while preserving 96.3% performance.
🎯 Stage-adaptive Design: Diversity-based pre-LLM selection + query-guided inner-LLM progressive pruning + late-layer full removal.
🔌 Broad Compatibility: Plug-and-play and training-free for direct application to Qwen2.5-Omni-7B and Qwen3-Omni-30B.
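For intuition on why dropping tokens yields a more-than-proportional FLOPs saving, the sketch below uses a common back-of-the-envelope count of multiply-accumulates per decoder layer. It is an assumption-laden estimate, not the measurement protocol behind the numbers above: it assumes full-width attention projections and a gated FFN, and it ignores grouped-query attention, the modality encoders, and decoding. The parameters `d` (hidden size) and `m` (FFN width) are whatever the target model uses.

```python
def layer_cost(n, d, m):
    """Approximate multiply-accumulates for one decoder layer over n tokens.

    4*n*d*d : Q, K, V, and output projections
    2*n*n*d : attention scores and weighted sum
    3*n*d*m : gated feed-forward network (three projections)
    """
    return 4 * n * d * d + 2 * n * n * d + 3 * n * d * m


def prefill_reduction(full_counts, pruned_counts, d, m):
    """Ratio of estimated prefill cost, given per-layer token counts."""
    full = sum(layer_cost(n, d, m) for n in full_counts)
    pruned = sum(layer_cost(n, d, m) for n in pruned_counts)
    return full / pruned
```

Because of the quadratic attention term, keeping 10% of the tokens in every layer reduces this estimate by more than 10x, and the gap widens as sequences get longer.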

📅 TODO

Support Qwen2.5-Omni-7B
Release benchmark adaptation code for LMMs-Eval (WorldSense, Daily-Omni, OmniVideoBench, Video-MME, LVOmniBench)
Evaluation scripts and reproduction guide (adapted for LMMs-Eval)
Release more baseline implementations (FastV, VisionZip, Random)
Support Qwen3-Omni-30B
Release more baseline implementations (DivPrune, DyCoke, and OmniZip)
Future work: support more models (OmniVinci-7B)

🏗️ Method

SEATS is a three-stage method:

Pre-LLM Token Selection: Removes spatiotemporal redundancy within each temporal window via attention-weighted diversity selection.
Inner-LLM Token Selection: Progressively prunes tokens using a block-wise token retention ratio (TRR) decay schedule and query-guided, top-down budget allocation (first across windows, then within each window).
Late-block Removal: Removes all remaining non-textual tokens in late layers where cross-modal fusion is complete.
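The first two stages can be sketched in plain Python as follows. This is a simplified stand-in, not the code in `seats/pre_llm_units.py` or `seats/inner_llm_units.py`: Stage I is approximated by a greedy max-min selection weighted by attention, and Stage II by a proportional split of the budget across windows followed by top-k inside each window. The function names and the exact scoring rules are assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + 1e-12)


def select_diverse(features, attn, k):
    """Stage I sketch: greedily keep k tokens of one window that are both
    highly attended and far from the tokens already kept."""
    k = min(k, len(features))
    if k <= 0:
        return []
    selected = [max(range(len(features)), key=lambda i: attn[i])]
    min_dist = [cosine_distance(f, features[selected[0]]) for f in features]
    while len(selected) < k:
        chosen = set(selected)
        best = max((i for i in range(len(features)) if i not in chosen),
                   key=lambda i: attn[i] * min_dist[i])
        selected.append(best)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], cosine_distance(f, features[best]))
    return sorted(selected)


def allocate_budget(relevance, total_budget):
    """Stage II sketch: relevance holds, per window, the query relevance of
    each token. Returns the kept token indices for every window."""
    mass = [sum(w) for w in relevance]
    total = sum(mass) or 1.0
    # Inter-window: split the budget in proportion to each window's relevance.
    quotas = [min(len(w), int(total_budget * m / total))
              for w, m in zip(relevance, mass)]
    # Hand out tokens lost to rounding, most relevant windows first.
    leftover = total_budget - sum(quotas)
    for w in sorted(range(len(relevance)), key=lambda i: -mass[i]):
        extra = min(len(relevance[w]) - quotas[w], max(leftover, 0))
        quotas[w] += extra
        leftover -= extra
    # Intra-window: keep the most query-relevant tokens within each quota.
    return [sorted(sorted(range(len(w)), key=lambda i: -w[i])[:q])
            for w, q in zip(relevance, quotas)]
```

For example, `select_diverse([[1, 0], [1, 0.01], [0, 1], [0.01, 1]], [0.4, 0.3, 0.2, 0.1], 2)` keeps one token from each near-duplicate pair rather than the two most attended tokens.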

🔧 Dependencies and Installation

We use Anaconda to set up a deep learning workspace with PyTorch support. Run the following commands to install all required packages.

# git clone this repository
git clone https://github.com/xxayt/SEATS.git
cd SEATS

# create a new anaconda env
conda create -n SEATS_env python=3.10 -y
conda activate SEATS_env

# install dependencies
bash scripts/base/setup.sh

# install the bundled lmms-eval in editable mode
cd lmms-eval
pip install -e .
cd ..

# (Recommended) install torch and flash-attn
# pip install torch==2.8.0 torchvision==0.23.0
pip install flash-attn --no-build-isolation
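After installation, a quick sanity check such as the one below can confirm that the key packages are importable in the active environment. The helper name `check_packages` is hypothetical, and the default import names (`torch`, `flash_attn`, `lmms_eval`) are assumptions about how the packages above are exposed.

```python
from importlib import metadata, util


def check_packages(names=("torch", "flash_attn", "lmms_eval")):
    """Map each import name to its installed version, or None if missing."""
    report = {}
    for name in names:
        if util.find_spec(name) is None:
            report[name] = None
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but the distribution is published under another name.
            report[name] = "unknown"
    return report
```

Calling `check_packages()` inside `SEATS_env` returns a dictionary; any entry that is `None` points to a package that still needs to be installed.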

🚀 Evaluation

We adapt 5 omni-modal benchmarks into LMMs-Eval, so you can run them directly through this repository. Please first download the corresponding annotation data and videos from the links below.

| Benchmark | Data | Videos | Task name |
| --- | --- | --- | --- |
| Daily-Omni | xxayt/Daily-Omni | liarliar/Daily-Omni | `dailyomni` |
| WorldSense | lmms-lab/WorldSense | lmms-lab/WorldSense | `worldsense` |
| OmniVideoBench | xxayt/OmniVideoBench | NJU-LINK/OmniVideoBench | `omnivideobench` |
| Video-MME | lmms-lab/Video-MME | lmms-lab/Video-MME | `videomme` |
| LVOmniBench | xxayt/LVOmniBench | KD-TAO/LVOmniBench | `lvomnibench` |
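If the entries in the table are Hugging Face Hub dataset repositories (an assumption; the table only lists repository names), the downloads could be scripted roughly as follows. The `data/` layout is hypothetical as well, so check the task configs in `lmms-eval` for the paths the benchmarks actually expect.

```python
# Task name -> (annotation repo, video repo), copied from the table above.
BENCHMARKS = {
    "dailyomni": ("xxayt/Daily-Omni", "liarliar/Daily-Omni"),
    "worldsense": ("lmms-lab/WorldSense", "lmms-lab/WorldSense"),
    "omnivideobench": ("xxayt/OmniVideoBench", "NJU-LINK/OmniVideoBench"),
    "videomme": ("lmms-lab/Video-MME", "lmms-lab/Video-MME"),
    "lvomnibench": ("xxayt/LVOmniBench", "KD-TAO/LVOmniBench"),
}


def download_benchmark(task, root="data"):
    """Fetch annotations and videos for one task into root/<task>/."""
    # Imported lazily so the mapping can be inspected without the package.
    from huggingface_hub import snapshot_download

    paths = []
    # dict.fromkeys drops the duplicate when both columns name the same repo.
    for repo_id in dict.fromkeys(BENCHMARKS[task]):
        local_dir = f"{root}/{task}/{repo_id.replace('/', '__')}"
        # repo_type="dataset" is an assumption about how these repos are hosted.
        paths.append(snapshot_download(repo_id=repo_id, repo_type="dataset",
                                       local_dir=local_dir))
    return paths
```

For instance, `download_benchmark("dailyomni")` would place the two Daily-Omni repositories under `data/dailyomni/`.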

Once the data is ready, launch evaluation with the scripts under `scripts/`. Results are written to `output/`. We implement `qwen2_5_omni_zip` as a unified LMMs-Eval model wrapper that dispatches to SEATS and all baselines for omni-modal LLM token compression.
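The idea of one wrapper dispatching to several compression methods can be pictured as a small registry. The sketch below is illustrative only: `register_method`, `apply_method`, and the dictionary standing in for a model are hypothetical and do not reflect the actual signature of the dispatcher in `baselines/utils.py`.

```python
_PATCHES = {}


def register_method(name):
    """Decorator that records a patch function under a method name."""
    def decorator(fn):
        _PATCHES[name] = fn
        return fn
    return decorator


@register_method("full_tokens")
def _patch_full_tokens(model, config):
    # No compression: the model is returned unchanged.
    return model


@register_method("random")
def _patch_random(model, config):
    # A real patch would replace forward methods; here we only tag the model.
    model["compression"] = ("random", config.get("ratio", 1.0))
    return model


def apply_method(model, method, config=None):
    """Look up the patch registered for `method` and apply it to `model`."""
    try:
        patch = _PATCHES[method]
    except KeyError:
        raise ValueError(
            f"unknown method {method!r}; choose from {sorted(_PATCHES)}"
        ) from None
    return patch(model, config or {})
```

A registry of this kind keeps each method in its own module while letting a single model wrapper select one by name at launch time.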

Full tokens

bash scripts/eval_qwen2_5_omni_full_tokens.sh

SEATS (our method)

To evaluate our SEATS method on the five benchmarks, use the following command:

bash scripts/eval_qwen2_5_omni_seats.sh

You can customize the compression settings by editing:

`scripts/eval_qwen2_5_omni_seats.sh`: `tasks_list` (which benchmarks to run) and `ratio_pairs` (per-modality token retention budgets, swept over multiple settings).
`seats/config.yaml`: SEATS method hyperparameters (e.g., progressive drop layers, late-block layer, window size).
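Conceptually, the script sweeps every benchmark against every ratio pair. The sketch below builds the corresponding launch commands in Python. The `lmms_eval` flags (`--model`, `--model_args`, `--tasks`, `--batch_size`, `--output_path`) are standard LMMs-Eval options, but the `model_args` keys `zip_method`, `video_ratio`, and `audio_ratio`, the (video, audio) ordering of each pair, and the output layout are hypothetical placeholders; the real names live in the shell scripts.

```python
import itertools
import shlex


def build_commands(tasks, ratio_pairs, method="seats", output_root="output"):
    """Return one launch command per (task, ratio pair) combination."""
    commands = []
    for task, (video_ratio, audio_ratio) in itertools.product(tasks, ratio_pairs):
        model_args = ",".join([
            "pretrained=Qwen/Qwen2.5-Omni-7B",
            f"zip_method={method}",        # hypothetical key
            f"video_ratio={video_ratio}",  # hypothetical key
            f"audio_ratio={audio_ratio}",  # hypothetical key
        ])
        argv = [
            "accelerate", "launch", "-m", "lmms_eval",
            "--model", "qwen2_5_omni_zip",
            "--model_args", model_args,
            "--tasks", task,
            "--batch_size", "1",
            "--output_path",
            f"{output_root}/{method}/{task}/v{video_ratio}_a{audio_ratio}",
        ]
        commands.append(shlex.join(argv))
    return commands


commands = build_commands(["dailyomni", "worldsense"], [(0.1, 0.1), (0.2, 0.1)])
```

Two tasks and two ratio pairs yield four commands, one per combination, each writing to its own output directory.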

Baselines

We also provide the following scripts to evaluate the baseline methods adapted for omni-modal LLMs:

bash scripts/eval_qwen2_5_omni_random.sh         # Random
bash scripts/eval_qwen2_5_omni_fastv.sh          # FastV
bash scripts/eval_qwen2_5_omni_fastv_omni.sh     # FastV-om
bash scripts/eval_qwen2_5_omni_visionzip.sh      # VisionZip
bash scripts/eval_qwen2_5_omni_visionzip_omni.sh # VisionZip-om

... # more to be added

📁 Repo Structure

SEATS/
├── scripts/                          # Shell entry points (one per method) + shared base
│   ├── base/
│   │   ├── setup.sh                  # Python dependency installation
│   │   └── eval_qwen2_5_omni_zip.sh  # Shared accelerate + lmms-eval launcher
│   ├── eval_qwen2_5_omni_seats.sh    # SEATS (our method)
│   └── ...
├── seats/                            # SEATS three-stage implementation
│   ├── pre_llm_units.py              # Stage I: winDivPrune
│   ├── inner_llm_units.py            # Stage II: inner-LLM stage-adaptive selection
│   ├── ratio_decay_scheduler.py      # block-wise TRR decay schedule
│   ├── modeling_qwen2_5_omni_seats.py # patched Thinker / TextModel forwards
│   └── config.yaml                   # SEATS hyperparameters
├── baselines/                        # Per-method patches; one subfolder per baseline
│   ├── utils.py                      # apply_zip_method_patch() dispatcher
│   ├── full_tokens/                  # No compression (config only)
│   ├── visionzip_omni/               # VisionZip adapted for omni-modal
│   └── ...
├── models/qwen2_5_omni/              # Vendored Qwen2.5-Omni model code
└── lmms-eval/                        # Vendored LMMs-Eval (registers `qwen2_5_omni_zip`)

🤝 Acknowledgement

This implementation relies on resources from Qwen2.5-Omni, Qwen3-Omni, LMMs-Eval, OmniZip, VisionZip, and DivPrune. We thank the original authors for their excellent contributions and for making their work publicly available.

✏️ Citation

If you find this work useful, please consider citing:

@article{xin2026seats,
  title={Stage-adaptive Token Selection for Efficient Omni-modal LLMs},
  author={Xin, Zijie and Yang, Jie and Zhao, Ruixiang and Wang, Tianyi and Rao, Fengyun and Lyu, Jing and Li, Xirong},
  journal={arXiv preprint arXiv:2605.20035},
  year={2026}
}

📜 License

This project is licensed under the MIT License. For commercial licensing or any use beyond research, please contact the authors.

📬 Contact for Issues

For any questions about this project (e.g., corrupted files or loading errors), please reach out at: xinzijie@ruc.edu.cn