README.md
June 10, 2026
Stage-adaptive Token Selection for Efficient Omni-modal LLMs
News
- [2026/06/07] Released SEATS code for Qwen2.5-Omni-7B, with LMMs-Eval adaptation and baselines.
- [2026/05/19] Paper released on arXiv and project page is online.
Overview
SEATS is a training-free, stage-adaptive token selection method for efficient omni-modal LLM inference. By analyzing layer-wise token dependency, it reveals that visual and audio dependencies follow a block-wise pattern and weaken with depth. SEATS removes spatiotemporal redundancy before the LLM, progressively prunes tokens inside the LLM, and fully removes non-textual tokens in late layers.
Key Highlights
- New Insight: Reveals a block-wise dependence pattern in omni-modal LLMs, where reliance on visual and audio tokens weakens with layer depth.
- Strong Efficiency: 9.3x FLOPs reduction and 4.8x prefill speedup at 10% token retention while preserving 96.3% performance.
- Stage-adaptive Design: Diversity-based pre-LLM selection + query-guided inner-LLM progressive pruning + late-layer full removal.
- Broad Compatibility: Plug-and-play and training-free for direct application to Qwen2.5-Omni-7B and Qwen3-Omni-30B.
TODO
- Support Qwen2.5-Omni-7B
- Release benchmark adaptation code for LMMs-Eval (WorldSense, Daily-Omni, OmniVideoBench, Video-MME, LVOmniBench)
- Evaluation scripts and reproduction guide (adapted for LMMs-Eval)
- Release more baseline implementations (FastV, VisionZip, Random)
- Support Qwen3-Omni-30B
- Release more baseline implementations (DivPrune, DyCoke, and OmniZip)
- Future work: support more models (OmniVinci-7B)
Method
SEATS is a three-stage method:
- Pre-LLM Token Selection: Removes spatiotemporal redundancy within each temporal window via attention-weighted diversity selection.
- Inner-LLM Token Selection: Progressively prunes tokens with a block-wise token retention ratio decay schedule and top-down budget allocation (inter-window then intra-window) guided by query relevance.
- Late-block Removal: Removes all remaining non-textual tokens in late layers where cross-modal fusion is complete.
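The three stages above can be sketched in plain Python. This is an illustrative sketch only, not the code in seats/: the function names, the "attention times distance" selection rule, the schedule format, and the rounding rule for budgets are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def diversity_select(features: Sequence[Sequence[float]],
                     attn: Sequence[float], k: int) -> List[int]:
    """Stage I (assumed form): greedily keep the token that maximizes
    attention weight * distance to the already kept set, within one window."""
    n = len(features)
    k = min(k, n)
    if k <= 0:
        return []
    first = max(range(n), key=lambda i: attn[i])
    kept = [first]
    min_dist = [cosine_distance(f, features[first]) for f in features]
    while len(kept) < k:
        rest = [i for i in range(n) if i not in kept]
        best = max(rest, key=lambda i: attn[i] * min_dist[i])
        kept.append(best)
        for i in range(n):
            d = cosine_distance(features[i], features[best])
            min_dist[i] = min(min_dist[i], d)
    return sorted(kept)


def retention_schedule(num_layers: int, block_starts: Sequence[int],
                       block_ratios: Sequence[float],
                       late_layer: int) -> List[float]:
    """Stages II and III (assumed form): per-layer fraction of non-textual
    tokens kept. The ratio changes at each block start and drops to 0.0
    from late_layer onward."""
    if len(block_starts) != len(block_ratios):
        raise ValueError("block_starts and block_ratios must have equal length")
    boundaries = dict(zip(block_starts, block_ratios))
    ratios, current = [], 1.0
    for layer in range(num_layers):
        current = boundaries.get(layer, current)
        ratios.append(0.0 if layer >= late_layer else current)
    return ratios


def allocate_window_budgets(relevance: Sequence[float],
                            window_sizes: Sequence[int],
                            total_budget: int) -> List[int]:
    """Inter-window step (assumed form): split a token budget across temporal
    windows in proportion to query relevance, capped by each window's size."""
    if not window_sizes:
        return []
    total_budget = max(0, min(total_budget, sum(window_sizes)))
    weights = list(relevance) if sum(relevance) > 0 else [1.0] * len(window_sizes)
    total = sum(weights)
    budgets = [min(size, int(total_budget * w / total))
               for w, size in zip(weights, window_sizes)]
    leftover = total_budget - sum(budgets)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    while leftover > 0:  # hand out rounding leftovers, most relevant first
        for i in order:
            if leftover == 0:
                break
            if budgets[i] < window_sizes[i]:
                budgets[i] += 1
                leftover -= 1
    return budgets
```

In this sketch, the intra-window step would then keep the highest-scoring tokens inside each window up to its allocated budget. The actual hyperparameters (progressive drop layers, late-block layer, window size) live in seats/config.yaml.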
Dependencies and Installation
We use Anaconda to set up a deep learning environment with PyTorch support. Run the following commands to install the required packages.
# git clone this repository
git clone https://github.com/xxayt/SEATS.git
cd SEATS
# create a new anaconda env
conda create -n SEATS_env python=3.10 -y
conda activate SEATS_env
# install dependencies
bash scripts/base/setup.sh
# install the bundled lmms-eval in editable mode
cd lmms-eval
pip install -e .
cd ..
# (Recommended) install torch and flash-attn
# pip install torch==2.8.0 torchvision==0.23.0
pip install flash-attn --no-build-isolation
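After installation, a quick sanity check can confirm that the key packages are importable. The snippet below is a minimal sketch, not part of the repository. It assumes the import names torch, flash_attn, and lmms_eval, and the helper name check_packages is made up for this example.

```python
import importlib.util
from typing import Dict, Iterable


def check_packages(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether this interpreter can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    status = check_packages(["torch", "flash_attn", "lmms_eval"])
    for name, found in status.items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
    if status["torch"]:
        import torch  # imported lazily so the check itself never fails on a missing torch
        print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```

Run it inside the activated SEATS_env environment. A MISSING entry means the corresponding install step should be repeated.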
Evaluation
We adapt five omni-modal benchmarks to LMMs-Eval, so you can run them directly from this repository. First download the corresponding annotation data and videos from the sources below.
| Benchmark | Data | Videos | Task name |
|---|---|---|---|
| Daily-Omni | xxayt/Daily-Omni | liarliar/Daily-Omni | dailyomni |
| WorldSense | lmms-lab/WorldSense | lmms-lab/WorldSense | worldsense |
| OmniVideoBench | xxayt/OmniVideoBench | NJU-LINK/OmniVideoBench | omnivideobench |
| Video-MME | lmms-lab/Video-MME | lmms-lab/Video-MME | videomme |
| LVOmniBench | xxayt/LVOmniBench | KD-TAO/LVOmniBench | lvomnibench |
Once the data is ready, launch evaluation with the scripts under scripts/. Results are written to output/. We implement qwen2_5_omni_zip as a unified LMMs-Eval model wrapper that dispatches to SEATS and all baselines for omni-modal LLM token compression.
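To illustrate the idea of one wrapper dispatching to several compression methods, here is a minimal registry sketch. It is hypothetical and does not reproduce the repository's dispatcher (apply_zip_method_patch() in baselines/utils.py). The names register and apply_patch, and the dict-based config, are assumptions for the example.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a method name to a function that patches a config.
_PATCHES: Dict[str, Callable[[dict], dict]] = {}


def register(name: str) -> Callable[[Callable[[dict], dict]], Callable[[dict], dict]]:
    """Decorator that records a patch function under a method name."""
    def decorator(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        _PATCHES[name] = fn
        return fn
    return decorator


@register("full_tokens")
def _full_tokens(model_cfg: dict) -> dict:
    # No compression: return a copy with compression explicitly disabled.
    return dict(model_cfg, compression=None)


@register("seats")
def _seats(model_cfg: dict) -> dict:
    # Placeholder for a method-specific patch.
    return dict(model_cfg, compression="seats")


def apply_patch(method: str, model_cfg: dict) -> dict:
    """Look up the requested method and apply its patch, failing loudly on typos."""
    try:
        patch = _PATCHES[method]
    except KeyError:
        raise ValueError(
            f"unknown method {method!r}; choose from {sorted(_PATCHES)}"
        ) from None
    return patch(model_cfg)
```

With this pattern, adding a baseline means registering one more function, and the evaluation entry point stays unchanged.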
Full tokens
bash scripts/eval_qwen2_5_omni_full_tokens.sh
SEATS (our method)
To evaluate our SEATS method on the five benchmarks, use the following command:
bash scripts/eval_qwen2_5_omni_seats.sh
You can customize the compression settings by editing:
- scripts/eval_qwen2_5_omni_seats.sh: tasks_list (which benchmarks to run) and ratio_pairs (per-modality token retention budgets, swept over multiple settings).
- seats/config.yaml: SEATS method hyperparameters (e.g., progressive drop layers, late-block layer, window size).
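As a rough guide to what a retention setting means in tokens, the sketch below converts (visual ratio, audio ratio) pairs into kept-token counts. The pair format and the function name token_budgets are assumptions for illustration. The actual ratio_pairs syntax is defined in the shell script and may differ.

```python
from typing import List, Sequence, Tuple


def token_budgets(num_visual: int, num_audio: int,
                  ratio_pairs: Sequence[Tuple[float, float]]) -> List[Tuple[int, int]]:
    """Return (kept_visual, kept_audio) token counts for each assumed ratio pair."""
    budgets = []
    for visual_ratio, audio_ratio in ratio_pairs:
        for ratio in (visual_ratio, audio_ratio):
            if not 0.0 <= ratio <= 1.0:
                raise ValueError(f"retention ratio must be in [0, 1], got {ratio}")
        budgets.append((round(num_visual * visual_ratio),
                        round(num_audio * audio_ratio)))
    return budgets


if __name__ == "__main__":
    # Made-up token counts for one sample, swept over three settings.
    for kept in token_budgets(2000, 500, [(0.1, 0.1), (0.2, 0.1), (1.0, 1.0)]):
        print(kept)
```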
Baselines
We also provide the following scripts to evaluate the baseline methods adapted for omni-modal LLMs:
bash scripts/eval_qwen2_5_omni_random.sh # Random
bash scripts/eval_qwen2_5_omni_fastv.sh # FastV
bash scripts/eval_qwen2_5_omni_fastv_omni.sh # FastV-om
bash scripts/eval_qwen2_5_omni_visionzip.sh # VisionZip
bash scripts/eval_qwen2_5_omni_visionzip_omni.sh # VisionZip-om
... # more to be added
Repo Structure
SEATS/
├── scripts/                              # Shell entry points (one per method) + shared base
│   ├── base/
│   │   ├── setup.sh                      # Python dependency installation
│   │   └── eval_qwen2_5_omni_zip.sh      # Shared accelerate + lmms-eval launcher
│   ├── eval_qwen2_5_omni_seats.sh        # SEATS (our method)
│   └── ...
├── seats/                                # SEATS three-stage implementation
│   ├── pre_llm_units.py                  # Stage I: winDivPrune
│   ├── inner_llm_units.py                # Stage II: inner-LLM stage-adaptive selection
│   ├── ratio_decay_scheduler.py          # block-wise TRR decay schedule
│   ├── modeling_qwen2_5_omni_seats.py    # patched Thinker / TextModel forwards
│   └── config.yaml                       # SEATS hyperparameters
├── baselines/                            # Per-method patches; one subfolder per baseline
│   ├── utils.py                          # apply_zip_method_patch() dispatcher
│   ├── full_tokens/                      # No compression (config only)
│   ├── visionzip_omni/                   # VisionZip adapted for omni-modal
│   └── ...
├── models/qwen2_5_omni/                  # Vendored Qwen2.5-Omni model code
└── lmms-eval/                            # Vendored LMMs-Eval (registers `qwen2_5_omni_zip`)
Acknowledgement
This implementation relies on resources from Qwen2.5-Omni, Qwen3-Omni, LMMs-Eval, OmniZip, VisionZip, and DivPrune. We thank the original authors for their excellent contributions and for making their work publicly available.
Citation
If you find this work useful, please consider citing:
@article{xin2026seats,
title={Stage-adaptive Token Selection for Efficient Omni-modal LLMs},
author={Xin, Zijie and Yang, Jie and Zhao, Ruixiang and Wang, Tianyi and Rao, Fengyun and Lyu, Jing and Li, Xirong},
journal={arXiv preprint arXiv:2605.20035},
year={2026}
}
License
This project is licensed under the MIT License. For commercial licensing or any use beyond research, please contact the authors.
Contact for Issues
For any questions about this project (e.g., corrupted files or loading errors), please reach out at: xinzijie@ruc.edu.cn