README.md

June 10, 2026

SEATS

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

         
Zijie Xin1  Jie Yang2,📧  Ruixiang Zhao1  Tianyi Wang2  Fengyun Rao2  Jing Lyu2  Xirong Li1,📧
📧 Corresponding authors
1 Renmin University of China  2 WeChat Vision, Tencent Inc. 

📢 News

  • [2026/06/07] 🚀 Released SEATS code for Qwen2.5-Omni-7B, with LMMs-Eval adaptation and baselines.
  • [2026/05/19] 📄 Paper released on arXiv and project page is online.

👀 Overview

SEATS is a training-free, stage-adaptive token selection method for efficient omni-modal LLM inference. By analyzing layer-wise token dependency, it reveals that visual and audio dependencies follow a block-wise pattern and weaken with depth. SEATS removes spatiotemporal redundancy before the LLM, progressively prunes tokens inside the LLM, and fully removes non-textual tokens in late layers.
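To make the three stages concrete, the following minimal sketch tracks how many non-textual tokens enter each LLM layer. It is an illustration only, not the repository's implementation: the function name, the schedule format (`prune_at`), and every number are assumptions chosen for readability.

```python
def tokens_per_layer(n_nontext, num_layers, pre_llm_ratio, prune_at, late_layer):
    """Return the number of non-textual tokens entering each LLM layer.

    prune_at maps a layer index to the fraction of the *current* non-textual
    tokens kept from that layer onward (hypothetical schedule format).
    """
    # Stage 1: redundancy removal before the LLM.
    current = round(n_nontext * pre_llm_ratio)
    counts = []
    for layer in range(num_layers):
        if layer >= late_layer:
            # Stage 3: all remaining non-textual tokens are removed.
            current = 0
        elif layer in prune_at:
            # Stage 2: progressive pruning inside the LLM.
            current = round(current * prune_at[layer])
        counts.append(current)
    return counts


# Toy settings (assumed values, not the paper's configuration).
counts = tokens_per_layer(
    n_nontext=1000,
    num_layers=8,
    pre_llm_ratio=0.5,
    prune_at={2: 0.5, 4: 0.5},
    late_layer=6,
)
print(counts)  # [500, 500, 250, 250, 125, 125, 0, 0]
```

With these toy settings the count falls from 1000 to 500 before the LLM, halves at layers 2 and 4, and drops to zero from layer 6 onward.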

✨ Key Highlights

  • 💡 New Insight: Reveals a block-wise dependence pattern in omni-modal LLMs, where reliance on visual and audio tokens weakens with layer depth.
  • ⚡ Strong Efficiency: 9.3x FLOPs reduction and 4.8x prefill speedup at 10% token retention while preserving 96.3% performance.
  • 🎯 Stage-adaptive Design: Diversity-based pre-LLM selection + query-guided inner-LLM progressive pruning + late-layer full removal.
  • 🔌 Broad Compatibility: Plug-and-play and training-free for direct application to Qwen2.5-Omni-7B and Qwen3-Omni-30B.
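As a rough picture of the diversity-based pre-LLM selection, the sketch below greedily keeps tokens that are both highly attended and far (in cosine distance) from tokens already kept. The scoring rule (attention weight times minimum distance), the function names, and the toy data are assumptions for illustration; the actual SEATS criterion may differ.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 1.0
    return 1.0 - dot / norm


def select_diverse(features, attn, k):
    """Greedily pick up to k token indices from one temporal window."""
    if k <= 0 or not features:
        return []
    # Seed with the most attended token.
    selected = [max(range(len(features)), key=lambda i: attn[i])]
    while len(selected) < min(k, len(features)):
        best, best_score = None, float("-inf")
        for i in range(len(features)):
            if i in selected:
                continue
            # Distance to the closest token that is already kept.
            min_dist = min(cosine_distance(features[i], features[j]) for j in selected)
            score = attn[i] * min_dist
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)


# Toy window: tokens 0/1 are near-duplicates, as are tokens 2/3.
window = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
attn = [0.4, 0.3, 0.2, 0.1]
print(select_diverse(window, attn, k=2))  # [0, 2]
```

Token 1 has the second-highest attention but is nearly identical to token 0, so the sketch keeps token 2 instead.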

📅 TODO

  • Support Qwen2.5-Omni-7B
  • Release benchmark adaptation code for LMMs-Eval (WorldSense, Daily-Omni, OmniVideoBench, Video-MME, LVOmniBench)
  • Evaluation scripts and reproduction guide (adapted for LMMs-Eval)
  • Release more baseline implementations (FastV, VisionZip, Random)
  • Support Qwen3-Omni-30B
  • Release more baseline implementations (DivPrune, DyCoke, and OmniZip)
  • Future work: support more models (OmniVinci-7B)

🏗️ Method

SEATS is a three-stage method:

  1. Pre-LLM Token Selection: Removes spatiotemporal redundancy within each temporal window via attention-weighted diversity selection.
  2. Inner-LLM Token Selection: Progressively prunes tokens with a block-wise token retention ratio decay schedule and top-down budget allocation (inter-window then intra-window) guided by query relevance.
  3. Late-block Removal: Removes all remaining non-textual tokens in late layers where cross-modal fusion is complete.
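The top-down budget allocation in stage 2 can be pictured with a minimal sketch, assuming each non-textual token already has a query-relevance score. The budget is first split across temporal windows in proportion to their summed relevance (inter-window), then each window keeps its highest-scoring tokens (intra-window). The function names, the proportional rule, and the rounding scheme are assumptions, not the repository's code.

```python
def allocate_budget(window_scores, total_budget):
    """Inter-window step: split total_budget across windows by summed relevance."""
    totals = [sum(scores) for scores in window_scores]
    grand_total = sum(totals)
    if grand_total <= 0:
        raise ValueError("relevance scores must sum to a positive value")
    raw = [total_budget * t / grand_total for t in totals]
    # Start from the floor of each share, capped at the window size.
    budgets = [min(int(r), len(s)) for r, s in zip(raw, window_scores)]
    # Hand out the remaining slots, largest fractional remainder first.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    leftover = total_budget - sum(budgets)
    while leftover > 0:
        progressed = False
        for i in order:
            if leftover > 0 and budgets[i] < len(window_scores[i]):
                budgets[i] += 1
                leftover -= 1
                progressed = True
        if not progressed:  # every window is already full
            break
    return budgets


def select_tokens(window_scores, total_budget):
    """Intra-window step: keep the top-scoring tokens within each window's budget."""
    budgets = allocate_budget(window_scores, total_budget)
    kept = []
    for scores, k in zip(window_scores, budgets):
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        kept.append(sorted(top))
    return kept


# Three windows of four tokens each, with toy relevance scores.
scores = [[0.9, 0.1, 0.8, 0.2], [0.1, 0.1, 0.1, 0.2], [0.5, 0.4, 0.05, 0.05]]
kept = select_tokens(scores, total_budget=6)
print(kept)  # [[0, 2, 3], [3], [0, 1]]
```

The most query-relevant window receives three of the six slots, while the least relevant window keeps a single token.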

🔧 Dependencies and Installation

We used Anaconda to set up a deep learning workspace that supports PyTorch. Run the following script to install all the required packages.

# git clone this repository
git clone https://github.com/xxayt/SEATS.git
cd SEATS

# create a new anaconda env
conda create -n SEATS_env python=3.10 -y
conda activate SEATS_env

# install dependencies
bash scripts/base/setup.sh

# install the bundled lmms-eval in editable mode
cd lmms-eval
pip install -e .
cd ..

# (Recommended) install torch and flash-attn
# pip install torch==2.8.0 torchvision==0.23.0
pip install flash-attn --no-build-isolation
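
After installation, a quick sanity check can confirm that the key packages are importable. The sketch below uses only the standard library; `torch`, `flash_attn`, and `lmms_eval` are the usual import names for these packages, and the helper `check_packages` is hypothetical rather than part of this repository.

```python
import importlib.util


def check_packages(names):
    """Map each top-level module name to True/False without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(["torch", "flash_attn", "lmms_eval"])
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

Run it inside the activated `SEATS_env` environment; any `MISSING` entry suggests that the corresponding install step did not complete.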

🚀 Evaluation

We adapt 5 omni-modal benchmarks into LMMs-Eval, so you can run them directly through this repository. Please first download the corresponding annotation data and videos from the links below.

| Benchmark | Data | Videos | Task name |
| --- | --- | --- | --- |
| Daily-Omni | xxayt/Daily-Omni | liarliar/Daily-Omni | dailyomni |
| WorldSense | lmms-lab/WorldSense | lmms-lab/WorldSense | worldsense |
| OmniVideoBench | xxayt/OmniVideoBench | NJU-LINK/OmniVideoBench | omnivideobench |
| Video-MME | lmms-lab/Video-MME | lmms-lab/Video-MME | videomme |
| LVOmniBench | xxayt/LVOmniBench | KD-TAO/LVOmniBench | lvomnibench |
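
One possible way to fetch the data is sketched below. It assumes that the identifiers listed above are Hugging Face dataset repositories and that `huggingface_hub` is installed; the `data/` layout and the helper names are hypothetical, so check the evaluation scripts for the paths they actually expect.

```python
# Task name -> repositories, copied from the benchmark list above.
BENCHMARKS = {
    "dailyomni": {"data": "xxayt/Daily-Omni", "videos": "liarliar/Daily-Omni"},
    "worldsense": {"data": "lmms-lab/WorldSense", "videos": "lmms-lab/WorldSense"},
    "omnivideobench": {"data": "xxayt/OmniVideoBench", "videos": "NJU-LINK/OmniVideoBench"},
    "videomme": {"data": "lmms-lab/Video-MME", "videos": "lmms-lab/Video-MME"},
    "lvomnibench": {"data": "xxayt/LVOmniBench", "videos": "KD-TAO/LVOmniBench"},
}


def download_plan(tasks, root="data"):
    """Return (repo_id, local_dir) pairs for the given tasks, without duplicates."""
    plan, seen = [], set()
    for task in tasks:
        for kind in ("data", "videos"):
            repo_id = BENCHMARKS[task][kind]
            if repo_id not in seen:
                seen.add(repo_id)
                plan.append((repo_id, f"{root}/{repo_id}"))
    return plan


def download(tasks, root="data"):
    """Download every repository in the plan (needs network access)."""
    # Imported lazily so that the plan can be inspected without the package.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in download_plan(tasks, root):
        snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)


# Inspect what would be fetched before downloading anything.
for repo_id, local_dir in download_plan(["worldsense", "dailyomni"]):
    print(f"{repo_id} -> {local_dir}")
```

Calling `download(["worldsense", "dailyomni"])` would then fetch the three distinct repositories in that plan.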

Once the data is ready, launch evaluation with the scripts under scripts/. Results are written to output/. We implement qwen2_5_omni_zip as a unified LMMs-Eval model wrapper that dispatches to SEATS and all baselines for omni-modal LLM token compression.
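
The idea of a single wrapper that dispatches to several compression methods can be illustrated with a small registry pattern. This is a generic sketch, not the code in `baselines/utils.py`: the registry, the decorator, the function `apply_patch_sketch`, and the dictionary used as a stand-in for the model are all hypothetical.

```python
PATCHES = {}  # method name -> function that patches a model


def register(method_name):
    """Decorator that records a patch function under a method name."""
    def decorator(fn):
        PATCHES[method_name] = fn
        return fn
    return decorator


@register("full_tokens")
def patch_full_tokens(model, config):
    # No compression: the model is returned unchanged.
    return model


@register("random")
def patch_random(model, config):
    # Stand-in for installing a random token dropper with a retention ratio.
    model["compressor"] = ("random", config.get("ratio", 1.0))
    return model


def apply_patch_sketch(method_name, model, config):
    """Look up the requested method and apply its patch."""
    try:
        patch = PATCHES[method_name]
    except KeyError:
        raise ValueError(
            f"unknown method {method_name!r}; choose from {sorted(PATCHES)}"
        ) from None
    return patch(model, config)


model = apply_patch_sketch("random", {}, {"ratio": 0.1})
print(model)  # {'compressor': ('random', 0.1)}
```

With this pattern, adding a method only requires registering one more patch function; the evaluation entry point stays the same.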

Full tokens

bash scripts/eval_qwen2_5_omni_full_tokens.sh

SEATS (our method)

To evaluate our SEATS method on the five benchmarks, use the following command:

bash scripts/eval_qwen2_5_omni_seats.sh

You can customize the compression settings by editing:

  • scripts/eval_qwen2_5_omni_seats.sh: tasks_list (which benchmarks to run) and ratio_pairs (per-modality token retention budgets, swept over multiple settings).
  • seats/config.yaml: SEATS method hyperparameters (e.g., progressive drop layers, late-block layer, window size).
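
To see what a sweep over per-modality retention ratios implies, the sketch below converts each (video ratio, audio ratio) pair into token counts and an overall retention rate. The pair format, the helper name `token_budgets`, and the token counts are assumptions; the actual `ratio_pairs` syntax is defined in the script itself.

```python
def token_budgets(n_video, n_audio, ratio_pairs):
    """Return (video_ratio, audio_ratio, kept_video, kept_audio, overall) rows."""
    rows = []
    for video_ratio, audio_ratio in ratio_pairs:
        kept_video = round(n_video * video_ratio)
        kept_audio = round(n_audio * audio_ratio)
        overall = (kept_video + kept_audio) / (n_video + n_audio)
        rows.append((video_ratio, audio_ratio, kept_video, kept_audio, overall))
    return rows


# Toy input: 2000 video tokens and 500 audio tokens (assumed numbers).
rows = token_budgets(2000, 500, [(0.1, 0.1), (0.2, 0.1)])
for video_ratio, audio_ratio, kept_video, kept_audio, overall in rows:
    print(f"video={video_ratio} audio={audio_ratio} "
          f"-> keep {kept_video} + {kept_audio} tokens ({overall:.0%} overall)")
```

Because video tokens usually dominate the sequence, the overall retention rate tracks the video ratio much more closely than the audio ratio.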

Baselines

We also provide the following scripts to evaluate the baseline methods adapted for omni-modal LLMs:

bash scripts/eval_qwen2_5_omni_random.sh         # Random
bash scripts/eval_qwen2_5_omni_fastv.sh          # FastV
bash scripts/eval_qwen2_5_omni_fastv_omni.sh     # FastV-om
bash scripts/eval_qwen2_5_omni_visionzip.sh      # VisionZip
bash scripts/eval_qwen2_5_omni_visionzip_omni.sh # VisionZip-om

... # more to be added

📁 Repo Structure

SEATS/
├── scripts/                          # Shell entry points (one per method) + shared base
│   ├── base/
│   │   ├── setup.sh                  # Python dependency installation
│   │   └── eval_qwen2_5_omni_zip.sh  # Shared accelerate + lmms-eval launcher
│   ├── eval_qwen2_5_omni_seats.sh    # SEATS (our method)
│   └── ...
├── seats/                            # SEATS three-stage implementation
│   ├── pre_llm_units.py              # Stage I: winDivPrune
│   ├── inner_llm_units.py            # Stage II: inner-LLM stage-adaptive selection
│   ├── ratio_decay_scheduler.py      # block-wise TRR decay schedule
│   ├── modeling_qwen2_5_omni_seats.py # patched Thinker / TextModel forwards
│   └── config.yaml                   # SEATS hyperparameters
├── baselines/                        # Per-method patches; one subfolder per baseline
│   ├── utils.py                      # apply_zip_method_patch() dispatcher
│   ├── full_tokens/                  # No compression (config only)
│   ├── visionzip_omni/               # VisionZip adapted for omni-modal
│   └── ...
├── models/qwen2_5_omni/              # Vendored Qwen2.5-Omni model code
└── lmms-eval/                        # Vendored LMMs-Eval (registers `qwen2_5_omni_zip`)

🀝 Acknowledgement

This implementation relies on resources from Qwen2.5-Omni, Qwen3-Omni, LMMs-Eval, OmniZip, VisionZip, and DivPrune. We thank the original authors for their excellent contributions and for making their work publicly available.

✏️ Citation

If you find this work useful, please consider citing:

@article{xin2026seats,
  title={Stage-adaptive Token Selection for Efficient Omni-modal LLMs},
  author={Xin, Zijie and Yang, Jie and Zhao, Ruixiang and Wang, Tianyi and Rao, Fengyun and Lyu, Jing and Li, Xirong},
  journal={arXiv preprint arXiv:2605.20035},
  year={2026}
}

📜 License

This project is licensed under the MIT License. For commercial licensing or any use beyond research, please contact the authors.

📬 Contact for Issues

For any questions about this project (e.g., corrupted files or loading errors), please reach out at: xinzijie@ruc.edu.cn