Awesome MLLM Compression
From Data to Model: A Survey of the Compression Lifecycle in MLLMs
Hao Wu*,1,
Junlong Tong*,1,2,
Xudong Wang1,
Yang Tan3,
Changyu Zeng1,
Anastasia Antsiferova4,
Xiaoyu Shenβ ,1
1Institute of Digital Twin, Eastern Institute of Technology, Ningbo
2Shanghai Jiao Tong University, 3Southeast University, 4Innopolis University
* Core Contribution, β Corresponding Author.
Contact: haowu.ai.research@gmail.com, xyshen@eitech.edu.cn
If you find our paper or this resource helpful, please consider citing:
@article{Wu_2026,
title={From Data to Model: A Survey of the Compression Lifecycle in MLLMs},
url={http://dx.doi.org/10.36227/techrxiv.177220375.55495124/v1},
DOI={10.36227/techrxiv.177220375.55495124/v1},
publisher={Institute of Electrical and Electronics Engineers (IEEE)},
author={Wu, Hao and Tong, Junlong and Wang, Xudong and Tan, Yang and Zeng, Changyu and Antsiferova, Anastasia and Shen, Xiaoyu},
year={2026},
month=feb
}
Important
We actively maintain this repository and welcome community contributions.
If you would like to:
- Add newly released MLLM compression papers
- Propose refinements to our taxonomy
- Correct or update existing entries
- Discuss classification or methodology
Please submit a pull request or contact the authors.
- [2026.02.27] The preprint is now published!
- Lifecycle perspective for MLLM compression: We introduce a Data-to-Model view that organizes compression methods according to where compression occurs in the MLLM pipeline, including the Input, Encoder, Projector, and LLM stages.
- Five fundamental compression operations: We distill existing methods into five fundamental operations: Dropping, Aggregation, Encoding, Resampling, and Skipping, providing a unified abstraction for analyzing compression strategies.
- Joint compression across efficiency dimensions: We advocate jointly considering token compression, operation compression, and KV cache compression as complementary strategies for improving the efficiency of MLLMs.
- Cross-level compression coordination: We advocate that coordinated compression across multiple pipeline levels provides a more effective way to balance efficiency and model performance.
- Beyond efficiency-oriented compression: We argue that compression should not be viewed solely as an efficiency technique, but also as a design principle that can reshape representations, architectures, and multimodal processing in MLLMs.
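To make the five fundamental operations above concrete, here is a minimal NumPy sketch of each one applied to a toy sequence of visual tokens. The token counts, the importance score, and the random "queries" are illustrative assumptions only, not taken from any paper in this list.

```python
import numpy as np

# Toy token sequence: 16 visual tokens with dimension 4.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 4))
scores = np.abs(tokens).mean(axis=1)  # stand-in importance score

# Dropping: discard low-importance tokens, keep the top-k.
keep = np.sort(np.argsort(scores)[-8:])
dropped = tokens[keep]                              # (8, 4)

# Aggregation: merge neighboring tokens by mean pooling.
aggregated = tokens.reshape(8, 2, 4).mean(axis=1)   # (8, 4)

# Encoding: project tokens into a lower-dimensional space.
W = rng.standard_normal((4, 2))
encoded = tokens @ W                                # (16, 2)

# Resampling: map N tokens to a fixed budget of M query tokens
# via cross-attention (queries are random here, just for shape).
queries = rng.standard_normal((4, 4))
attn = np.exp(queries @ tokens.T)
attn /= attn.sum(axis=1, keepdims=True)
resampled = attn @ tokens                           # (4, 4)

# Skipping: compute a layer only for tokens above a threshold;
# the rest bypass it unchanged.
active = scores > np.median(scores)
out = tokens.copy()
out[active] = out[active] * 2.0  # stand-in for an expensive layer
```

Note how dropping and aggregation reduce the token count, encoding reduces the token dimension, resampling fixes the output budget regardless of input length, and skipping reduces computation without changing the sequence at all.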
- News: Latest updates, news, and announcements.
- Highlights: Core insights and perspectives that this survey aims to emphasize.
- Tag Description: Brief explanation of tags in this repository.
- Libraries: A collection of MLLM compression papers compiled in this repository.
- License: License information for this repository.
- Acknowledgments: Credits to projects and contributors that inspired or supported this work.
- Contact: Contact information for questions, feedback, or collaboration.
- Related Projects: Research projects from our group (EIT-NLP) related to MLLM compression.
for preprint papers.
for conference or journal papers.
for GitHub repositories.
for research areas (primarily categorized by modality).
for compression positions (i.e., Input, Encoder, Projector, LLM)
for compression operation types (i.e., Dropping, Aggregation, Encoding, Resampling, Skipping)
for specific compression mechanisms (the third level in our taxonomy).
for compression dimensions (i.e., Token Compression, Operation Compression, KV Cache Compression)
for training cost (i.e., Training-Free, Retraining, Post-Training).
Please browse the papers by selecting the sub-area you are interested in. Within each sub-area, papers are organized according to our compression taxonomy. The main page lists all survey papers, together with papers from major conferences (i.e., ICML, NeurIPS, ICLR, CVPR, ICCV, ECCV, ACL, EMNLP, NAACL) over the past year and papers released within the last six months. Papers already included in the past year's major-conference list are excluded from the recent papers.
| Title & Authors & Links | Date | Taxonomy | Highlight |
|---|---|---|---|
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng, Zonghao Chen, Ke Chen, Lidan Shou, Gang Chen, Huan Li | 26.04.07 | | |
From Data to Model: A Survey of the Compression Lifecycle in MLLMs Hao Wu, Junlong Tong, Xudong Wang, Yang Tan, Changyu Zeng, Anastasia Antsiferova, Xiaoyu Shen | 26.02.27 | Compression Position & Compression Operation & Mechanism | Compression Lifecycle |
 Compression Tells Intelligence: Visual Coding, Visual Token Technology, and the Unification Xin Jin, Jinming Liu, Yuntao Wei, Junyan Lin, Zhicheng Wang, Jianguo Huang, Xudong Yang, Yanxiao Liu, Wenjun Zeng | 26.01.28 | Codec & Token Technology | Compression as Intelligence |
Towards Efficient Multimodal Large Language Models: A Survey on Token Compression Linli Yao, Long Xing, Yang Shi, Sida Li, Yuanxin Liu, Yuhao Dong, Yi-Fan Zhang, Lei Li, Qingxiu Dong, Xiaoyi Dong, Qidong Huang, Haotian Wang, Feng Wu, Yuanxing Zhang, Pengfei Wan, Zhouchen Lin, Xu Sun | 26.01.12 | Compression Position & Mechanism | - |
 Revisiting MLLM Token Technology through the Lens of Classical Visual Coding Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin | 25.08.19 | Codec & Token Technology | - |
A Survey of Token Compression for Efficient Multimodal Large Language Models Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang | 25.07.27 | Modality & Mechanism | Modality-centric |
Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik | 25.05.23 | Compression operation | Compression Beyond Efficiency |
Image (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
Video (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
Audio
| Title & Authors & Links | Areas | Tags |
|---|---|---|
Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models Umberto Cappellazzo, Xubo Liu, Pingchuan Ma, Stavros Petridis, Maja Pantic |  |  |
Segmentwise Pruning in Audio-Language Models Marcel Gibier, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre |  |  |
Towards Audio Token Compression in Large Audio Language Models Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass |  |  |
3D
| Title & Authors & Links | Areas | Tags |
|---|---|---|
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, Yung-Yao Chen |  |  |
HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models Liheng Zhang, Jin Wang, Hui Li, Bingfeng Zhang, Weifeng Liu |  |  |
Omni (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
CVPR 2026 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking Hao Wu, Xudong Wang, Jialiang Zhang, Junlong Tong, Xinghao Chen, Junyan Lin, Yunpu Ma, Xiaoyu Shen |  |  |
Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving Minhao Xiong, Zichen Wen, Zhuangcheng Gu, Xuyang Liu, Rui Zhang, Hengrui Kang, Jiabing Yang, Junyuan Zhang, Weijia Li, Conghui He, Yafei Wang, Linfeng Zhang |  |  |
ICLR 2026 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit Hao Wu, Yingqi Fan, Jinyang Dai, Junlong Tong, Yunpu Ma, Xiaoyu Shen |  |  |
EMNLP 2025 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding Mengyue Wang, Shuo Chen, Kristian Kersting, Volker Tresp, Yunpu Ma |  |  |
NeurIPS 2025 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
FastVID: Dynamic Density Pruning for Fast Video Large Language Models Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding |  |  |
VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models Haichao Zhang, Yun Fu |  |  |
Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior Yulin Li, Haokun Gui, Ziyang Fan, Junjie Wang, Bin Kang, Bin Chen, Zhuotao Tian |  |  |
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra |  |  |
ICCV 2025 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning Lizhen Xu, Xiuxiu Bai, Xiaojun Jia, Jianwu Fang, Shanmin Pang |  |  |
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon |  |  |
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin |  |  |
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan |  |  |
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration Mark Endo, Xiaohan Wang, Serena Yeung-Levy |  |  |
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang |  |  |
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang |  |  |
ICML 2025 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang |  |  |
ACL 2025 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
PruneVid: Visual Token Pruning for Efficient Video Large Language Models Xiaohu Huang, Hao Zhou, Kai Han |  |  |
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro |  |  |
NAACL 2025
| Title & Authors & Links | Areas | Tags |
|---|---|---|
LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression Souvik Kundu, Anahita Bhiwandiwalla, Sungduk Yu, Phillip Howard, Tiep Le, Sharath Nittur Sridhar, David Cobbley, Hao Kang, Vasudev Lal |  |  |
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang |  |  |
LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models Yizheng Sun, Yanze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, Riza Batista-Navarro |  |  |
This project is released under the MIT License.
This repository is inspired by Awesome-Multimodal-Token-Compression, Awesome-Latent-CoT, and Awesome-Efficient-LLM.
For questions, suggestions, or collaboration opportunities, please feel free to reach out: