Awesome MLLM Compression
From Data to Model: A Survey of the Compression Lifecycle in MLLMs
Hao Wu*,1,
Junlong Tong*,1,2,
Xudong Wang1,
Yang Tan3,
Changyu Zeng1,
Anastasia Antsiferova4,
Xiaoyu Shenβ ,1
1Institute of Digital Twin, Eastern Institute of Technology, Ningbo
2Shanghai Jiao Tong University, 3Southeast University, 4Innopolis University
* Core Contribution, β Corresponding Author.
Contact: haowu.ai.research@gmail.com, xyshen@eitech.edu.cn
If you find our paper or this resource helpful, please consider citing:
@article{Wu_2026,
title={From Data to Model: A Survey of the Compression Lifecycle in MLLMs},
url={http://dx.doi.org/10.36227/techrxiv.177220375.55495124/v1},
DOI={10.36227/techrxiv.177220375.55495124/v1},
publisher={Institute of Electrical and Electronics Engineers (IEEE)},
author={Wu, Hao and Tong, Junlong and Wang, Xudong and Tan, Yang and Zeng, Changyu and Antsiferova, Anastasia and Shen, Xiaoyu},
year={2026},
month=feb
}
Important
We actively maintain this repository and welcome community contributions.
If you would like to:
- Add newly released MLLM compression papers
- Propose refinements to our taxonomy
- Correct or update existing entries
- Discuss classification or methodology
Please submit a pull request or contact the authors.
- [2026.02.27] The preprint is now published!
- Lifecycle perspective for MLLM compression: We introduce a Data-to-Model view that organizes compression methods according to where compression occurs in the MLLM pipeline, including the Input, Encoder, Projector, and LLM stages.
- Five fundamental compression operations: We distill existing methods into five fundamental operations: Dropping, Aggregation, Encoding, Resampling, and Skipping, providing a unified abstraction for analyzing compression strategies.
- Joint compression across efficiency dimensions: We advocate jointly considering token compression, operation compression, and KV cache compression as complementary strategies for improving the efficiency of MLLMs.
- Cross-level compression coordination: We advocate that coordinated compression across multiple pipeline levels provides a more effective way to balance efficiency and model performance.
- Beyond efficiency-oriented compression: We argue that compression should not be viewed solely as an efficiency technique, but also as a design principle that can reshape representations, architectures, and multimodal processing in MLLMs.
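To make the five fundamental operations above concrete, here is a minimal NumPy sketch of each one applied to a toy sequence of visual tokens. The token counts, the importance score, and the random "queries" are illustrative assumptions only, not taken from any paper in this list.

```python
import numpy as np

# Toy token sequence: 16 visual tokens with dimension 4.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 4))
scores = np.abs(tokens).mean(axis=1)  # stand-in importance score

# Dropping: discard low-importance tokens, keep the top-k.
keep = np.sort(np.argsort(scores)[-8:])
dropped = tokens[keep]                              # (8, 4)

# Aggregation: merge neighboring tokens by mean pooling.
aggregated = tokens.reshape(8, 2, 4).mean(axis=1)   # (8, 4)

# Encoding: project tokens into a lower-dimensional space.
W = rng.standard_normal((4, 2))
encoded = tokens @ W                                # (16, 2)

# Resampling: map N tokens to a fixed budget of M query tokens
# via cross-attention (queries are random here, just for shape).
queries = rng.standard_normal((4, 4))
attn = np.exp(queries @ tokens.T)
attn /= attn.sum(axis=1, keepdims=True)
resampled = attn @ tokens                           # (4, 4)

# Skipping: compute a layer only for tokens above a threshold;
# the rest bypass it unchanged.
active = scores > np.median(scores)
out = tokens.copy()
out[active] = out[active] * 2.0  # stand-in for an expensive layer
```

Note how dropping and aggregation reduce the token count, encoding reduces the token dimension, resampling fixes the output budget regardless of input length, and skipping reduces computation without changing the sequence at all.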
- News: Latest updates, news, and announcements.
- Highlights: Core insights and perspectives that this survey aims to emphasize.
- Tag Description: Brief explanation of tags in this repository.
- Libraries: A collection of MLLM compression papers compiled in this repository.
- License: License information for this repository.
- Acknowledgments: Credits to projects and contributors that inspired or supported this work.
- Contact: Contact information for questions, feedback, or collaboration.
- Related Projects: Research projects from our group (EIT-NLP) related to MLLM compression.
for preprint papers.
for conference or journal papers.
for GitHub repositories.
for research areas (primarily categorized by modality).
for compression positions (i.e., Input, Encoder, Projector, LLM)
for compression operation types (i.e., Dropping, Aggregation, Encoding, Resampling, Skipping)
for specific compression mechanisms (the third level in our taxonomy).
for compression dimensions (i.e., Token Compression, Operation Compression, KV Cache Compression)
for training cost (i.e., Training-Free, Retraining, Post-Training).
Please browse the papers by selecting the sub-area you are interested in. Within each sub-area, papers are organized according to our compression taxonomy. The main page lists all survey papers, together with papers from major conferences (i.e., ICML, NeurIPS, ICLR, CVPR, ICCV, ECCV, ACL, EMNLP, NAACL) over the past year and papers released within the last six months. Papers already included in the past year's major-conference list are excluded from the recent papers.
| Title & Authors & Links | Date | Taxonomy | Highlight |
|---|---|---|---|
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng, Zonghao Chen, Ke Chen, Lidan Shou, Gang Chen, Huan Li | 26.04.07 | | |
From Data to Model: A Survey of the Compression Lifecycle in MLLMs Hao Wu, Junlong Tong, Xudong Wang, Yang Tan, Changyu Zeng, Anastasia Antsiferova, Xiaoyu Shen | 26.02.27 | Compression Position & Compression Operation & Mechanism | Compression Lifecycle |
 Compression Tells Intelligence: Visual Coding, Visual Token Technology, and the Unification Xin Jin, Jinming Liu, Yuntao Wei, Junyan Lin, Zhicheng Wang, Jianguo Huang, Xudong Yang, Yanxiao Liu, Wenjun Zeng | 26.01.28 | Codec & Token Technology | Compression as Intelligence |
Towards Efficient Multimodal Large Language Models: A Survey on Token Compression Linli Yao, Long Xing, Yang Shi, Sida Li, Yuanxin Liu, Yuhao Dong, Yi-Fan Zhang, Lei Li, Qingxiu Dong, Xiaoyi Dong, Qidong Huang, Haotian Wang, Feng Wu, Yuanxing Zhang, Pengfei Wan, Zhouchen Lin, Xu Sun | 26.01.12 | Compression Position & Mechanism | - |
 Revisiting MLLM Token Technology through the Lens of Classical Visual Coding Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin | 25.08.19 | Codec & Token Technology | - |
A Survey of Token Compression for Efficient Multimodal Large Language Models Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang | 25.07.27 | Modality & Mechanism | Modality-centric |
Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik | 25.05.23 | Compression operation | Compression Beyond Efficiency |
Image (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
Video (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
Audio
| Title & Authors & Links | Areas | Tags |
|---|---|---|
Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models Umberto Cappellazzo, Xubo Liu, Pingchuan Ma, Stavros Petridis, Maja Pantic |  |  |
Segmentwise Pruning in Audio-Language Models Marcel Gibier, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre |  |  |
Towards Audio Token Compression in Large Audio Language Models Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass |  |  |
3D
| Title & Authors & Links | Areas | Tags |
|---|---|---|
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, Yung-Yao Chen |  |  |
HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models Liheng Zhang, Jin Wang, Hui Li, Bingfeng Zhang, Weifeng Liu |  |  |
Omni (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
CVPR 2026 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking Hao Wu, Xudong Wang, Jialiang Zhang, Junlong Tong, Xinghao Chen, Junyan Lin, Yunpu Ma, Xiaoyu Shen |  |  |
Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving Minhao Xiong, Zichen Wen, Zhuangcheng Gu, Xuyang Liu, Rui Zhang, Hengrui Kang, Jiabing Yang, Junyuan Zhang, Weijia Li, Conghui He, Yafei Wang, Linfeng Zhang |  |  |
ICLR 2026 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit Hao Wu, Yingqi Fan, Jinyang Dai, Junlong Tong, Yunpu Ma, Xiaoyu Shen |  |  |
EMNLP 2025 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding Mengyue Wang, Shuo Chen, Kristian Kersting, Volker Tresp, Yunpu Ma |  |  |
NeurIPS 2025 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
FastVID: Dynamic Density Pruning for Fast Video Large Language Models Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding |  |  |
VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models Haichao Zhang, Yun Fu |  |  |
Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior Yulin Li, Haokun Gui, Ziyang Fan, Junjie Wang, Bin Kang, Bin Chen, Zhuotao Tian |  |  |
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra |  |  |
ICCV 2025 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning Lizhen Xu, Xiuxiu Bai, Xiaojun Jia, Jianwu Fang, Shanmin Pang |  |  |
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon |  |  |
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin |  |  |
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan |  |  |
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration Mark Endo, Xiaohan Wang, Serena Yeung-Levy |  |  |
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang |  |  |
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang |  |  |
ICML 2025 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang |  |  |
ACL 2025 (TODO)
| Title & Authors & Links | Areas | Tags |
|---|---|---|
PruneVid: Visual Token Pruning for Efficient Video Large Language Models Xiaohu Huang, Hao Zhou, Kai Han |  |  |
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro |  |  |
NAACL 2025
| Title & Authors & Links | Areas | Tags |
|---|---|---|
LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression Souvik Kundu, Anahita Bhiwandiwalla, Sungduk Yu, Phillip Howard, Tiep Le, Sharath Nittur Sridhar, David Cobbley, Hao Kang, Vasudev Lal |  |  |
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang |  |  |
LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models Yizheng Sun, Yanze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, Riza Batista-Navarro |  |  |
This project is released under the MIT License.
This repository is inspired by Awesome-Multimodal-Token-Compression, Awesome-Latent-CoT, and Awesome-Efficient-LLM.
For questions, suggestions, or collaboration opportunities, please feel free to reach out: