Towards Efficient Multimodal Large Language Models: A Survey on Token Compression
June 9, 2026
📢 Contributions Welcome
We appreciate contributions that help improve this repository and the accompanying paper. Please feel free to submit a pull request to:
- Add a missing or relevant paper.
- Propose a more suitable category or tag.
- Update or correct information (links, metadata, status).
- Address potential issues in benchmarking.
- Request clarification or report an issue.
Thank you — every suggestion helps make this resource more useful.
⭐ If you find this repo useful, please give us a star :)
✒️ Table of Contents
- News: Latest Updates, News, and Announcements.
- About: Overview and Objectives.
- Tag Description: Brief Explanation of Tags in Paper Table.
- Paper Table: Paper Index (by Year, Descending).
- Benchmark: An overview of our proposed benchmark for MLLM token compression.
- Citation: If you find this helpful, please consider citing us.
🔥 News
- [2025.12.18] We've released the first version (v1.0) of the survey, which can be downloaded here.
- [2025.11.26] We've released the repository!
☀️ About
Multimodal Large Language Models (MLLMs) are rapidly expanding their capabilities, but high-resolution images and long videos produce extremely long visual-token streams that sharply increase compute, memory, and latency requirements. This repository accompanies our survey, Towards Efficient Multimodal Large Language Models: A Survey on Token Compression (TechRxiv), to help researchers and practitioners navigate this field.
Motivation. Token compression reduces the number of visual tokens processed by MLLMs while preserving critical cross-modal semantics, enabling more efficient training and faster inference without large accuracy regressions. The field is fragmented across encoders, projectors, and LLM-side techniques; a centralized, searchable resource is needed.
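To make the motivation concrete, the sketch below counts the visual tokens a plain ViT-style patch grid would produce and then keeps only the highest-scoring fraction of them. It is a minimal illustration, not a method from any listed paper: the function names, the 336-pixel input with 14-pixel patches, the 32-frame clip, and the importance scores are all assumptions chosen for the example.

```python
def visual_token_count(height, width, patch_size, num_frames=1):
    """Tokens from a plain patch grid (hypothetical helper; no class token)."""
    return (height // patch_size) * (width // patch_size) * num_frames


def prune_top_k(scores, keep_ratio):
    """Return indices of the highest-scoring tokens, in original order.

    `scores` stands in for any per-token importance signal (assumed given).
    """
    keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# One 336x336 image with 14x14 patches gives a 24x24 grid.
image_tokens = visual_token_count(336, 336, 14)                 # 576
# A 32-frame clip at the same resolution multiplies that by 32.
video_tokens = visual_token_count(336, 336, 14, num_frames=32)  # 18432

# Toy importance scores for 8 tokens; keep the top 25%.
toy_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.7, 0.4]
kept = prune_top_k(toy_scores, keep_ratio=0.25)                 # [1, 3]
```

Returning the indices in their original order keeps the surviving tokens in their spatial or temporal sequence.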
Target audience. Researchers, implementers, and system designers working on multimodal models, retrieval, efficient vision-language pipelines, and deployment at scale.
What this repo provides.
- A curated, searchable, chronologically organized paper index.
- Short annotated entries with metadata (method family, compression ratio, re-train vs. plug-in, modality).
- Links to code, checkpoints, and reproducibility notes where available.
- An overview of our proposed evaluation suite and benchmarks for MLLM token compression.
- Contribution guidelines and templates for adding papers or methods.
Feel free to browse the table, open issues, or contribute entries to help grow a rigorous, practical ecosystem for efficient multimodal modeling.
📋 Tag Description
- red for arXiv papers
- blue for conference/journal papers
- white for GitHub repositories
- purple for modality
- cyan for compression position
- brightgreen for whether it is text query-based
- lightgrey for compression methods: merge or pruning
- yellow for usage mode: re-train or plug-in
- orange for acceleration stage: train stage or inference stage
- pink for compression ratio: fix or dynamic
- yellowgreen for usage stage
📚 Paper Table
| Title & Authors | Date | Links | Modality & Position | Tags |
|---|---|---|---|---|
StateKV: Linear Scaling Video VLMs for Long Video Understanding Cristobal Eyzaguirre, Jiajun Wu, Juan Carlos Niebles | 2026/05 | - | ||
EGOSTREAM: A Diagnostic Benchmark for Streaming Episodic Memory in Egocentric Vision Rosario Forte, Giuseppe Lando, Antonino Furnari | 2026/05 | - | - | |
VisionPulse: Dynamic Visual Sparsity for Efficient Multimodal Reasoning Hengbo Xu, Shengjie Jin, Yanbiao Ma, Zhiwu Lu | 2026/05 | - | ||
EarlyTom: Early Token Compression Completes Fast Video Understanding Hesong Wang, Xin Jin, Lu Lu, Chenhaowen Li, Jian Chen, Qiang Liu, Huan Wang | 2026/05 | Project | ||
SEATS: Stage-adaptive Token Selection for Efficient Omni-modal LLMs Zijie Xin, Jie Yang, Ruixiang Zhao, Tianyi Wang, Fengyun Rao, Jing Lyu, Xirong Li | 2026/05 | GitHub | ||
Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models (VIF) Xinpeng Dong, Min Zhang, Kairong Han, Xu Tan, Fei Wu, Kun Kuang | 2026/05 | GitHub | ||
LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs Jihwan Kim, Nikhil Parthasarathy, Danfeng Qin, Junhwa Hur, Deqing Sun, Bohyung Han, Ming-Hsuan Yang, Boqing Gong | 2026/05 | Project | ||
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models Mingzhe Huang, Weijun Wang, Xin Ding, Liang Mi, Hao Wen, Yuanchun Li, Lichen Pang, Shansong Yang, Yunxin Liu, Ting Cao | 2026/05 | - | ||
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models (LiteLVLM) Sangin Lee, Yukyung Choi | 2026/05 | GitHub | ||
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs? Kechen Fang, Yihua Qin, Chongyi Wang, Wenshuo Ma, Tianyu Yu, Yuan Yao | 2026/05 | - | ||
ETCTrack: An Efficient Token Compression Framework for Visual Object Tracking Weijing Wu, Qihua Liang, Bineng Zhong, Haiying Xia, Zhiyi Mo, Shuxiang Song | 2026/05 | GitHub | ||
SAVEMem: Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding Hang Wu, Sherin Mary Mathews, Yujun Cai, Ming-Hsuan Yang, Yiwei Wang | 2026/05 | - | ||
VLMaxxing through FrameMogging: Training-Free Anti-Recomputation for Video Vision-Language Models JF Bastien, Sam D'Amico | 2026/05 | GitHub | ||
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models Rinyoichi Takezoe, Yi Li, Zhe Bo, Atsuhiro Hou, Guang Meng, Keisuke Long | 2026/04 | - | ||
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs (SToP) Kibum Kim, Joonhwan Kim, Kyoungmin Min, Yuxuan Wang, Jeonghun Moon, Julian McAuley, Chanwoo Park | 2026/04 | GitHub | ||
ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving Lin Sha, Haiyun Guo, Tao Wang, Cong Zhang, Min Huang, Jinqiao Wang, Qinghai Miao | 2026/04 | - | ||
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling Jiatao Qu, Fengwei Zhou, Wenjing Li, Tao Wu, Guanxiong Xue, Zhicheng Zhao, Dong Wei, Yu Lu, Byunghan Na | 2026/04 | - | ||
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding (XComp) Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat, Yu-Xiong Wang | 2026/04 | GitHub | ||
MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis Simin Huo, Ni Li | 2026/04 | GitHub | ||
VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization Andrei Atanov, Jesse Allardice, Roman Bachmann, Oğuzhan Fatih Kar, Devon Hjelm, David Griffiths, Peter Fu, Afshin Dehghan, Amir Zamir | 2026/04 | Project | ||
Tango: Taming Visual Signals for Efficient Video Large Language Models Shukang Yin, Sirui Zhao, Hanchao Wang, Baozhi Jia, Xianquan Wang, Chaoyou Fu, Enhong Chen | 2026/04 | GitHub | ||
Do Vision Language Models Need to Process Image Tokens? Soumya Ghosh, R.V. Babu, Chirag Agarwal | 2026/04 | - | ||
Tempo: Small Vision-Language Models are Smart Compressors for Long Video Understanding Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu | 2026/04 | Project | ||
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models Qihui Zhu, Tao Zhang, Yuchen Wang, Zijian Wen, Mengjie Zhang, Shuangwu Chen, Xiaobin Tan, Jian Yang, Yang Liu, Zhenhua Dong, Xianzhi Yu, Yinfei Pan | 2026/04 | - | ||
ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference Zhaohong Huang, Wei Liu, Yuxin Zhang, Fei Chao, Rongrong Ji | 2026/04 | - | ||
MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs Jiameng Li, Aleksei Tiulpin, Matthew B. Blaschko | 2026/04 | - | ||
GroundVTS: Visual Token Sampling in MLLMs for Video Temporal Grounding Rong Fan, Kaiyan Xiao, Minghao Zhu, Liuyi Wang, Kai Dai, Zhao Yang | 2026/04 | GitHub | ||
PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding Yanjie Zhou, Yuxin Zhang, Jun Chen et al. | 2026/04 | GitHub | ||
Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention Junhao Du et al. | 2026/03 | - | ||
AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding Haozhe Qi, Kevin Qu, Mahdi Rad et al. | 2026/03 | Project | ||
SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering Wenli Li, Kai Zhao, Haoran Jiang et al. | 2026/03 | - | ||
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee | 2026/03 | - | ||
Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs (AwaRes) Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz | 2026/03 | Project | ||
Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models Sijie Li, Biao Qian, Jungong Han | 2026/03 | - | ||
Balancing Saliency and Coverage: Semantic Prominence-Aware Budgeting for Visual Token Compression in VLMs Jaehoon Lee, Mingi Jung, Soohyuk Jang, Seungryong Yoo, Dahuin Jung, Sungroh Yoon | 2026/03 | - | ||
Flash-Unified: A Training-Free and Task-Aware Acceleration Framework for Native Unified Models Junlong Ke, Zichen Wen, Boxue Yang, Yantai Yang, Xuyang Liu, Chenfei Liao, Zhaorun Chen, Shaobo Wang, Linfeng Zhang | 2026/03 | GitHub | ||
DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression Bingzhou Li, Tao Huang | 2026/03 | GitHub | ||
ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference Surendra Pathak, Bo Han | 2026/03 | - | ||
AutoGaze: Attend Before Attention — Efficient and Scalable Video Understanding via Autoregressive Gazing Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin | 2026/03 | Project | ||
ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao | 2026/03 | GitHub | ||
E-AdaPrune: Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models Jialuo He, Huangxun Chen | 2026/03 | - | ||
SemVID: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding Jiaqi Li, Shuntian Zheng, Yixian Shen, Xiangru Jian, Zhiqi Li, Yuncheng Li | 2026/03 | - | ||
EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs Yuhao Chen, Bin Shan, Xin Ye, Shu Wang, Jiashu Zhang | 2026/03 | - | ||
AOT: Token Reduction via Local and Global Contexts Optimization for Efficient Video LLMs Jinlong Li, Xinyu Li, Trong-Tung Nguyen, Yong Jae Lee, Nicu Sebe | 2026/03 | Project | ||
AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in LVLMs Changwoo Baek, Jouwon Song, Seunghun Lee, Jong-Ok Kim | 2026/03 | GitHub | ||
TC-SSA: Token Compression via Semantic Slot Aggregation for Gigapixel Pathology Reasoning Zhuo Chen, Shawn Young, Di Liu, Yuan Zhang, Wentao Zheng, Yilin Jia, Lijian Xu | 2026/03 | Code | ||
Stateful Token Reduction for Long-Video Hybrid VLMs Jindong Jiang, Amala Sanjay Deshmukh, Kateryna Chumachenko, Karan Sapra, Zhiding Yu, Guilin Liu | 2026/03 | - | ||
PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models Mouxiao Huang, Borui Jiang, Dehua Zheng, Hailin Hu, Kai Han, Xinghao Chen | 2026/02 | GitHub | ||
HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit Hao Wu, Yingqi Fan, Jinyang Dai, Junlong Tong, Yunpu Ma, Xiaoyu Shen | 2026/02 | GitHub | ||
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao | 2026/02 | - | - | |
OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport Xiwen Chen, Wenhui Zhu, Gen Li, Xuanzhao Dong, Yujian Xiong, Hao Wang | 2026/02 | GitHub | ||
ApET: Approximation-Error Guided Token Compression for Efficient VLMs Qiankun Ma, Ziyao Zhang, Haofei Wang, Jie Chen, Zhen Song, Hairong Zheng | 2026/02 | GitHub | ||
DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference Aditya Kumar Singh, Hsin-Pai Cheng, Sairam Sundaresan, Bhavin Jawade, Karthikeyan Saravanan, Saransh Rajput, Jinjun Xiong, Syed Zawad, Elias B. Khalil, Vijaykrishnan Narayanan | 2026/02 | GitHub | ||
EntropyPrune: Matrix Entropy Guided Visual Token Pruning for MLLMs Yahong Wang, Juncheng Wu, Zhangkai Ni, Chengmei Yang, Yihang Liu, Longzhen Yang | 2026/02 | GitHub | ||
IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs Yifan Tan, Yifu Sun, Shirui Huang, Hong Liu, Guanghua Yu, Jianchen Zhu | 2026/02 | - | ||
Vision Token Reduction via Attention-Driven Self-Compression for Efficient MLLMs Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan | 2026/02 | - | ||
SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Ho Gun Park, Il Yong Chun | 2026/02 | - | ||
TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models Xiangtian Zheng, Zishuo Wang, Yuxin Peng | 2026/02 | - | ||
FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang, Zhuotao Tian | 2026/02 | GitHub | ||
Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning Enwei Tong, Yuanchao Bai, Yao Zhu, Junjun Jiang, Xianming Liu | 2026/02 | GitHub | ||
When LLaVA Meets Objects: Token Composition for Vision-Language-Models Soumya Jahagirdar, Walid Bousselham, Anna Kukleva, Hilde Kuehne | 2026/02 | - | ||
OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, Yuanxing Zhang, Jiaheng Liu, Qiang Liu, Pengfei Wan, Liang Wang | 2026/02 | - | ||
PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Ying Li | 2026/02 | GitHub | ||
Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning Dingkun Zhang, Shuhan Qi, Yulin Wu, Xinyu Xiao, Xuan Wang, Long Chen | 2026/02 | GitHub | ||
KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs Baiyang Song, Jun Peng, Yuxin Zhang, Guangyao Chen, Feidiao Yang, Jianyuan Guo | 2026/02 | - | ||
SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass Chen Qian, Xinran Yu, Danyang Li, Guoxuan Chi, Zheng Yang, Qiang Ma | 2026/02 | - | ||
IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning Zhichao Sun, Yidong Ma, Gang Liu, Yibo Chen, Xu Tang, Yao Hu, Yongchao Xu | 2026/02 | GitHub | ||
Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning Yihong Huang, Fei Ma, Yihua Shao, Jingcai Guo, Zitong Yu, Laizhong Cui | 2026/02 | - | ||
CaCoVID: Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning Yinchao Ma, Qiang Zhou, Zhibin Wang, Zhen Song, Jionglong Su | 2026/02 | GitHub | ||
Less Is More -- Until It Breaks: Security Pitfalls of Vision Token Compression in Large Vision-Language Models Yihang Rao et al. | 2026/01 | - | ||
Environment-Aware Adaptive Pruning with Interleaved Inference Orchestration for Vision-Language-Action Models Yuting Huang, Leilei Ding, Zhipeng Tang, Zenghuan Zhu, Jiajun Deng | 2026/01 | - | ||
Learning to Accelerate Vision-Language-Action Models through Adaptive Visual Token Caching Yujie Wei, Jiahan Fan, Jiyu Guo, Ruichen Zhen, Rui Shao, Xiu Su, Zeke Xie, Shuo Yang | 2026/01 | - | ||
CAPA: Contribution-Aware Pruning and FFN Approximation for Efficient Large Vision-Language Models Samyak Jha, Junho Kim | 2026/01 | - | ||
ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning Xiaoshu Chen, Sihang Zhou, Ke Liang, Xinwang Liu | 2026/01 | - | ||
VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration Hanxun Yu, Wentong Li, Xuan Qu, Song Wang, Junbo Chen, Jianke Zhu | 2026/01 | GitHub | ||
ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation Zihao Huang, Jundong Zhou, Xingwei Qu, Qiyang Min, Ge Zhang | 2026/01 | - | ||
Efficient Token Pruning for LLaDA-V Zhewen Wan, Tianchen Song, Chen Lin, Zhiyong Zhao, Xianpeng Lang | 2026/01 | - | ||
Video-KTR: Reinforcing Video Reasoning via Key Token Attribution Ziyue Wang, Sheng Jin, Zhongrong Zuo, Jiawei Wu, Han Qiu, Qi She, Hao Zhang, Xudong Jiang | 2026/01 | GitHub | ||
ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning Wen Luo, Peng Chen, Xiaotao Huang, LiQun Huang | 2026/01 | - | ||
DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models Chenyang Li, Jieyuan Liu, Bin Li, Bo Gao, Yilin Yuan, Yangfan He | 2026/01 | - | ||
Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring Dongxu Zhang, Yiding Sun, Cheng Tan, Wenbiao Yan, Ning Yang | 2026/01 | - | ||
FastAV: Efficient Token Pruning for Audio-Visual Large Language Model Inference Chaeyoung Jung, Youngjoon Jang, Seungwoo Lee, Joon Son Chung | 2026/01 | - | ||
Efficient Visual Question Answering Pipeline for Autonomous Driving via Scene Region Compression Yuliang Cai, Dongqiangzi Ye, Zitian Chen, Chongruo Wu | 2026/01 | - | ||
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng | 2026/01 | GitHub | ||
STC: Accelerating Streaming Video Large Language Models via Hierarchical Token Compression Yiyu Wang, Xuyang Liu, Xiyan Gui, Xinying Lin, Boxue Yang, Chenfei Liao, Tailai Chen, Linfeng Zhang | 2025/12 | GitHub | ||
TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts Hao Zhang, Mengsi Lyu, Bo Huang, Yulong Ao, Yonghua Lin | 2025/12 | - | ||
FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models Kaitong Cai, Jusheng Zhang, Jing Yang, Yijia Fan, Pengtao Xie, Jian Wang, Keze Wang | 2025/12 | - | ||
Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods Chenfei Liao, Wensong Wang, Zichen Wen, Xu Zheng, Yiyu Wang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Xin Zou, Yuqian Fu, Bin Ren, Linfeng Zhang, Xuming Hu | 2025/10 | GitHub | - | - |
Don’t Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention Xin Zou, Di Lu, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Xu Zheng, Linfeng Zhang, Xuming Hu | 2025/10 | GitHub | ||
ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models Jewon Lee, Wooksu Shin, Seungmin Yang, Ki-Ung Song, DongUk Lim, Jaeyeon Kim, Tae-Ho Kim, Bo-Kyeong Kim | 2025/09 | GitHub | ||
Training-Free Pyramid Token Pruning for Efficient Large Vision-Language Models via Region, Token, and Instruction-Guided Importance Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue | 2025/09 | - | ||
Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors Xiangchen Wang, Jinrui Zhang, Teng Wang, Haigang Zhang, Feng Zheng | 2025/09 | GitHub Model | ||
Variation-aware Vision Token Dropping for Faster Large Vision-Language Models Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen | 2025/09 | GitHub | ||
TrimTokenator: Towards Adaptive Visual Token Pruning for Large Multimodal Models Hao Zhang, Mengsi Lyu, Chenrui He, Yulong Ao, Yonghua Lin | 2025/09 | - | ||
Revisiting MLLM Token Technology through the Lens of Classical Visual Coding Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin | 2025/08 | - | - | - |
When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models Huyu Wu, Meng Tang, Xinhan Zheng, Haiyun Jiang | 2025/08 | - | ||
AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance Weichen Zhang, Zhui Zhu, Ningbo Li, Kebin Liu, Yunhao Liu | 2025/08 | - | ||
Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin | 2025/08 | - | ||
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang | 2025/08 | GitHub | ||
A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou | 2025/08 | GitHub Model | ||
MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou, Jun Song, Shanghang Zhang | 2025/08 | - | - | |
AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering Kang Zeng, Guojin Zhong, Jintao Cheng, Jin Yuan, Zhiyong Li | 2025/08 | - | - | - |
CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang | 2025/08 | - | - | - |
PoRe: Position-Reweighted Visual Token Pruning for Vision Language Models Kai Zhao, Wubang Yuan, Alex Lingyu Hung, Dan Zeng | 2025/08 | - | - | - |
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian | 2025/08 | GitHub Project Page | - | - |
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li | 2025/08 | GitHub | ||
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, et al. | 2025/08 | GitHub Model | - | |
LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit Chengtao Lv, Bilang Zhang, Yang Yong, Ruihao Gong, Yushi Huang, Shiqiao Gu, Jiajun Wu, Yumeng Shi, Jinyang Guo, Wenya Wang | 2025/08 | GitHub | - | |
ADMIRE: ADaptive method to enhance Multiple Image REsolutions in text-rich multi-image understanding Qipeng Zhu, Xiong Wang, Zhihong Lu, Jiangwei Lao, Congyun Jin, Jie Chen, Yingzhe Peng, Qi Zhu, Lianzhen Zhong, Jiajia Liu, Peng Wei, Jian Wang | 2025/08 | - | ||
CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, Ruixiang Tang | 2025/08 | - | ||
HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, Bin Chen | 2025/08 | GitHub | ||
METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian | 2025/07 | GitHub | ||
Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata | 2025/07 | - | ||
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang | 2025/07 | GitHub | ||
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang | 2025/07 | GitHub | - | - |
Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI Phat Nguyen, Ngai-Man Cheung | 2025/07 | - | - | - |
Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers Ji Ma, Wei Suo, Peng Wang, Yanning Zhang | 2025/07 | GitHub | ||
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim | 2025/07 | GitHub Project Page | ||
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia | 2025/07 | GitHub Model | ||
EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent Jiaao Li, Kaiyuan Li, Chen Gao, Yong Li, Xinlei Chen | 2025/07 | - | ||
Beyond Token Pruning: Operation Pruning in Vision-Language Models Aoming Liu, Reuben Tan, Boqing Gong, Bryan A. Plummer | 2025/07 | GitHub | ||
LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng | 2025/07 | - | ||
Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment Rui Xu, Yunke Wang, Yong Luo, Bo Du | 2025/06 | - | ||
Learning Compact Vision Tokens for Efficient Large Multimodal Models Hao Tang, Chengchao Shen | 2025/06 | GitHub Model | ||
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang | 2025/06 | GitHub Project Page | - | |
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding Yunzhu Zhang, Yu Lu, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu | 2025/06 | GitHub Project Page Model | - | - |
Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification Minghao Qin, Xiangrui Liu, Zhengyang Liang, Yan Shu, Huaying Yuan, Juenjie Zhou, Shitao Xiao, Bo Zhao, Zheng Liu | 2025/06 | GitHub | ||
Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective Lei Lei, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen, Tong Xu | 2025/06 | - | ||
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs Jiahui Wang, Zuyan Liu, Yongming Rao, Jiwen Lu | 2025/06 | GitHub Project Page | ||
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Xiaojie Jin | 2025/06 | GitHub Project Page Model | ||
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou | 2025/06 | GitHub Model | ||
GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models Ruiguang Pei, Weiqing Sun, Zhihui Fu, Jun Wang | 2025/06 | - | ||
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding Hongzhi Zhang, Jingyuan Zhang, Xingguang Ji, Qi Wang, Fuzheng Zhang | 2025/06 | - | ||
Seed1.5-VL Technical Report ByteDance Seed | 2025/05 | GitHub Demo Homepage | ||
Token Reduction Should Go Beyond Efficiency in Generative Models – From Vision, Language to Multimodality Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik | 2025/05 | GitHub | - | - |
CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen | 2025/05 | GitHub | ||
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen | 2025/05 | GitHub | ||
ToDRE: Visual Token Pruning via Diversity and Task Relevance for Multimodal LLMs Duo Li, Zuhao Yang, Shijian Lu | 2025/05 | - | ||
HoliTom: Holistic Token Merging for Fast Video Large Language Models Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang | 2025/05 | GitHub Project Page | ||
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu | 2025/05 | GitHub | ||
AdaTP: Attention-Debiased Token Pruning for Video Large Language Models Fengyuan Sun, Leqi Shen, Hui Chen, Sicheng Zhao, Jungong Han, Guiguang Ding | 2025/05 | - | ||
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms Shilin Yan, Jiaming Han, Joey Tsai, Hongwei Xue, Rongyao Fang, Lingyi Hong, Ziyu Guo, Ray Zhang | 2025/05 | GitHub | ||
Clapper: Compact Learning and Video Representation in VLMs Lingyu Kong, Hongzhi Zhang, Jingyuan Zhang, Jianzhao Huang, Kunze Li, Qi Wang, Fuzheng Zhang | 2025/05 | - | ||
Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models Xuyang Liu, Yiyu Wang, Junpeng Ma, Linfeng Zhang | 2025/05 | GitHub | ||
Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning Bonan Li, Zicheng Zhang, Songhua Liu, Weihao Yu, Xinchao Wang | 2025/05 | - | ||
Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naïve Integration via Multi-Objective Balanced Covering Yangfu Li, Hongjian Zhan, Tianyi Chen, Qi Liu, Yue Lu | 2025/05 | - | - | - |
Lossless Token Merging Even Without Fine-Tuning in Vision Transformers Jaeyeon Lee, Dong-Wan Choi | 2025/05 | - | - | |
Token Sequence Compression for Efficient Multimodal Computing Yasmine Omri, Parth Shroff, Thierry Tambe | 2025/04 | - | ||
FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding Yanan Guo, Wenhui Dong, Jun Song, Shiding Zhu, Xuan Zhang, Hanqing Yang, Yingbo Wang, Yang Du, Xianing Chen, Bo Zheng | 2025/04 | - | ||
TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models Jaewoo Lee, Keyang Xuan, Chanakya Ekbote, Sandeep Polisetty, Yi R. Fung, Paul Pu Liang | 2025/04 | GitHub | - | |
VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning Run Luo, Renke Shan, Longze Chen, Ziqiang Liu, Lu Wang, Min Yang, Xiaobo Xia | 2025/04 | GitHub | ||
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun | 2025/04 | GitHub Project Page Dataset Model | ||
DYMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong, Silvio Savarese, Heng Ji, Ran Xu | 2025/04 | GitHub Project Page Model | ||
Quicksviewer: An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes Ji Qi, Yuan Yao, Yushi Bai, Bin Xu, Juanzi Li, Zhiyuan Liu, Tat-Seng Chua | 2025/04 | GitHub Project Page | ||
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, Aymen Shabou | 2025/04 | GitHub | ||
QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA Shuai Li, Jian Xu, Xiao-Hui Li, Chao Deng, Lin-Lin Huang | 2025/04 | - | ||
FastVID: Dynamic Density Pruning for Fast Video Large Language Models Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding | 2025/03 | GitHub | ||
Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models Bozhi Luan, Wengang Zhou, Hao Feng, Zhe Wang, Xiaosong Li, Houqiang Li | 2025/03 | GitHub | - | - |
Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory Saket Gurukar, Asim Kadav | 2025/03 | - | ||
TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, Tao Chen | 2025/03 | GitHub | - | - |
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping Weili Zeng, Ziyuan Huang, Kaixiang Ji, Yichao Yan | 2025/03 | GitHub | ||
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, Bo Yuan | 2025/03 | - | ||
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression Dongchen Lu, Yuyao Sun, Zilu Zhang, Leping Huang, Jianliang Zeng, Mao Shu, Huo Cao | 2025/03 | GitHub Model | ||
Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Model Haichao Zhang, Yun Fu | 2025/03 | GitHub Project Page Model | ||
HICom: Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, Hongtao Xie | 2025/03 | GitHub Model | ||
Similarity-Aware Token Pruning: Your VLM but Faster Ahmadreza Jeddi, Negin Baghbanzadeh, Elham Dolatabadi, Babak Taati | 2025/03 | GitHub | ||
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding Xiangrui Liu, Yan Shu, Zheng Liu, Ao Li, Yang Tian, Bo Zhao | 2025/03 | GitHub | ||
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon | 2025/03 | Project Page | ||
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang | 2025/03 | GitHub | ||
When LVLM Meets Large RS Imagery: Coarse-to-Fine Text-Guided Token Pruning Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li | 2025/03 | GitHub Dataset | - | - |
Silent Hazards of Token Reduction in Vision-Language Models Yizheng Sun, Hao Li, Chang Xu, Hongpeng Zhou, Chenghua Lin, Riza Batista-Navarro, Jingyuan Sun | 2025/03 | - | - | - |
Prune and Merge: Efficient Token Compression For Vision Transformer With Spatial Information Preserved Junzhu Mao, Yang Shen, Jinyang Guo, Yazhou Yao, Xiansheng Hua | 2025/03 | GitHub | ||
Qwen2.5-VL Technical Report Qwen Team | 2025/02 | GitHub Model | ||
PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang | 2025/02 | - | - | - |
Stop Looking for Important Tokens in Multimodal Language Models for Token Pruning Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang | 2025/02 | GitHub | ||
FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression Jianjian Li, Junquan Fan, Feng Tang, Gang Huang, Shitao Zhu, Songlin Liu, Nian Xie, Wulong Liu, Yong Liao | 2025/02 | - | ||
Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem? Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, Linfeng Zhang | 2025/02 | - | - | - |
Beyond Token Compression: A Training-Free Reduction Framework for Efficient Visual Processing in MLLMs Hongliang Li, Jiaxin Zhang, Wenhui Liao, Dezhi Peng, Kai Ding, Lianwen Jin | 2025/01 | GitHub | ||
FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal LLMs via Visual Registers Renshan Zhang, Rui Shao, Gongwei Chen, Miao Zhang, Kaiwen Zhou, Weili Guan, Liqiang Nie | 2025/01 | GitHub Project Page | ||
LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models Yizheng Sun, Yanze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, Riza Batista-Navarro | 2025/01 | - | - | - |
FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Quétu, Shuai Xiao, Enzo Tartaglione | 2025/01 | GitHub | - | - |
Compression with Global Guidance: Towards Training-Free High-Resolution MLLMs Acceleration Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen | 2025/01 | GitHub | - | - |
What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou | 2025/01 | GitHub | ||
DyRate: Dynamic Token Reduction during Generation for Vision Language Models Xiaoyu Liang, Chaofeng Guan, Jiaying Lu, Huiyao Chen, Huan Wang, Haoji Hu | 2025/01 | GitHub | ||
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao | 2025/01 | GitHub | - | |
AdaFV: Rethinking of Visual-Language alignment for VLM acceleration Jiayi Han, Liang Du, Yiwen Wu, Xiangguo Zhou, Hongwei Du, Weibo Zheng | 2025/01 | - | ||
Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models Xuyang Liu, Ziming Wang, Junjie Chen, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Linfeng Zhang, Siteng Huang, Honggang Chen | 2025/01 | GitHub | - | - |
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng | 2025/01 | GitHub Model | ||
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang | 2025/01 | GitHub Project Page | ||
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, Limin Wang | 2025/01 | GitHub Demo Model | ||
LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Redundancy Modeling Ke Wang, Hong Xuan | 2024/12 | - | ||
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang | 2024/12 | GitHub | ||
LinVT: Empower Your Image-level Large Language Model to Understand Videos Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, Zheng Zhao | 2024/12 | GitHub | ||
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu | 2024/12 | - | ||
ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu | 2024/12 | - | ||
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models Lianyu Hu, Fanhua Shang, Liang Wan, Wei Feng | 2024/12 | GitHub | - | |
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration Mark Endo, Xiaohan Wang, Serena Yeung-Levy | 2024/12 | GitHub | ||
PruneVid: Visual Token Pruning for Efficient Video Large Language Models Xiaohu Huang, Hao Zhou, Kai Han | 2024/12 | GitHub Project Page | ||
ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding | 2024/12 | GitHub | - | |
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaosheng Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Yao Hu, Shaohui Lin | 2024/12 | GitHub Model-7B Model-13B | ||
FastVLM: Efficient Vision Encoding for Vision Language Models Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari | 2024/12 | GitHub | ||
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai | 2024/12 | GitHub Model | ||
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang | 2024/12 | GitHub | - | |
VisionZip: Longer is Better but Not Necessary in Vision Language Models Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia | 2024/12 | GitHub Demo | ||
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay Jun Zhang, Desen Meng, Zhengming Zhang, Zhenpeng Huang, Tao Wu, Limin Wang | 2024/12 | GitHub Model | ||
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang | 2024/12 | GitHub Project Page | ||
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang | 2024/12 | Project Page | ||
Token Cropr: Faster ViTs for Quite a Few Tasks Benjamin Bergner, Christoph Lippert, Aravindh Mahendran | 2024/12 | - | ||
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and Empirical Findings Qiong Wu, Wenhao Lin, Yiyi Zhou, Weihao Ye, Zhanpeng Zen, Xiaoshuai Sun, Rongrong Ji | 2024/11 | GitHub | ||
freePruner: A Training-free Approach for Large Multimodal Model Acceleration Bingxin Xu, Yuzhang Shang, Yunhao Ge, Qian Lou, Yan Yan | 2024/11 | - | ||
FoPru: Focal Pruning for Efficient Large Vision-Language Models Lei Jiang, Weizhe Huang, Tongxuan Liu, Yuting Zeng, Jing Li, Lechao Cheng, Xiaohua Xu | 2024/11 | - | ||
Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration Yuhang Han, Xuyang Liu, Zihan Zhang, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang | 2024/11 | GitHub Project Page | ||
DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang | 2024/11 | GitHub | ||
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo | 2024/11 | - | ||
Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang | 2024/11 | GitHub | ||
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris M. Kitani, László Jeni | 2024/11 | GitHub Project Page | ||
Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM SooHwan Eom, Jay Shim, Gwanhyeong Koo, Haebin Na, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo | 2024/11 | - | - | |
Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter | 2024/11 | GitHub | ||
Efficient Multi-modal Large Language Models via Visual Token Grouping Minbin Huang, Runhui Huang, Han Shi, Yimeng Chen, Chuanyang Zheng, Xiangguo Sun, Xin Jiang, Zhenguo Li, Hong Cheng | 2024/11 | - | ||
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Jongwoo Park, Kanchana Ranasinghe, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles | 2024/10 | - | ||
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang | 2024/10 | - | ||
Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation Zixin Wang, Dong Gong, Sen Wang, Zi Huang, Yadan Luo | 2024/10 | GitHub | ||
LLaVA-Video: Video Instruction Tuning With Synthetic Data Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li | 2024/10 | GitHub Project Page Model | - | |
Video Token Merging for Long-form Video Understanding Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, Xinyu Li | 2024/10 | - | ||
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra | 2024/10 | GitHub Project Page Model | ||
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin | 2024/10 | GitHub | ||
Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, Mahyar Najibi | 2024/10 | - | ||
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma | 2024/10 | - | ||
PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li | 2024/10 | - | ||
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang | 2024/10 | GitHub Project Page | ||
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning | 2024/10 | GitHub Project Page Model Benchmark | ||
AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su | 2024/10 | GitHub | ||
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou | 2024/09 | GitHub Model | ||
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, Bo Zhao | 2024/09 | GitHub | ||
Fit and Prune: Fast and Training-free Visual Token Pruning for Multimodal Large Language Models Weihao Ye, Qiong Wu, Wenhao Lin, Yiyi Zhou | 2024/09 | GitHub | ||
Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang | 2024/09 | GitHub | ||
NVLM: Open Frontier-Class Multimodal LLMs Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping | 2024/09 | GitHub Project Page Model | ||
TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Consideration Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong | 2024/09 | - | ||
TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings Dawei Yan, Pengcheng Li, Yang Li, Hao Chen, Qingguo Chen, Weihua Luo, Wei Dong, Qingsen Yan, Haokui Zhang, Chunhua Shen | 2024/09 | GitHub | ||
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, Guiguang Ding | 2024/09 | GitHub | ||
Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, Cheng-Lin Liu | 2024/09 | GitHub | ||
LVP: Language-guide Visual Projector for Efficient Multimodal LLM Anonymous Authors | 2024/09 | - | ||
Instruction Tuning-free Visual Token Complement for Multimodal LLMs Dongsheng Wang, Jiequan Cui, Miaoge Li, Wang Lin, Bo Chen, Hanwang Zhang | 2024/08 | - | ||
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou | 2024/08 | GitHub Model | ||
LLaVA-OneVision: Easy Visual Task Transfer Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li | 2024/08 | GitHub Project Page Model | - | |
HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji | 2024/08 | GitHub | ||
Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer Shuai Peng, Di Fu, Baole Wei, Yong Cao, Liangcai Gao, Zhi Tang | 2024/08 | - | ||
Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning Shibo Jie, Yehui Tang, Jianyuan Guo, Zhi-Hong Deng, Kai Han, Yunhe Wang | 2024/08 | GitHub | ||
Dynamic and Compressive Adaptation of Transformers From Images to Videos Guozhen Zhang, Jingyu Liu, Shengming Cao, Xiaotong Zhao, Kevin Zhao, Kai Ma, Limin Wang | 2024/08 | - | - | |
TokenPacker: Efficient Visual Projector for Multimodal LLM Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang | 2024/07 | GitHub Model | ||
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan | 2024/07 | GitHub | ||
Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie | 2024/07 | GitHub | ||
HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models Runhui Huang, Xinpeng Ding, Chunwei Wang, Jianhua Han, Yulong Liu, Hengshuang Zhao, Hang Xu, Lu Hou, Wei Zhang, Xiaodan Liang | 2024/07 | - | ||
LookupViT: Compressing visual information to a limited number of tokens Rajat Koner, Gagan Jain, Prateek Jain, Volker Tresp, Sujoy Paul | 2024/07 | - | ||
Efficient Large Multi-modal Models via Visual Context Compression Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille | 2024/06 | GitHub | ||
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing | 2024/06 | GitHub Model Demo | ||
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan | 2024/06 | GitHub | ||
VoCo-LLaMA: Towards Vision Compression with Large Language Models Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Yansong Tang | 2024/06 | GitHub Project Page | ||
Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji | 2024/05 | GitHub | ||
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou | 2024/05 | GitHub | ||
Matryoshka Query Transformer for Large Vision-Language Models Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang | 2024/05 | GitHub Model Project Page | ||
Matryoshka Multimodal Models Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee | 2024/05 | GitHub Project Page | ||
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, et al. | 2024/04 | GitHub Model | ||
LongVLM: Efficient Long Video Understanding via Large Language Models Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang | 2024/04 | GitHub | ||
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng | 2024/04 | GitHub Project Page | ||
CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference Ruqi Liao, Chuqing Zhao, Jin Li, Weiqi Feng | 2024/04 | - | ||
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan | 2024/03 | GitHub Project Page | ||
PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation Yizhe Xiong, Hui Chen, Tianxiang Hao, Zijia Lin, Jungong Han, Yuesong Zhang, Guoxin Wang, Yongjun Bao, Guiguang Ding | 2024/03 | GitHub | ||
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang | 2024/03 | GitHub | ||
MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Vision-Language Tasks Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen | 2024/03 | GitHub | ||
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen | 2024/02 | GitHub Model | ||
Honeybee: Locality-enhanced Projector for Multimodal LLM Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh | 2023/12 | GitHub | ||
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models Yanwei Li, Chengyao Wang, Jiaya Jia | 2023/11 | GitHub Project Page Model | ||
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, Li Yuan | 2023/11 | GitHub Model | ||
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou | 2023/10 | GitHub | ||
PPT: Token Pruning and Pooling for Efficient Vision Transformers Xinjian Wu, Fanhu Zeng, Xiudong Wang, Xinghao Chen | 2023/10 | GitHub | ||
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou | 2023/08 | GitHub | ||
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang | 2023/07 | GitHub Project Page | ||
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan | 2023/06 | GitHub | ||
DiffRate: Differentiable Compression Rate for Efficient Vision Transformers Mengzhao Chen, Wenqi Shao, Peng Xu, Mingbao Lin, Kaipeng Zhang, Fei Chao, Rongrong Ji, Yu Qiao, Ping Luo | 2023/05 | GitHub | - | - |
SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models Zekun Wang, Jingchang Chen, Wangchunshu Zhou, Haichao Zhu, Jiafeng Liang, Liping Shan, Ming Liu, Dongliang Xu, Qing Yang, Bing Qin | 2023/05 | - | ||
PuMer: Pruning and Merging Tokens for Efficient Vision Language Models Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi | 2023/05 | GitHub | ||
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi | 2023/05 | GitHub | ||
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, Jiaqi Wang | 2023/05 | GitHub Model | ||
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny | 2023/04 | GitHub Project Page | ||
Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang | 2023/04 | GitHub | - | - |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi | 2023/01 | GitHub | ||
Token Turing Machines Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab | 2022/11 | GitHub | ||
Token Merging: Your ViT But Faster Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman | 2022/10 | GitHub | ||
Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention Xiangcheng Liu, Tianyi Wu, Guodong Guo | 2022/09 | - | - | - |
Flamingo: a Visual Language Model for Few-Shot Learning Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. | 2022/04 | GitHub | ||
EViT: Expediting Vision Transformers via Token Reorganizations Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie | 2022/02 | GitHub | - | - |
Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space Arnav Chavan, Zhiqiang Shen, Zhuang Liu, Zechun Liu, Kwang-Ting Cheng, Eric Xing | 2022/01 | GitHub | - | - |
A-ViT: Adaptive Tokens for Efficient Vision Transformer Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, Pavlo Molchanov | 2021/12 | GitHub Project Page | - | - |
ATS: Adaptive Token Sampling For Efficient Vision Transformers Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, Juergen Gall | 2021/11 | GitHub Project Page | - | - |
Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, Xing Sun | 2021/08 | GitHub | - | - |
Patch Slimming for Efficient Vision Transformers Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, Dacheng Tao | 2021/06 | - | - | - |
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh | 2021/06 | GitHub Project Page | - | - |
📈 Benchmark (Coming Soon)
We compiled the image and video understanding benchmarks commonly used in token pruning studies and built a unified evaluation framework on top of them. With this framework, users can evaluate a method on 26 benchmarks (15 image-based and 11 video-based) in a single pass, giving an overview of its overall capabilities.
The dataset and evaluation scripts are ready and will be released here shortly.
📌 Citation
If you find our paper or this resource helpful, please consider citing:
@misc{yao2026towards,
title = {Towards Efficient Multimodal Large Language Models: A Survey on Token Compression},
author = {Yao, Linli and Xing, Long and Shi, Yang and Li, Sida and Liu, Yuanxin and
Dong, Yuhao and Zhang, Yi-Fan and Li, Lei and Dong, Qingxiu and Dong, Xiaoyi and
Huang, Qidong and Wang, Haotian and Wu, Feng and Zhang, Yuanxing and Wan, Pengfei and
Lin, Zhouchen and Sun, Xu},
year = {2026},
month = jan,
howpublished = {TechRxiv},
doi = {10.36227/techrxiv.176823010.07236701/v1},
url = {https://doi.org/10.36227/techrxiv.176823010.07236701/v1}
}
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.