Towards Efficient Multimodal Large Language Models: A Survey on Token Compression

June 9, 2026


📢 Contributions Welcome

We appreciate contributions that help improve this repository and the accompanying paper. Please feel free to submit a pull request to:

  1. Add a missing or relevant paper.
  2. Propose a more suitable category or tag.
  3. Update or correct information (links, metadata, status).
  4. Address potential issues in benchmarking.
  5. Request clarification or report an issue.

Thank you — every suggestion helps make this resource more useful.

⭐ If you find this repo useful, please give us a star :)

✒️ Table of Contents

  • News: Latest Updates, News, and Announcements.
  • About: Overview and Objectives.
  • Tag Description: Brief Explanation of Tags in Paper Table.
  • Paper Table: Paper Index (by Year, Descending).
  • Benchmark: An overview of our proposed benchmark for MLLM token compression.
  • Citation: If you find this helpful, please consider citing us.

🔥 News

  • [2025.12.18] We've released the first version (v1.0) of the survey, which can be downloaded here.
  • [2025.11.26] We've released the repository!

☀️ About

Multimodal Large Language Models (MLLMs) are rapidly expanding their capabilities, but high-resolution images and long videos produce extremely long visual-token sequences that sharply increase compute, memory, and latency requirements. This repository accompanies our survey, Towards Efficient Multimodal Large Language Models: A Survey on Token Compression (TechRxiv), and is intended to help researchers and practitioners navigate this field.

Motivation. Token compression reduces the number of visual tokens processed by MLLMs while preserving critical cross-modal semantics, enabling more efficient training and faster inference without large accuracy regressions. The field is fragmented across encoders, projectors, and LLM-side techniques; a centralized, searchable resource is needed.
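To make the idea concrete, the following is a minimal sketch of score-based token pruning, one family of methods indexed in the table. It assumes every visual token already carries an importance score (for example, derived from attention); the function name `prune_tokens`, its parameters, and the toy scores are hypothetical and do not come from any specific paper.

```python
from typing import List, Sequence, Tuple


def prune_tokens(
    tokens: Sequence[str], scores: Sequence[float], keep: int
) -> Tuple[List[str], float]:
    """Keep the `keep` highest-scoring tokens, preserving their original order.

    Returns the kept tokens and the retention ratio (kept / total).
    The scoring itself is assumed to happen elsewhere.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Rank token indices by score (highest first); ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Restore positional order so the language model sees tokens in sequence.
    kept_idx = sorted(ranked[:keep])
    kept = [tokens[i] for i in kept_idx]
    return kept, keep / len(tokens)


if __name__ == "__main__":
    # Toy example: four placeholder visual tokens with made-up scores.
    kept, retention = prune_tokens(["v0", "v1", "v2", "v3"], [0.1, 0.9, 0.3, 0.8], keep=2)
    print(kept, retention)
```

Merging-based methods differ in that they combine similar tokens instead of discarding low-scoring ones, but the budget logic (a fixed or dynamic number of retained tokens) is similar.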

Target audience. Researchers, implementers, and system designers working on multimodal models, retrieval, efficient vision-language pipelines, and deployment at scale.

What this repo provides.

  • A curated, searchable, chronologically organized paper index.
  • Short annotated entries with metadata (method family, compression ratio, retrain vs plugin, modality).
  • Links to code, checkpoints, and reproducibility notes where available.
  • An overview of our proposed evaluation suite and benchmarks for MLLM token compression.
  • Contribution guidelines and templates for adding papers or methods.

Feel free to browse the table, open issues, or contribute entries to help grow a rigorous, practical ecosystem for efficient multimodal modeling.

📋 Tag Description

  • arXiv badge (red): arXiv papers
  • PDF badge (blue): conference/journal papers
  • GitHub badge (white): GitHub repositories
  • Research Areas badge (purple): modality
  • Position badge (cyan): compression position
  • Text Query badge (bright green): whether the method is text query-based
  • Method badge (light grey): compression method, merge or pruning
  • Mode badge (yellow): usage mode, re-train or plug-in
  • Speed badge (orange): acceleration stage, train stage or inference stage
  • Ratio badge (pink): compression ratio, fix or dynamic
  • Train_Infer badge (yellow-green): usage stage
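For readers who want to work with the table programmatically, the tags above can be modeled as a small record type. This is only a sketch: the class `PaperEntry`, the helper `filter_entries`, the field names, and the placeholder entries are all hypothetical, and the repository does not ship such a schema. The allowed values for method, mode, and ratio are taken from the tag descriptions; the modality values are assumed.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

# Allowed values mirror the tag descriptions; the names here are hypothetical.
METHODS = {"merge", "pruning"}
MODES = {"re-train", "plug-in"}
RATIOS = {"fix", "dynamic"}


@dataclass(frozen=True)
class PaperEntry:
    title: str
    date: str                          # "YYYY/MM", as in the paper table
    modality: Tuple[str, ...]          # assumed values, e.g. ("image",)
    methods: Tuple[str, ...]           # subset of METHODS
    mode: str                          # one of MODES
    ratio: str                         # one of RATIOS
    text_query: Optional[bool] = None  # None when no value is recorded

    def __post_init__(self) -> None:
        unknown = set(self.methods) - METHODS
        if unknown:
            raise ValueError(f"unknown method tag(s): {sorted(unknown)}")
        if self.mode not in MODES:
            raise ValueError(f"unknown mode tag: {self.mode!r}")
        if self.ratio not in RATIOS:
            raise ValueError(f"unknown ratio tag: {self.ratio!r}")


def filter_entries(
    entries: Iterable[PaperEntry],
    method: Optional[str] = None,
    mode: Optional[str] = None,
) -> List[PaperEntry]:
    """Return entries matching every filter that is not None."""
    result = []
    for entry in entries:
        if method is not None and method not in entry.methods:
            continue
        if mode is not None and entry.mode != mode:
            continue
        result.append(entry)
    return result


if __name__ == "__main__":
    # Placeholder entries, not real papers from the table.
    papers = [
        PaperEntry("Placeholder A", "2026/01", ("image",), ("pruning",), "plug-in", "dynamic"),
        PaperEntry("Placeholder B", "2025/12", ("video",), ("merge",), "re-train", "fix"),
    ]
    for paper in filter_entries(papers, method="pruning", mode="plug-in"):
        print(paper.title)
```

Validating tag values at construction time keeps typos such as "plugin" from silently creating a new category.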

📚 Paper Table

Title & Authors | Date | Links | Modality & Position | Tags
arXiv
StateKV: Linear Scaling Video VLMs for Long Video Understanding
Cristobal Eyzaguirre, Jiajun Wu, Juan Carlos Niebles
2026/05 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
EGOSTREAM: A Diagnostic Benchmark for Streaming Episodic Memory in Egocentric Vision
Rosario Forte, Giuseppe Lando, Antonino Furnari
2026/05 | - | Type | -
PDF
VisionPulse: Dynamic Visual Sparsity for Efficient Multimodal Reasoning
Hengbo Xu, Shengjie Jin, Yanbiao Ma, Zhiwu Lu
2026/05 | - | Area Stage | Text Query
Method
Approach
Ratio
Speed
PDF
EarlyTom: Early Token Compression Completes Fast Video Understanding
Hesong Wang, Xin Jin, Lu Lu, Chenhaowen Li, Jian Chen, Qiang Liu, Huan Wang
2026/05 | Project | Area Stage Stage | Text Query
Method Method
Approach
Ratio
arXiv Star
SEATS: Stage-adaptive Token Selection for Efficient Omni-modal LLMs
Zijie Xin, Jie Yang, Ruixiang Zhao, Tianyi Wang, Fengyun Rao, Jing Lyu, Xirong Li
2026/05 | GitHub | Area Area Stage Stage | Text Query
Method
Approach
Ratio
arXiv Star
Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models (VIF)
Xinpeng Dong, Min Zhang, Kairong Han, Xu Tan, Fei Wu, Kun Kuang
2026/05 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
arXiv
LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs
Jihwan Kim, Nikhil Parthasarathy, Danfeng Qin, Junhwa Hur, Deqing Sun, Bohyung Han, Ming-Hsuan Yang, Boqing Gong
2026/05 | Project | Area Stage | Text Query
Method
Approach
Ratio
arXiv
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
Mingzhe Huang, Weijun Wang, Xin Ding, Liang Mi, Hao Wen, Yuanchun Li, Lichen Pang, Shansong Yang, Yunxin Liu, Ting Cao
2026/05 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv Star
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models (LiteLVLM)
Sangin Lee, Yukyung Choi
2026/05 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
arXiv
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
Kechen Fang, Yihua Qin, Chongyi Wang, Wenshuo Ma, Tianyu Yu, Yuan Yao
2026/05 | - | Area Stage Stage | Text Query
Method
Approach
Ratio
PDF Star
ETCTrack: An Efficient Token Compression Framework for Visual Object Tracking
Weijing Wu, Qihua Liang, Bineng Zhong, Haiying Xia, Zhiyi Mo, Shuxiang Song
2026/05 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
arXiv
SAVEMem: Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
Hang Wu, Sherin Mary Mathews, Yujun Cai, Ming-Hsuan Yang, Yiwei Wang
2026/05 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv Star
VLMaxxing through FrameMogging: Training-Free Anti-Recomputation for Video Vision-Language Models
JF Bastien, Sam D'Amico
2026/05 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
PDF
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
Rinyoichi Takezoe, Yi Li, Zhe Bo, Atsuhiro Hou, Guang Meng, Keisuke Long
2026/04 | - | Area Stage Stage | Text Query
Method
Approach
Speed
Ratio
arXiv Star
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs (SToP)
Kibum Kim, Joonhwan Kim, Kyoungmin Min, Yuxuan Wang, Jeonghun Moon, Julian McAuley, Chanwoo Park
2026/04 | GitHub | Area Stage | Text Query
Method
Approach
Speed
Ratio
arXiv
ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving
Lin Sha, Haiyun Guo, Tao Wang, Cong Zhang, Min Huang, Jinqiao Wang, Qinghai Miao
2026/04 | - | Area Area Stage | Text Query
Method
Approach
Ratio
PDF
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
Jiatao Qu, Fengwei Zhou, Wenjing Li, Tao Wu, Guanxiong Xue, Zhicheng Zhao, Dong Wei, Yu Lu, Byunghan Na
2026/04 | - | Area Stage | Text Query
Method
Approach
Speed
Ratio
arXiv Star
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding (XComp)
Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat, Yu-Xiong Wang
2026/04 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
PDF Star
MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis
Simin Huo, Ni Li
2026/04 | GitHub | Area Area Stage | Text Query
Method
Approach
Speed
Ratio
arXiv
VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization
Andrei Atanov, Jesse Allardice, Roman Bachmann, Oğuzhan Fatih Kar, Devon Hjelm, David Griffiths, Peter Fu, Afshin Dehghan, Amir Zamir
2026/04 | Project | Area Stage | Text Query
Method
Approach
Ratio
arXiv Star
Tango: Taming Visual Signals for Efficient Video Large Language Models
Shukang Yin, Sirui Zhao, Hanchao Wang, Baozhi Jia, Xianquan Wang, Chaoyou Fu, Enhong Chen
2026/04 | GitHub | Area Stage Stage | Text Query
Method Method
Approach
Ratio
PDF
Do Vision Language Models Need to Process Image Tokens?
Soumya Ghosh, R.V. Babu, Chirag Agarwal
2026/04 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
Tempo: Small Vision-Language Models are Smart Compressors for Long Video Understanding
Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu
2026/04 | Project | Area Stage | Text Query
Method
Approach
Ratio
arXiv
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
Qihui Zhu, Tao Zhang, Yuchen Wang, Zijian Wen, Mengjie Zhang, Shuangwu Chen, Xiaobin Tan, Jian Yang, Yang Liu, Zhenhua Dong, Xianzhi Yu, Yinfei Pan
2026/04 | - | Area Stage | Text Query
Method
Approach
Speed
Ratio
arXiv
ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference
Zhaohong Huang, Wei Liu, Yuxin Zhang, Fei Chao, Rongrong Ji
2026/04 | - | Area Stage | Text Query
Method
Approach
Speed
Ratio
arXiv
MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs
Jiameng Li, Aleksei Tiulpin, Matthew B. Blaschko
2026/04 | - | Area Area Stage | Text Query
Method
Approach
Ratio
PDF Star
GroundVTS: Visual Token Sampling in MLLMs for Video Temporal Grounding
Rong Fan, Kaiyan Xiao, Minghao Zhu, Liuyi Wang, Kai Dai, Zhao Yang
2026/04 | GitHub | Area Stage Stage | Text Query
Method
Approach
Ratio
arXiv Star
PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding
Yanjie Zhou, Yuxin Zhang, Jun Chen et al.
2026/04 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
PDF
Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention
Junhao Du et al.
2026/03 | - | Area Stage Stage | Text Query
Method Method
Approach
Ratio
arXiv
AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
Haozhe Qi, Kevin Qu, Mahdi Rad et al.
2026/03 | Project | Area Stage | Text Query
Method
Approach
Ratio
arXiv
SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering
Wenli Li, Kai Zhao, Haoran Jiang et al.
2026/03 | - | Area Area Stage | Text Query
Method
Approach
Ratio
arXiv
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee
2026/03 | - | Area Stage Stage | Text Query
Method
Approach
Ratio
arXiv
Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs (AwaRes)
Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz
2026/03 | Project | Area Stage | Text Query
Method
Approach
Ratio
arXiv
Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models
Sijie Li, Biao Qian, Jungong Han
2026/03 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
Balancing Saliency and Coverage: Semantic Prominence-Aware Budgeting for Visual Token Compression in VLMs
Jaehoon Lee, Mingi Jung, Soohyuk Jang, Seungryong Yoo, Dahuin Jung, Sungroh Yoon
2026/03 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv Star
Flash-Unified: A Training-Free and Task-Aware Acceleration Framework for Native Unified Models
Junlong Ke, Zichen Wen, Boxue Yang, Yantai Yang, Xuyang Liu, Chenfei Liao, Zhaorun Chen, Shaobo Wang, Linfeng Zhang
2026/03 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
arXiv Star
DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression
Bingzhou Li, Tao Huang
2026/03 | GitHub | Area Area Stage | Text Query
Method
Approach
Ratio
arXiv
ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference
Surendra Pathak, Bo Han
2026/03 | - | Area Stage | Text Query
Method Method
Approach
Ratio
PDF
AutoGaze: Attend Before Attention — Efficient and Scalable Video Understanding via Autoregressive Gazing
Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin
2026/03 | Project | Area Area Stage | Text Query
Method
Approach
Ratio
arXiv Star
ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models
Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao
2026/03 | GitHub | Area Area Stage | Text Query
Method
Approach
Ratio
arXiv
E-AdaPrune: Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models
Jialuo He, Huangxun Chen
2026/03 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
SemVID: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding
Jiaqi Li, Shuntian Zheng, Yixian Shen, Xiangru Jian, Zhiqi Li, Yuncheng Li
2026/03 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs
Yuhao Chen, Bin Shan, Xin Ye, Shu Wang, Jiashu Zhang
2026/03 | - | Area Area Stage | Text Query
Method
Approach
Ratio
arXiv
AOT: Token Reduction via Local and Global Contexts Optimization for Efficient Video LLMs
Jinlong Li, Xinyu Li, Trong-Tung Nguyen, Yong Jae Lee, Nicu Sebe
2026/03 | Project | Area Stage | Text Query
Method
Approach
Ratio
PDF Star
AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in LVLMs
Changwoo Baek, Jouwon Song, Seunghun Lee, Jong-Ok Kim
2026/03 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
arXiv
TC-SSA: Token Compression via Semantic Slot Aggregation for Gigapixel Pathology Reasoning
Zhuo Chen, Shawn Young, Di Liu, Yuan Zhang, Wentao Zheng, Yilin Jia, Lijian Xu
2026/03 | Code | Area Stage | Text Query
Method
Approach
Ratio
arXiv
Stateful Token Reduction for Long-Video Hybrid VLMs
Jindong Jiang, Amala Sanjay Deshmukh, Kateryna Chumachenko, Karan Sapra, Zhiding Yu, Guilin Liu
2026/03 | - | Area Stage | Text Query
Method
Approach
Ratio
PDF Star
PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models
Mouxiao Huang, Borui Jiang, Dehua Zheng, Hailin Hu, Kai Han, Xinghao Chen
2026/02 | GitHub | Area Area Stage Stage | Text Query
Method Method
Approach Approach
Ratio Train_Infer Train_Infer
PDF Star
HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit
Hao Wu, Yingqi Fan, Jinyang Dai, Junlong Tong, Yunpu Ma, Xiaoyu Shen
2026/02 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
arXiv
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao
2026/02 | - | Area Stage | -
PDF Star
OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport
Xiwen Chen, Wenhui Zhu, Gen Li, Xuanzhao Dong, Yujian Xiong, Hao Wang
2026/02 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
PDF Star
ApET: Approximation-Error Guided Token Compression for Efficient VLMs
Qiankun Ma, Ziyao Zhang, Haofei Wang, Jie Chen, Zhen Song, Hairong Zheng
2026/02 | GitHub | Area Area Stage | Text Query
Method
Approach
Ratio
PDF Star
DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
Aditya Kumar Singh, Hsin-Pai Cheng, Sairam Sundaresan, Bhavin Jawade, Karthikeyan Saravanan, Saransh Rajput, Jinjun Xiong, Syed Zawad, Elias B. Khalil, Vijaykrishnan Narayanan
2026/02 | GitHub | Area Area Stage Stage | Text Query
Method Method
Approach
Ratio
arXiv Star
EntropyPrune: Matrix Entropy Guided Visual Token Pruning for MLLMs
Yahong Wang, Juncheng Wu, Zhangkai Ni, Chengmei Yang, Yihang Liu, Longzhen Yang
2026/02 | GitHub | Area Area Stage | Text Query
Method
Approach
Ratio
arXiv
IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs
Yifan Tan, Yifu Sun, Shirui Huang, Hong Liu, Guanghua Yu, Jianchen Zhu
2026/02 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
Vision Token Reduction via Attention-Driven Self-Compression for Efficient MLLMs
Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan
2026/02 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving
Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Ho Gun Park, Il Yong Chun
2026/02 | - | Area Stage | Text Query
Method Method
Approach
Ratio
arXiv
TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models
Xiangtian Zheng, Zishuo Wang, Yuxin Peng
2026/02 | - | Area Stage | Text Query
Method Method
Approach
Ratio
PDF Star
FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging
Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang, Zhuotao Tian
2026/02 | GitHub | Area Stage | Text Query
Method Method
Approach
Ratio
arXiv Star
Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning
Enwei Tong, Yuanchao Bai, Yao Zhu, Junjun Jiang, Xianming Liu
2026/02 | GitHub | Area Stage | Text Query
Method Method
Approach
Ratio
arXiv
When LLaVA Meets Objects: Token Composition for Vision-Language-Models
Soumya Jahagirdar, Walid Bousselham, Anna Kukleva, Hilde Kuehne
2026/02 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, Yuanxing Zhang, Jiaheng Liu, Qiang Liu, Pengfei Wan, Liang Wang
2026/02 | - | Area Area Stage | Text Query
Method
Approach
Ratio
arXiv Star
PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective
Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Ying Li
2026/02 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
arXiv Star
Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning
Dingkun Zhang, Shuhan Qi, Yulin Wu, Xinyu Xiao, Xuan Wang, Long Chen
2026/02 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
arXiv
KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs
Baiyang Song, Jun Peng, Yuxin Zhang, Guangyao Chen, Feidiao Yang, Jianyuan Guo
2026/02 | - | Area Stage Stage | Text Query
Method
Approach
Ratio
arXiv
SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass
Chen Qian, Xinran Yu, Danyang Li, Guoxuan Chi, Zheng Yang, Qiang Ma
2026/02 | - | Area Stage | Text Query
Method
Approach
Ratio
PDF Star
IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning
Zhichao Sun, Yidong Ma, Gang Liu, Yibo Chen, Xu Tang, Yao Hu, Yongchao Xu
2026/02 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
arXiv
Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning
Yihong Huang, Fei Ma, Yihua Shao, Jingcai Guo, Zitong Yu, Laizhong Cui
2026/02 | - | Area Stage Stage | Text Query
Method Method
Approach
Ratio
PDF Star
CaCoVID: Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning
Yinchao Ma, Qiang Zhou, Zhibin Wang, Zhen Song, Jionglong Su
2026/02 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
arXiv
Less Is More -- Until It Breaks: Security Pitfalls of Vision Token Compression in Large Vision-Language Models
Yihang Rao et al.
2026/01 | - | Area Stage Stage | Method
Method
arXiv
Environment-Aware Adaptive Pruning with Interleaved Inference Orchestration for Vision-Language-Action Models
Yuting Huang, Leilei Ding, Zhipeng Tang, Zenghuan Zhu, Jiajun Deng
2026/01 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
Learning to Accelerate Vision-Language-Action Models through Adaptive Visual Token Caching
Yujie Wei, Jiahan Fan, Jiyu Guo, Ruichen Zhen, Rui Shao, Xiu Su, Zeke Xie, Shuo Yang
2026/01 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
CAPA: Contribution-Aware Pruning and FFN Approximation for Efficient Large Vision-Language Models
Samyak Jha, Junho Kim
2026/01 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning
Xiaoshu Chen, Sihang Zhou, Ke Liang, Xinwang Liu
2026/01 | - | Area Stage | Text Query
Method
Approach
Ratio
PDF Star
VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration
Hanxun Yu, Wentong Li, Xuan Qu, Song Wang, Junbo Chen, Jianke Zhu
2026/01 | GitHub | Area Area Stage | Text Query
Method Method
Approach
Ratio
arXiv
ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation
Zihao Huang, Jundong Zhou, Xingwei Qu, Qiyang Min, Ge Zhang
2026/01 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
Efficient Token Pruning for LLaDA-V
Zhewen Wan, Tianchen Song, Chen Lin, Zhiyong Zhao, Xianpeng Lang
2026/01 | - | Area Stage | Text Query
Method
Approach
Ratio
PDF Star
Video-KTR: Reinforcing Video Reasoning via Key Token Attribution
Ziyue Wang, Sheng Jin, Zhongrong Zuo, Jiawei Wu, Han Qiu, Qi She, Hao Zhang, Xudong Jiang
2026/01 | GitHub | Area Stage | Text Query
Method
Approach
Ratio
arXiv
ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning
Wen Luo, Peng Chen, Xiaotao Huang, LiQun Huang
2026/01 | - | Area Area Stage Stage | Text Query
Method
Approach
Ratio
arXiv
DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models
Chenyang Li, Jieyuan Liu, Bin Li, Bo Gao, Yilin Yuan, Yangfan He
2026/01 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring
Dongxu Zhang, Yiding Sun, Cheng Tan, Wenbiao Yan, Ning Yang
2026/01 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
FastAV: Efficient Token Pruning for Audio-Visual Large Language Model Inference
Chaeyoung Jung, Youngjoon Jang, Seungwoo Lee, Joon Son Chung
2026/01 | - | Area Area Stage | Text Query
Method
Approach
Ratio
arXiv
Efficient Visual Question Answering Pipeline for Autonomous Driving via Scene Region Compression
Yuliang Cai, Dongqiangzi Ye, Zitian Chen, Chongruo Wu
2026/01 | - | Area Stage | Text Query
Method
Approach
Ratio
PDF Star
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng
2026/01 | GitHub | Area Stage Stage | Text Query
Method
Approach
Ratio
PDF Star
STC: Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
Yiyu Wang, Xuyang Liu, Xiyan Gui, Xinying Lin, Boxue Yang, Chenfei Liao, Tailai Chen, Linfeng Zhang
2025/12 | GitHub | Area Stage Stage | Text Query
Method
Approach
Ratio
arXiv
TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts
Hao Zhang, Mengsi Lyu, Bo Huang, Yulong Ao, Yonghua Lin
2025/12 | - | Area Stage | Text Query
Method
Approach
Ratio
arXiv
FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models
Kaitong Cai, Jusheng Zhang, Jing Yang, Yijia Fan, Pengtao Xie, Jian Wang, Keze Wang
2025/12 | - | Area Area Stage | Text Query
Method
Approach
Ratio
arXiv Star
Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods
Chenfei Liao, Wensong Wang, Zichen Wen, Xu Zheng, Yiyu Wang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Xin Zou, Yuqian Fu, Bin Ren, Linfeng Zhang, Xuming Hu
2025/10 | GitHub | - | -
PDF Star
Don’t Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention
Xin Zou, Di Lu, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Xu Zheng, Linfeng Zhang, Xuming Hu
2025/10 | GitHub | Area Area Stage | Text Query
Method
Approach
Speed
PDF Star
ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models
Jewon Lee, Wooksu Shin, Seungmin Yang, Ki-Ung Song, DongUk Lim, Jaeyeon Kim, Tae-Ho Kim, Bo-Kyeong Kim
2025/09 | GitHub | Area Stage | Text Query
Method
Approach
Speed
Ratio
arXiv
Training-Free Pyramid Token Pruning for Efficient Large Vision-Language Models via Region, Token, and Instruction-Guided Importance
Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue
2025/09 | - | Area Stage | Text Query
Method
Approach
Speed
arXiv Star
Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors
Xiangchen Wang, Jinrui Zhang, Teng Wang, Haigang Zhang, Feng Zheng
2025/09 | GitHub, Model | Area Stage | Text Query
Approach
Speed
arXiv Star
Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen
2025/09 | GitHub | Area Area Stage | Text Query
Method
Approach
Speed
arXiv
TrimTokenator: Towards Adaptive Visual Token Pruning for Large Multimodal Models
Hao Zhang, Mengsi Lyu, Chenrui He, Yulong Ao, Yonghua Lin
2025/09 | - | Area Area Stage | Text Query
Method
Approach
Speed
Ratio
arXiv
Revisiting MLLM Token Technology through the Lens of Classical Visual Coding
Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin
2025/08 | - | - | -
arXiv
When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models
Huyu Wu, Meng Tang, Xinhan Zheng, Haiyun Jiang
2025/08 | - | Area Stage | Text Query
Method
Approach
Speed
Ratio
arXiv
AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance
Weichen Zhang, Zhui Zhu, Ningbo Li, Kebin Liu, Yunhao Liu
2025/08 | - | Area Stage | Text Query
Method
Approach
Speed
Ratio
arXiv
Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models
Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin
2025/08 | - | Area Stage | Text Query
Method
Approach
Speed
Ratio
Train_Infer
Publish Star
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang
2025/08 | GitHub | Area Area Stage Stage | Text Query
Method Method
Approach Approach
Speed
Ratio
Train_Infer Train_Infer
arXiv Star
A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models
Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou
2025/08 | GitHub, Model | Area Stage | Text Query
Method
Approach
Speed
Ratio
arXiv
MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs
Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou, Jun Song, Shanghang Zhang
2025/08 | - | Area | -
arXiv
AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering
Kang Zeng, Guojin Zhong, Jintao Cheng, Jin Yuan, Zhiyong Li
2025/08 | - | - | -
arXiv
CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models
Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang
2025/08 | - | - | -
arXiv
PoRe: Position-Reweighted Visual Token Pruning for Vision Language Models
Kai Zhao, Wubang Yuan, Alex Lingyu Hung, Dan Zeng
2025/08 | - | - | -
arXiv Star
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian
2025/08 | GitHub, Project Page | - | -
PDF Star
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li
2025/08 | GitHub | Area Stage | Text Query
Method Method
Approach
Speed
arXiv Star
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, et al.
2025/08 | GitHub, Model | Area Area | -
arXiv Star
LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit
Chengtao Lv, Bilang Zhang, Yang Yong, Ruihao Gong, Yushi Huang, Shiqiao Gu, Jiajun Wu, Yumeng Shi, Jinyang Guo, Wenya Wang
2025/08 | GitHub | Area Area | -
KDD
ADMIRE: ADaptive method to enhance Multiple Image REsolutions in text-rich multi-image understanding
Qipeng Zhu, Xiong Wang, Zhihong Lu, Jiangwei Lao, Congyun Jin, Jie Chen, Yingzhe Peng, Qi Zhu, Lianzhen Zhong, Jiajia Liu, Peng Wei, Jian Wang
2025/08 | - | Area Area Stage Stage | Text Query
Method
Approach Approach
Speed
Ratio
Train_Infer Train_Infer
arXiv
CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning
Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, Ruixiang Tang
2025/08 | - | Area Stage Stage | Text Query
Method
Approach
Speed
Ratio
Train_Infer Train_Infer
arXiv Star
HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models
Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, Bin Chen
2025/08 | GitHub | Area Stage | Text Query
Method
Approach
Speed
Ratio
Train_Infer
ICCV Star
METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian
2025/07 | GitHub | Area Stage Stage Stage | Text Query
Method
Approach
Speed
Ratio
arXiv
Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study
Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata
2025/07 | - | Area | Method
Approach
Speed
arXiv Star
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang
2025/07 | GitHub | Area Stage | Text Query
Method
Approach
Speed
Ratio
arXiv Star
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression
Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
2025/07 | GitHub | - | -
arXiv
Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI
Phat Nguyen, Ngai-Man Cheung
2025/07 | - | - | -
ACM MM Star
Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers
Ji Ma, Wei Suo, Peng Wang, Yanning Zhang
2025/07 | GitHub | Area | Text Query
Method
Approach
Speed
Ratio
ICCV Star
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim
2025/07 | GitHub, Project Page | Area Stage | Text Query
Method
Ratio
Train_Infer
arXiv Star
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
2025/07 | GitHub, Model | Area Stage | Text Query
Approach
Speed
Ratio
Train_Infer
arXiv
EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent
Jiaao Li, Kaiyuan Li, Chen Gao, Yong Li, Xinlei Chen
2025/07 | - | Area Stage | Text Query
Method
Ratio
Train_Infer
arXiv Star
Beyond Token Pruning: Operation Pruning in Vision-Language Models
Aoming Liu, Reuben Tan, Boqing Gong, Bryan A. Plummer
2025/07 | GitHub | Area Stage | Text Query
Method
Ratio
Train_Infer
arXiv
LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models
Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng
2025/07 | - | Area Area Stage | Text Query
Method
Approach
Speed
Ratio
arXiv
Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment
Rui Xu, Yunke Wang, Yong Luo, Bo Du
2025/06 | - | Area Stage Stage | Text Query
Method Method
Approach
Speed
Ratio
arXiv Star
Learning Compact Vision Tokens for Efficient Large Multimodal Models
Hao Tang, Chengchao Shen
2025/06 | GitHub, Model | Area Stage | Text Query
Method
Approach
Speed
Ratio
NeurIPS Star
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang
2025/06 | GitHub, Project Page | Image Video | -
NeurIPS Star
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding
Yunzhu Zhang, Yu Lu, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu
2025/06 | GitHub, Project Page, Model | - | -
arXiv Star
Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification
Minghao Qin, Xiangrui Liu, Zhengyang Liang, Yan Shu, Huaying Yuan, Juenjie Zhou, Shitao Xiao, Bo Zhao, Zheng Liu
2025/06 | GitHub | Area Stage Stage | Text Query
Method Method
Approach
Speed Speed
Ratio
arXiv
Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective
Lei Lei, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen, Tong Xu
2025/06-Area Area StageText Query
Method
Approach
Speed
ICCV Star
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
Jiahui Wang, Zuyan Liu, Yongming Rao, Jiwen Lu
2025/06GitHub
Project Page
Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer
ICCV Star
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams
Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Xiaojie Jin
2025/06GitHub
Project Page
Model
Area StageText Query
Method
Ratio
Train_Infer
arXiv Star
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou
2025/06GitHub
Model
Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer
arXiv
GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models
Ruiguang Pei, Weiqing Sun, Zhihui Fu, Jun Wang
2025/06-Area StageText Query
Method Method
Approach
Speed
Ratio
arXiv
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding
Hongzhi Zhang, Jingyuan Zhang, Xingguang Ji, Qi Wang, Fuzheng Zhang
2025/06-Area StageText Query
Method
Approach
Speed
Ratio
arXiv Star
Seed1.5-VL Technical Report
ByteDance Seed
2025/05GitHub
Demo
Homepage
Area Area
Stage
Text Query
Method
Approach
Speed
Speed
Ratio
arXiv Star
Token Reduction Should Go Beyond Efficiency in Generative Models – From Vision, Language to Multimodality
Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik
2025/05GitHub
--
ICML Star
CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models
Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen
2025/05GitHub
Area Area StageText Query
Method
Approach
Speed
Ratio
NeurIPS Star
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen
2025/05GitHub
Area StageText Query
Method
Approach
Speed
arXiv
ToDRE: Visual Token Pruning via Diversity and Task Relevance for Multimodal LLMs
Duo Li, Zuhao Yang, Shijian Lu
2025/05-Area Area Stage StageText Query
Method
Approach
Speed
Ratio
arXiv Star
HoliTom: Holistic Token Merging for Fast Video Large Language Models
Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang
2025/05GitHub
Project Page
Area Stage StageText Query
Method
Approach
Speed
Ratio
arXiv Star
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu
2025/05GitHub
Area Stage StageText Query
Method Method
Approach
Speed
Ratio
Train_Infer
arXiv
AdaTP: Attention-Debiased Token Pruning for Video Large Language Models
Fengyuan Sun, Leqi Shen, Hui Chen, Sicheng Zhao, Jungong Han, Guiguang Ding
2025/05-Area Stage StageText Query
Method
Approach
Speed
Ratio
Train_Infer
arXiv Star
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Shilin Yan, Jiaming Han, Joey Tsai, Hongwei Xue, Rongyao Fang, Lingyi Hong, Ziyu Guo, Ray Zhang
2025/05GitHub
Area Stage StageText Query
Method
Approach
Speed
Ratio
Train_Infer
arXiv
Clapper: Compact Learning and Video Representation in VLMs
Lingyu Kong, Hongzhi Zhang, Jingyuan Zhang, Jianzhao Huang, Kunze Li, Qi Wang, Fuzheng Zhang
2025/05-Area Stage StageText Query
Method Method
Ratio
Train_Infer
arXiv Star
Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Xuyang Liu, Yiyu Wang, Junpeng Ma, Linfeng Zhang
2025/05GitHub
Area StageText Query
Method
Ratio
Train_Infer
arXiv
Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning
Bonan Li, Zicheng Zhang, Songhua Liu, Weihao Yu, Xinchao Wang
2025/05-Area StageMethod
Method
Approach
Speed
NeurIPS
Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naïve Integration via Multi-Objective Balanced Covering
Yangfu Li, Hongjian Zhan, Tianyi Chen, Qi Liu, Yue Lu
2025/05---
ECAI
Lossless Token Merging Even Without Fine-Tuning in Vision Transformers
Jaeyeon Lee, Dong-Wan Choi
2025/05-Stage-
arXiv
Token Sequence Compression for Efficient Multimodal Computing
Yasmine Omri, Parth Shroff, Thierry Tambe
2025/04-Area StageText Query
Method
Approach
Speed
Ratio
arXiv
FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding
Yanan Guo, Wenhui Dong, Jun Song, Shiding Zhu, Xuan Zhang, Hanqing Yang, Yingbo Wang, Yang Du, Xianing Chen, Bo Zheng
2025/04-Area StageText Query
Method
Ratio
Train_Infer
ACL 2025 Findings Star
TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models
Jaewoo Lee, Keyang Xuan, Chanakya Ekbote, Sandeep Polisetty, Yi R. Fung, Paul Pu Liang
2025/04GitHub
-Stage
arXiv Star
VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning
Run Luo, Renke Shan, Longze Chen, Ziqiang Liu, Lu Wang, Min Yang, Xiaobo Xia
2025/04GitHub
Area StageText Query
Method
Approach
Speed
Speed
Ratio
Train_Infer
ACMMM Star
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun
2025/04GitHub
Project Page
Dataset
Model
Area StageText Query
Method
Approach Approach
Speed
Ratio
Train_Infer Train_Infer
arXiv Star
DYMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong, Silvio Savarese, Heng Ji, Ran Xu
2025/04GitHub
Project Page Model
Area Area Stage StageText Query
Method Method
Approach
Speed
Ratio
arXiv Star
Quicksviewer: An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
Ji Qi, Yuan Yao, Yushi Bai, Bin Xu, Juanzi Li, Zhiyuan Liu, Tat-Seng Chua
2025/04GitHub
Project Page
Area StageText Query
Method
Method
Ratio
Train_Infer Train_Infer
CVPR Star
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models
Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, Aymen Shabou
2025/04GitHub
Area Area StageText Query
Method
Method
Approach
Speed
Ratio
Train_Infer
arXiv
QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA
Shuai Li, Jian Xu, Xiao-Hui Li, Chao Deng, Lin-Lin Huang
2025/04-Area StageText Query
Method
Ratio
Train_Infer
NeurIPS Star
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding
2025/03GitHub
Area StageText Query
Method
Ratio
Train_Infer
arXiv Star
Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models
Bozhi Luan, Wengang Zhou, Hao Feng, Zhe Wang, Xiaosong Li, Houqiang Li
2025/03GitHub
--
arXiv
Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory
Saket Gurukar, Asim Kadav
2025/03-Area StageText Query
Method
Approach
Speed
Ratio
arXiv Star
TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models
Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, Tao Chen
2025/03GitHub
--
ICCV Star
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
Weili Zeng, Ziyuan Huang, Kaixiang Ji, Yichao Yan
2025/03GitHub
Area Stage StageText Query
Method Method
Approach
Speed Speed
Ratio
CVPR
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, Bo Yuan
2025/03-Area Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer
arXiv Star
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
Dongchen Lu, Yuyao Sun, Zilu Zhang, Leping Huang, Jianliang Zeng, Mao Shu, Huo Cao
2025/03GitHub
Model
Area Stage Stage StageText Query
Method
Ratio
Train_Infer Train_Infer
arXiv Star
Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Model
Haichao Zhang, Yun Fu
2025/03GitHub
Project Page
Model
Area StageText Query
Ratio
Train_Infer
CVPR Star
HICom: Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, Hongtao Xie
2025/03GitHub
Model
Area StageText Query
Method
Ratio
Train_Infer
arXiv Star
Similarity-Aware Token Pruning: Your VLM but Faster
Ahmadreza Jeddi, Negin Baghbanzadeh, Elham Dolatabadi, Babak Taati
2025/03GitHub
Area Stage StageText Query
Method
Approach
Speed
Ratio
Train_Infer
arXiv Star
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu, Yan Shu, Zheng Liu, Ao Li, Yang Tian, Bo Zhao
2025/03GitHub
Area Stage StageText Query
Method
Approach
Speed Speed
arXiv
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs
Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon
2025/03Project PageArea StageText Query
Method Method
Ratio
Train_Infer Train_Infer
CVPR Star
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang
2025/03GitHub
Area Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer
ICCV Star
When LVLM Meets Large RS Imagery: Coarse-to-Fine Text-Guided Token Pruning
Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li
2025/03GitHub
Dataset
--
arXiv
Silent Hazards of Token Reduction in Vision-Language Models
Yizheng Sun, Hao Li, Chang Xu, Hongpeng Zhou, Chenghua Lin, Riza Batista-Navarro, Jingyuan Sun
2025/03---
arXiv Star
Prune and Merge: Efficient Token Compression For Vision Transformer With Spatial Information Preserved
Junzhu Mao, Yang Shen, Jinyang Guo, Yazhou Yao, Xiansheng Hua
2025/03GitHub
Area StageText Query
Method Method
Ratio
Train_Infer
arXiv Star
Qwen2.5-VL Technical Report
Qwen Team
2025/02GitHub
Model
Area Area StageText Query
Method
Approach
Speed
arXiv
PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models
Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang
2025/02---
EMNLP Star
Stop Looking for Important Tokens in Multimodal Language Models for Token Pruning
Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang
2025/02GitHub
Area StageText Query
Method
Approach
Speed
arXiv
FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression
Jianjian Li, Junquan Fan, Feng Tang, Gang Huang, Shitao Zhu, Songlin Liu, Nian Xie, Wulong Liu, Yong Liao
2025/02-Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer Train_Infer
ACL Findings
Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, Linfeng Zhang
2025/02---
arXiv Star
Beyond Token Compression: A Training-Free Reduction Framework for Efficient Visual Processing in MLLMs
Hongliang Li, Jiaxin Zhang, Wenhui Liao, Dezhi Peng, Kai Ding, Lianwen Jin
2025/01GitHub
Area StageText Query
Ratio
Train_Infer
ICCV Star
FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal LLMs via Visual Registers
Renshan Zhang, Rui Shao, Gongwei Chen, Miao Zhang, Kaiwen Zhou, Weili Guan, Liqiang Nie
2025/01GitHub
Project Page
Area StageText Query
Method
Approach
Speed
NAACL 2025 Findings
LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models
Yizheng Sun, Yanze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, Riza Batista-Navarro
2025/01---
ICCV Star
FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Quétu, Shuai Xiao, Enzo Tartaglione
2025/01GitHub
--
arXiv Star
Compression with Global Guidance: Towards Training-Free High-Resolution MLLMs Acceleration
Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen
2025/01GitHub
--
AAAI Star
What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph
Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou
2025/01GitHub
Area StageText Query
Method
Approach
Speed
arXiv Star
DyRate: Dynamic Token Reduction during Generation for Vision Language Models
Xiaoyu Liang, Chaofeng Guan, Jiaying Lu, Huiyao Chen, Huan Wang, Haoji Hu
2025/01GitHub
Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer
arXiv Star
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao
2025/01GitHub
Area-
arXiv
AdaFV: Rethinking of Visual-Language alignment for VLM acceleration
Jiayi Han, Liang Du, Yiwen Wu, Xiangguo Zhou, Hongwei Du, Weibo Zheng
2025/01-Area StageText Query
Method
Ratio
Train_Infer
arXiv Star
Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
Xuyang Liu, Ziming Wang, Junjie Chen, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Linfeng Zhang, Siteng Huang, Honggang Chen
2025/01GitHub
--
ICLR Star
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng
2025/01GitHub
Model
Area Area StageText Query
Method
Ratio
Train_Infer
ICCV Star
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang
2025/01GitHub
Project Page
Area StageText Query
Method Method
Approach
Speed
Ratio
Train_Infer
arXiv Star
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, Limin Wang
2025/01GitHub
Demo
Model
Area Stage StageText Query
Method
Approach
Speed
Ratio
Train_Infer
arXiv
LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Redundancy Modeling
Ke Wang, Hong Xuan
2024/12-Area StageText Query
Method
Approach
Speed
Speed
Publish Star
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang
2024/12GitHub
Area Area StageMethod
Method
Approach
Speed
arXiv Star
LinVT: Empower Your Image-level Large Language Model to Understand Videos
Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, Zheng Zhao
2024/12GitHub
Area StageText Query
Method Method
Approach
Speed
Ratio
CVPR
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu
2024/12-Area StageText Query
Method
Approach
Speed
AAAI
ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming
Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu
2024/12-Area StageText Query
Method
Approach
Speed
arXiv Star
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
Lianyu Hu, Fanhua Shang, Liang Wan, Wei Feng
2024/12GitHub
Area Area Stage Stage-
ICCV Star
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
Mark Endo, Xiaohan Wang, Serena Yeung-Levy
2024/12GitHub
Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer
arXiv Star
PruneVid: Visual Token Pruning for Efficient Video Large Language Models
Xiaohu Huang, Hao Zhou, Kai Han
2024/12GitHub
Project Page
Area Stage StageText Query
Method Method
Approach
Speed
Ratio
Train_Infer
arXiv Star
RETAKE: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
2024/12GitHub
Area-
ICLR Star
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification
Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaosheng Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Yao Hu, Shaohui Lin
2024/12GitHub
Model-7B
Model-13B
Area StageText Query
Approach
Speed
CVPR Star
FastVLM: Efficient Vision Encoding for Vision Language Models
Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
2024/12GitHub
Area StageText Query
Method
Approach
Speed
Speed
CVPR Star
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai
2024/12GitHub
Model
Area Area Stage StageText Query
Method
Approach
Speed
Publish Star
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang
2024/12GitHub
Area-
CVPR Star
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia
2024/12GitHub
Demo
Area Area Stage StageText Query
Method
Method
Ratio
Train_Infer Train_Infer
ICCV Star
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
Jun Zhang, Desen Meng, Zhengming Zhang, Zhenpeng Huang, Tao Wu, Limin Wang
2024/12GitHub
Model
Area StageText Query
Method
Approach
Speed
Speed
Ratio
Train_Infer Train_Infer
ICCV Star
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang
2024/12GitHub
Project Page
Area Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer
CVPR
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang
2024/12Project PageArea StageText Query
Method
Approach
Speed
Ratio
Train_Infer
CVPR
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu
2024/12-Area Area StageText Query
Method
Approach
Speed
arXiv
Token Cropr: Faster ViTs for Quite a Few Tasks
Benjamin Bergner, Christoph Lippert, Aravindh Mahendran
2024/12-Area StageText Query
Method
Ratio
Train_Infer
Publish Star
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and Empirical Findings
Qiong Wu, Wenhao Lin, Yiyi Zhou, Weihao Ye, Zhanpeng Zen, Xiaoshuai Sun, Rongrong Ji
2024/11GitHub
Area StageText Query
Method
Approach
Speed
arXiv
freePruner: A Training-free Approach for Large Multimodal Model Acceleration
Bingxin Xu, Yuzhang Shang, Yunhao Ge, Qian Lou, Yan Yan
2024/11-Area StageText Query
Method
Approach
Speed
arXiv
FoPru: Focal Pruning for Efficient Large Vision-Language Models
Lei Jiang, Weizhe Huang, Tongxuan Liu, Yuting Zeng, Jing Li, Lechao Cheng, Xiaohua Xu
2024/11-Area StageText Query
Method
Approach
Speed
arXiv Star
Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration
Yuhang Han, Xuyang Liu, Zihan Zhang, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang
2024/11GitHub
Project Page
Area Area
Stage Stage
Text Query
Method
Approach
Speed
Ratio
Train_Infer
CVPR Star
DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang
2024/11GitHub
Area Stage StageText Query
Method Method
Approach
Speed
Ratio
Train_Infer
arXiv
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo
2024/11-Area Stage StageText Query
Method Method
Approach
Speed
Ratio
Train_Infer
arXiv Star
Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model
Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang
2024/11GitHub
Area Area Stage StageText Query
Method Method
Approach
Speed
Ratio
Train_Infer
NeurIPS Star
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris M. Kitani, László Jeni
2024/11GitHub
Project Page
Area StageText Query
Ratio
Train_Infer
EMNLP Findings
Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM
SooHwan Eom, Jay Shim, Gwanhyeong Koo, Haebin Na, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo
2024/11-Stage-
ICLR Star
Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters
Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter
2024/11GitHub
Area StageText Query
Method
Ratio
Train_Infer
arXiv
Efficient Multi-modal Large Language Models via Visual Token Grouping
Minbin Huang, Runhui Huang, Han Shi, Yimeng Chen, Chuanyang Zheng, Xiangguo Sun, Xin Jiang, Zhenguo Li, Hong Cheng
2024/11-Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer
arXiv
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Jongwoo Park, Kanchana Ranasinghe, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
2024/10-Area StageText Query
Approach
Speed
Speed
ICLR
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification
Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
2024/10-AreaSpeed
Publish Star
Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation
Zixin Wang, Dong Gong, Sen Wang, Zi Huang, Yadan Luo
2024/10GitHub
Area StageText Query
Method Method
Approach
Speed
arXiv Star
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
2024/10GitHub
Project Page Model
Area-
NeurIPS
Video Token Merging for Long-form Video Understanding
Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, Xinyu Li
2024/10-Area StageText Query
Method
Ratio
Train_Infer
ICML Star
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra
2024/10GitHub
Project Page
Model
Area Stage StageText Query
Method Method
Ratio
Train_Infer
CVPR Star
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin
2024/10GitHub
Area Area StageText Query
Method
Approach
Speed
Speed
Ratio
Train_Infer
Train_Infer
arXiv
Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, Mahyar Najibi
2024/10-Area StageText Query
Method
Ratio
Train_Infer
arXiv
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma
2024/10-Area StageText Query
Method Method
Ratio
Train_Infer
arXiv
PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models
Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li
2024/10-Area Stage StageText Query
Method Method
Ratio
Train_Infer
ICML Star
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang
2024/10GitHub
Project Page
Area Area StageText Query
Method Method
Approach
Speed
Ratio
ICLR Star
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning
2024/10GitHub
Project Page
Model
Benchmark
Area Area StageText Query
Method
Ratio
Train_Infer
ACL Star
AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity
Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su
2024/10GitHub
Area StageText Query
Method Method
Ratio
Train_Infer
ACL Star
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou
2024/09GitHub
Model
Area StageText Query
Method
Ratio
Train_Infer
CVPR Star
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, Bo Zhao
2024/09GitHub
Area StageText Query
Method
Approach
Speed
AAAI Star
Fit and Prune: Fast and Training-free Visual Token Pruning for Multimodal Large Language Models
Weihao Ye, Qiong Wu, Wenhao Lin, Yiyi Zhou
2024/09GitHub
Area StageText Query
Method
Approach
Speed
COLING Star
Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs
Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang
2024/09GitHub
Area Area StageText Query
Method
Method
Ratio
Train_Infer Train_Infer
arXiv Star
NVLM: Open Frontier-Class Multimodal LLMs
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
2024/09GitHub
Project Page
Model
Area StageText Query
Method
Approach
Speed
Speed
arXiv
TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Consideration
Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong
2024/09-Area StageText Query
Method
Ratio
Train_Infer
AAAI Star
TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
Dawei Yan, Pengcheng Li, Yang Li, Hao Chen, Qingguo Chen, Weihua Luo, Wei Dong, Qingsen Yan, Haokui Zhang, Chunhua Shen
2024/09GitHub
Area StageText Query
Method
Ratio
Train_Infer
ICLR Star
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, Guiguang Ding
2024/09GitHub
Area StageText Query
Method
Ratio
Train_Infer
AAAI Star
Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information
Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, Cheng-Lin Liu
2024/09GitHub
Area StageText Query
Method Method
Ratio
Train_Infer Train_Infer
LVP: Language-guide Visual Projector for Efficient Multimodal LLM
Anonymous Authors
2024/09-Area Area StageText Query
Method
Method
Ratio
Train_Infer
ECCV Star
Instruction Tuning-free Visual Token Complement for Multimodal LLMs
Dongsheng Wang, Jiequan Cui, Miaoge Li, Wang Lin, Bo Chen, Hanwang Zhang
2024/08-Area StageText Query
Approach
Speed
arXiv Star
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
2024/08GitHub
Model
Area Area
Stage
Text Query
Method
Approach
Speed
Speed
Ratio
Train_Infer
arXiv Star
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
2024/08GitHub
Project Page
Model
Area Area-
AAAI Star
HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments
Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji
2024/08GitHub
Area StageText Query
Method
Ratio
Train_Infer
arXiv
Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer
Shuai Peng, Di Fu, Baole Wei, Yong Cao, Liangcai Gao, Zhi Tang
2024/08-Area StageText Query
Method
Approach
Speed
Speed
ECCV Star
Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning
Shibo Jie, Yehui Tang, Jianyuan Guo, Zhi-Hong Deng, Kai Han, Yunhe Wang
2024/08GitHub
Area StageText Query
Approach
Speed
Speed
arXiv
Dynamic and Compressive Adaptation of Transformers From Images to Videos
Guozhen Zhang, Jingyu Liu, Shengming Cao, Xiaotong Zhao, Kevin Zhao, Kai Ma, Limin Wang
2024/08-Area-
IJCV Star
TokenPacker: Efficient Visual Projector for Multimodal LLM
Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang
2024/07GitHub
Model
Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer
arXiv Star
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan
2024/07GitHub
Area StageText Query
Method
Approach
Speed
arXiv Star
Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie
2024/07GitHub
Area StageText Query
Method Method
Approach Approach
Speed
CVPR
HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
Runhui Huang, Xinpeng Ding, Chunwei Wang, Jianhua Han, Yulong Liu, Hengshuang Zhao, Hang Xu, Lu Hou, Wei Zhang, Xiaodan Liang
2024/07-Area StageText Query
Method
Approach
Speed
Ratio
ECCV
LookupViT: Compressing visual information to a limited number of tokens
Rajat Koner, Gagan Jain, Prateek Jain, Volker Tresp, Sujoy Paul
2024/07-Area Area StageText Query
Method
Approach
Speed
Ratio
NeurIPS Star
Efficient Large Multi-modal Models via Visual Context Compression
Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille
2024/06GitHub
Area Area StageText Query
Method
Approach
Speed
Speed
Ratio
arXiv Star
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
2024/06GitHub
Model
Demo
Area StageText Query
Method
Approach
Speed
Ratio
Publish Star
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan
2024/06GitHub
Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer
CVPR Star
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Yansong Tang
2024/06GitHub
Project Page
Area Area StageText Query
Approach
Speed
Ratio
AAAI Star
Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji
2024/05GitHub
Area StageText Query
Method
Approach
Speed Speed
Ratio
Train_Infer
arXiv Star
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou
2024/05GitHub
Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer Train_Infer
NeurIPS Star
Matryoshka Query Transformer for Large Vision-Language Models
Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang
2024/05GitHub
Model
Project Page
Area StageText Query
Method
Method
Approach
Speed
Ratio
Train_Infer Train_Infer
ICLR Star
Matryoshka Multimodal Models
Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee
2024/05GitHub
Project Page
Area Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer Train_Infer
arXiv Star
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, et al.
2024/04GitHub
Model
Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer Train_Infer
ECCV Star
LongVLM: Efficient Long Video Understanding via Large Language Models
Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang
2024/04GitHub
Area StageText Query
Method
Approach
Speed
arXiv Star
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng
2024/04GitHub
Project Page
Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer Train_Infer
arXiv
CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference
Ruqi Liao, Chuqing Zhao, Jin Li, Weiqi Feng
2024/04-Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer
ICCV Star
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan
2024/03GitHub
Project Page
Area Area StageText Query
Method Method
Approach Approach
Speed
Ratio
Train_Infer Train_Infer
arXiv Star
PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation
Yizhe Xiong, Hui Chen, Tianxiang Hao, Zijia Lin, Jungong Han, Yuesong Zhang, Guoxin Wang, Yongjun Bao, Guiguang Ding
2024/03GitHub
Area StageText Query
Method
Approach
Speed Speed
Ratio
Train_Infer Train_Infer
arXiv Star
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang
2024/03GitHub
Area Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer
CVPR 2024 Star
MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Vision-Language Tasks
Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen
2024/03GitHub
Area StageText Query
Method
Approach
Speed
Train_Infer Train_Infer
arXiv Star
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen
2024/02GitHub
Model
Area StageText Query
Method
Approach
Speed
Ratio
Train_Infer Train_Infer
CVPR 2024 Star
Honeybee: Locality-enhanced Projector for Multimodal LLM
Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh
2023/12GitHub
Area StageText Query
Method Method
Approach
Speed
Ratio
Train_Infer
ECCV Star
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Yanwei Li, Chengyao Wang, Jiaya Jia
2023/11GitHub
Project Page
Model
Area Area Stage StageText Query
Method
Approach
Speed
Speed
Ratio
Train_Infer
CVPR 2024 Star
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, Li Yuan
2023/11GitHub
Model
Area Area StageText Query
Method
Ratio
Train_Infer
EMNLP 2023 Findings Star
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou
2023/10GitHub
Area StageText Query
Method
Approach
Speed
Speed
Ratio
arXiv Star
PPT: Token Pruning and Pooling for Efficient Vision Transformers
Xinjian Wu, Fanhu Zeng, Xiudong Wang, Xinghao Chen
2023/10GitHub
Area StageText Query
Method Method
Ratio
Train_Infer
arXiv Star
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou
2023/08GitHub
Area StageText Query
Method
Approach
Speed
Speed
CVPR Star
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang
2023/07GitHub
Project Page
Area StageText Query
Method
Approach
Speed
ACL 2024 Star
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
2023/06GitHub
Area StageText Query
Method
Approach
Speed
Speed
arXiv Star
DiffRate: Differentiable Compression Rate for Efficient Vision Transformers
Mengzhao Chen, Wenqi Shao, Peng Xu, Mingbao Lin, Kaipeng Zhang, Fei Chao, Rongrong Ji, Yu Qiao, Ping Luo
2023/05GitHub
--
COLING 2024
SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models
Zekun Wang, Jingchang Chen, Wangchunshu Zhou, Haichao Zhu, Jiafeng Liang, Liping Shan, Ming Liu, Dongliang Xu, Qing Yang, Bing Qin
2023/05-Area StageText Query
Method
Approach
Speed
ACL 2023 Star
PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi
2023/05GitHub
AreaText Query
Method Method
Approach
Speed Speed
NeurIPS 2023 Star
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
2023/05GitHub
Area StageText Query
Method
Approach
Speed
ICML 2024 Star
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, Jiaqi Wang
2023/05GitHub
Model
Area Stage StageText Query
Method
Approach
Speed
arXiv Star
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
2023/04GitHub
Project Page
Area StageText Query
Method
Approach
Speed
arXiv Star
Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers
Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang
2023/04GitHub
--
ICML 2023 Star
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
2023/01GitHub
Area StageText Query
Method
Approach
Speed
CVPR 2023 Star
Token Turing Machines
Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab
2022/11GitHub
Area Area StageText Query
Speed
Speed
ICLR 2023 Star
Token Merging: Your ViT But Faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman
2022/10GitHub
Image Early ViT CompressionText Query
Method
Approach Approach
Speed
arXiv
Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention
Xiangcheng Liu, Tianyi Wu, Guodong Guo
2022/09---
NeurIPS 2022 Star
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al.
2022/04GitHub
Area StageText Query
Method
Approach
Speed
arXiv Star
EViT: Expediting Vision Transformers via Token Reorganizations
Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie
2022/02GitHub
--
arXiv Star
Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space
Arnav Chavan, Zhiqiang Shen, Zhuang Liu, Zechun Liu, Kwang-Ting Cheng, Eric Xing
2022/01GitHub
--
arXiv Star
A-ViT: Adaptive Tokens for Efficient Vision Transformer
Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, Pavlo Molchanov
2021/12GitHub
Project Page
--
arXiv Star
ATS: Adaptive Token Sampling For Efficient Vision Transformers
Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, Juergen Gall
2021/11GitHub
Project Page
--
AAAI 2023 Star
Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, Xing Sun
2021/08GitHub
--
arXiv
Patch Slimming for Efficient Vision Transformers
Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, Dacheng Tao
2021/06---
arXiv Star
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh
2021/06GitHub
Project Page
--

📈 Benchmark (Coming Soon)

We compiled the image and video understanding benchmarks commonly used in token pruning studies and built a unified evaluation framework on top of them. With this framework, users can evaluate a method on 26 benchmarks (15 image-based and 11 video-based) in a single pass, which gives a broad overview of the method's overall capabilities.

The dataset and evaluation scripts are ready and will be released here shortly.

📌 Citation

If you find our paper or this resource helpful, please consider citing:

@misc{yao2026towards,
  title        = {Towards Efficient Multimodal Large Language Models: A Survey on Token Compression},
  author       = {Yao, Linli and Xing, Long and Shi, Yang and Li, Sida and Liu, Yuanxin and
                  Dong, Yuhao and Zhang, Yi-Fan and Li, Lei and Dong, Qingxiu and Dong, Xiaoyi and
                  Huang, Qidong and Wang, Haotian and Wu, Feng and Zhang, Yuanxing and Wan, Pengfei and
                  Lin, Zhouchen and Sun, Xu},
  year         = {2026},
  month        = jan,
  howpublished = {TechRxiv},
  doi          = {10.36227/techrxiv.176823010.07236701/v1},
  url          = {https://doi.org/10.36227/techrxiv.176823010.07236701/v1}
}

⭐ Star History

Star History Chart


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.