PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models
Mouxiao Huang*, Borui Jiang*✉, Dehua Zheng, Hailin Hu✉, Kai Han, Xinghao Chen✉
* Equal contribution
⭐ If you find this useful, a star would be appreciated
🗺 Roadmap
- ✅ Paper accepted at ICLR 2026
- ✅ Core PPE implementation released
- ✅ Training and inference pipeline
- 🔜 Additional benchmark support
- 🔜 Cascade compression for image inputs
- 🔜 Finetuned checkpoints
- 🔜 Extended backbone support
- 🔜 HuggingFace integration
📑 Table of Contents
- 🌟 Highlights
- 📦 Installation
- 🚀 Usage
- 📊 Benchmarks
- 📌 Citation
- ❓ FAQ
- 🤝 Contributing
- 📬 Contact
- ⚖️ License
- 🙏 Acknowledgements
🌟 Highlights
- Plug-and-Play & Parameter-Free: Works in a plug-and-play manner, without modifying original token selection or aggregation mechanisms.
- Preserve Positional Information: Preserves richer positional cues under the same reduction ratio.
- Training-Free & SFT: Works training-free, where performance mainly depends on the underlying compression method, and improves further when supervised fine-tuning (SFT) is allowed.
- Broad Compatibility: Easily combines with various token compression methods.
- Cascade Clustering Support: Facilitates multi-stage compression within the LLM while maintaining performance.
📦 Installation
```bash
pip install -r requirements.txt
```
Ascend NPU: uncomment `torch_npu` in `requirements.txt`, then run the same command.
🚀 Usage
Pretrained Model
We conduct experiments primarily on Qwen2.5-VL-3B-Instruct. You can download the official pretrained model from here.
📁 Dataset
Training Data:
Due to computational limitations, our supervised fine-tuning (SFT) dataset is constructed from public sources:
- LLaVA-Video-178K: 120K sampled instances
- LLaVA-OneVision: 300K sampled instances
You may use the full datasets or customize your own subsets.
For the expected data structure, please refer to ./data/demo.json or Qwen-VL-Series-Finetune.
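As a rough illustration of the conversation-style layout such a file typically follows (field names below are illustrative assumptions; ./data/demo.json remains the authoritative reference):

```python
import json

# Purely illustrative sample entry with hypothetical field names; check
# ./data/demo.json for the exact schema expected by the training pipeline.
sample = {
    "image": "images/000001.jpg",  # or e.g. "video": "videos/000001.mp4"
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in this picture?"},
        {"from": "gpt", "value": "A dog running along the beach."},
    ],
}
print(json.dumps([sample], indent=2, ensure_ascii=False))
```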
Evaluation Benchmarks:
- Image Tasks: MMBench, SQA, TextVQA, ChartQA, DocVQA, OCRBench
- Video Tasks: VideoMME, NeXT-QA, SEED-Bench-Video, MVBench
All benchmarks can be downloaded from their official sources; please follow the original instructions for setup.
Many original datasets are provided in .parquet format. However, we convert most of them into .json files with images stored separately (personal preference).
We also provide several example annotation files under ./data/XXX_benchmark to illustrate the expected data format and directory structure.
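If you want to follow the same convention, a conversion along these lines works. This is a minimal sketch: the column names `image`, `question`, and `answer`, as well as the HF-style image byte layout, are assumptions and will differ per benchmark.

```python
import io
import json
import os

import pandas as pd
from PIL import Image

# Minimal .parquet -> .json conversion sketch; adapt column names per benchmark.
df = pd.read_parquet("benchmark.parquet")
os.makedirs("images", exist_ok=True)

records = []
for i, row in df.iterrows():
    image_path = f"images/{i:06d}.jpg"
    # Many HF-style parquet files store images as {"bytes": ..., "path": ...}.
    Image.open(io.BytesIO(row["image"]["bytes"])).convert("RGB").save(image_path)
    records.append({
        "index": int(i),
        "question": row["question"],
        "answer": row["answer"],
        "image_path": image_path,
    })

with open("benchmark.json", "w") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)
```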
Custom Benchmarks:
We develop a simple, user-friendly pipeline that ensures inference is fully compatible with the training forward pass. To add a new benchmark, you can follow the implementation of existing ones:
- Implement the benchmark logic in `./src/evaluate/benchmarks/NEW_BENCH.py`:

  ```python
  class CustomDataset(object):
      modality = "image"  # or "video"

      def __init__(self, image_path="", anno_path="", pre_prompt="", post_prompt=""):
          # Load your annotations here
          self.data = []

      def __len__(self):
          return len(self.data)

      def __getitem__(self, idx):
          # 1. Prepare common fields
          res = {
              "index": idx,
              "prompt": "Your formatted prompt",
              "GT": "Ground truth answer"
          }

          # 2. Add media: supports image (PIL/path) or video (path)
          # For image benchmarks:
          res.update({
              "image": image,  # supports a PIL.Image object OR an image path
              "image_path": image_path
          })
          # OR for video benchmarks:
          # res.update({
          #     "video_path": video_path
          # })
          return res
  ```

- Implement the corresponding evaluation metrics in `./src/evaluate/benchmarks/metrics/eval_NEW_BENCH.py`.

- Update `./src/evaluate/benchmarks/benchmarks_config.py`.
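A quick way to sanity-check a new benchmark class before wiring it into the config. The module path and constructor arguments below follow the template above and are placeholders; adjust them to your actual `NEW_BENCH.py`.

```python
# Hypothetical sanity check for a new benchmark; adjust the module path and
# constructor arguments to match your NEW_BENCH.py.
from src.evaluate.benchmarks.NEW_BENCH import CustomDataset

ds = CustomDataset(image_path="/path/to/new_bench/images",
                   anno_path="/path/to/new_bench/annotations.json")
print(f"{len(ds)} samples, modality = {ds.modality}")

item = ds[0]
print(item["prompt"])
print(item["GT"])
```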
🏋️‍♀️ Training
Run Training
- `MODEL_PATH`: path to the pretrained model
- `DATA_ROOT`: root directory of your training data
- `DATA_JSON`: JSON file describing the dataset (examples in `./data/demo.json`)
```bash
bash scripts/run_sft.sh
# For debugging (single GPU/NPU, no deepspeed, supports breakpoint):
# bash scripts/run_sft.sh debug
```
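For a one-off run it can be convenient to pass these variables on the command line, assuming `scripts/run_sft.sh` reads them from the environment in the same style as `scripts/run_infer.sh`; otherwise, set them inside the script:

```bash
# Assumed env-var style invocation; paths are placeholders.
MODEL_PATH=/path/to/Qwen2.5-VL-3B-Instruct \
DATA_ROOT=/path/to/train_data \
DATA_JSON=./data/demo.json \
bash scripts/run_sft.sh
```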
🧪 Evaluation
- `MODEL_PATH`: path to the model checkpoint
- `BENCHMARKS`: list of benchmarks for evaluation
- `PPE_CONFIG`: configuration options for different compression settings
- ⚠️ Reminder: edit `DATASET_CONFIG` in `./src/evaluate/benchmarks_config.py` according to your local setup.
Run Inference
```bash
MODEL_PATH=/path/to/model bash scripts/run_infer.sh
# For debugging (single GPU/NPU, supports breakpoint):
# bash scripts/run_infer.sh debug
```
📊 Benchmarks
🖼️ Image Tasks
1. Experiments on Qwen2.5-VL-3B-Instruct
| Qwen2.5-VL-3B-Instruct | Method | MMBench (EN) | MMBench (CN) | SQA* | TextVQA | DocVQA | OCRBench | ChartQA | Red. Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Training-Free | Vanilla (report) | 79.10 | 78.10 | 76.14 | 79.30 | 93.90 | 797 | 84.00 | 0% |
| | Chat-UniVi | 81.50 | 80.06 | 74.35 | 37.60 | 19.58 | 307 | 18.72 | 55% |
| | Chat-UniVi + PPE | 82.28 (+0.78) | 81.43 (+1.37) | 74.58 (+0.23) | 73.78 (+36.18) | 66.16 (+46.58) | 598 (+291) | 67.08 (+48.36) | 55% |
| SFT | Dense | 85.89 | 86.07 | 79.39 | 79.50 | 89.44 | 761 | 79.96 | 0% |
| | Chat-UniVi | 84.92 | 83.71 | 77.48 | 57.66 | 52.48 | 535 | 49.60 | 55% |
| | Chat-UniVi + PPE | 84.73 (-0.19) | 84.87 (+1.16) | 78.34 (+0.86) | 77.14 (+19.48) | 76.79 (+24.31) | 691 (+156) | 74.52 (+24.92) | 55% |
* denotes reproduction results, as these benchmarks are not reported in the original paper.
2. Experiments on Qwen2.5-VL-7B-Instruct
We further extended our experiments to the 7B model. However, due to time and resource constraints, we trained it on only 1/5 of the data used for the 3B model.
| Qwen2.5-VL-7B-Instruct | Method | MMBench (EN) | MMBench (CN) | SQA* | TextVQA | DocVQA | OCRBench | ChartQA | Red. Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Training-Free | Vanilla (report) | 83.50 | 83.40 | 85.52 | 84.90 | 95.70 | 864 | 87.30 | 0% |
| | Chat-UniVi | 83.23 | 80.18 | 80.49 | 35.82 | 27.06 | 479 | 19.92 | 55% |
| | Chat-UniVi + PPE | 83.58 (+0.35) | 82.35 (+2.17) | 81.59 (+1.10) | 63.44 (+27.62) | 66.42 (+39.36) | 577 (+98) | 46.72 (+29.8) | 55% |
| SFT | Dense | 86.90 | 85.35 | 84.83 | 87.20 | 92.97 | 826 | 86.32 | 0% |
| | Chat-UniVi | 86.23 | 84.25 | 82.40 | 54.92 | 50.01 | 584 | 43.96 | 55% |
| | Chat-UniVi + PPE | 86.26 (+0.03) | 84.85 (+0.60) | 83.56 (+1.16) | 82.46 (+27.54) | 85.84 (+35.83) | 764 (+180) | 78.88 (+34.92) | 55% |
* denotes reproduction results, as these benchmarks are not reported in the original paper.
🎬 Video Tasks
| Qwen2.5-VL-3B-Instruct | Method | VideoMME (w/o subs) | VideoMME (w/ subs) | NeXT-QA (MC) | NeXT-QA (OE) | SEED-Bench-Video | MVBench | Avg. | Red. Ratio |
|---|---|---|---|---|---|---|---|---|---|
| SFT | Dense | 57.81 | 57.96 | 78.20 | 31.65 | 57.60 | 67.90 | 58.52 | 0% |
| | Chat-UniVi | 57.22 | 57.22 | 77.63 | 25.37 | 56.08 | 66.90 | 56.74 | 55% |
| | Chat-UniVi + PPE | 58.70 (+1.48) | 59.07 (+1.85) | 78.42 (+0.42) | 32.61 (+7.24) | 55.98 (-0.10) | 67.38 (+0.48) | 58.69 (+1.95) | 55% |
| | + PPE Cascade | 58.48 | 58.52 | 78.20 | 32.20 | 56.11 | 67.35 | 58.48 | 90% |
📌 Citation
If you find this work helpful, please consider citing us:
```bibtex
@article{huang2025ppe,
  title={PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models},
  author={Mouxiao Huang and Borui Jiang and Dehua Zheng and Hailin Hu and Kai Han and Xinghao Chen},
  journal={arXiv preprint arXiv:2510.22936},
  year={2025}
}
```
❓ FAQ
Q1: Why are some baseline comparisons missing from this repo?
A: For convenience, we ran comparisons directly against the official implementations of PACT, ToMe, VisionZip, etc.
Q2: Why are some benchmarks missing from this repo?
A: Due to internal compliance and the lengthy review process required for exporting code from our corporate environment, some benchmark implementations are currently unavailable. Even though these are based on open-source standards, the export process remains restrictive. However, we have designed the pipeline to be highly extensible. We encourage you to implement your own benchmarks using our straightforward template; it is designed for a seamless, plug-and-play experience.
Q3: Why is K=8 by default?
A: This version is adapted to Qwen2.5-VL, which originally uses 3D-MRoPE (mrope_section=[16, 24, 24]). K=8 works well for both video and image experiments. For experiments strictly aligned with the paper's image-only results, please manually switch to 2D-MRoPE.
Q4: What if we set K=32 when using PPE with mrope_section=[16, 24, 24]?
A: It falls back to a repeating [1(T), 1(H), 1(W), ...] pattern. Since 64 is not divisible by 3, this yields only 21 complete T/H/W triplets (64 = 3 × 21 + 1), leaving the remainder incomplete. Although not the intended implementation, the performance remains decent because the compressed token still captures multiple position cues rather than just one.
Q5: Is this the full implementation of the paper?
A: No. The currently released code is a cleaned and re-implemented version optimized for readability. Fully migrating and organizing every single experiment involves a significant amount of redundant manual labor. More importantly, the core idea of PPE is elegantly simple and easy to implement: compressed token RoPE embeddings should represent multiple original positions rather than a single point. Our goal is to provide this key insight to the community to foster further discussion and collaborative exploration.
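For readers who want a concrete starting point, here is a deliberately simplified, hypothetical sketch of that idea (not the repository's implementation): a compressed token keeps K position ids sampled from the tokens it absorbs, one per RoPE channel group, instead of a single collapsed position.

```python
import torch

# Conceptual sketch only, not the repo's implementation: give a compressed token
# K positional anchors drawn from the original tokens it replaces, so its RoPE
# channel groups encode multiple source positions instead of one.
def ppe_position_ids(source_positions: torch.Tensor, K: int = 8) -> torch.Tensor:
    """source_positions: (N,) position ids of the tokens merged into one compressed token.
    Returns (K,) position ids, one per RoPE channel group."""
    n = source_positions.numel()
    # Evenly spaced anchor indices over the merged span (clamped in case N < K).
    idx = torch.linspace(0, n - 1, K).round().long().clamp_(0, n - 1)
    return source_positions[idx]

merged = torch.arange(100, 132)          # e.g. 32 original positions merged into one token
print(ppe_position_ids(merged, K=8))     # 8 preserved positional cues for this token
```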
🤝 Contributing
We welcome contributions from the community! Here's how to get started:
- Fork this repository
- Create a new feature branch: `git checkout -b feature/your-feature-name`
- Make your changes and commit them
- Push your branch and open a pull request
📬 Contact
- 💬 For questions, suggestions, or bug reports, please open an issue on GitHub or email us.
⚖️ License
📄 This project is licensed under the Apache License 2.0.
🙏 Acknowledgements
We build upon the inspiring work of: