PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models
Mouxiao Huang*, Borui Jiang*✉, Dehua Zheng, Hailin Hu✉, Kai Han, Xinghao Chen✉
* Equal contribution
⭐ If you find this useful, a star would be appreciated
🗺 Roadmap
- ✅ Paper accepted at ICLR 2026
- ✅ Core PPE implementation released
- ✅ Training and inference pipeline
- 🔜 Additional benchmark support
- 🔜 Cascade compression for image inputs
- 🔜 Finetuned checkpoints
- 🔜 Extended backbone support
- 🔜 HuggingFace integration
📑 Table of Contents
- 🌟 Highlights
- 📦 Installation
- 🚀 Usage
- 📊 Benchmarks
- 📌 Citation
- ❓ FAQ
- 🤝 Contributing
- 📬 Contact
- ⚖️ License
- 🙏 Acknowledgements
🌟 Highlights
- Plug-and-Play & Parameter-Free: Works in a plug-and-play manner, without modifying original token selection or aggregation mechanisms.
- Preserve Positional Information: Preserves richer positional cues under the same reduction ratio.
- Training-Free & SFT: Works training-free, where performance mainly depends on the underlying compression method, and improves further when supervised fine-tuning (SFT) is allowed.
- Broad Compatibility: Easily combines with various token compression methods.
- Cascade Clustering Support: Facilitates multi-stage compression within the LLM while maintaining performance.
📦 Installation
```bash
pip install -r requirements.txt
```
Ascend NPU: uncomment `torch_npu` in `requirements.txt`, then run the same command.
🚀 Usage
Pretrained Model
We conduct experiments primarily on Qwen2.5-VL-3B-Instruct. You can download the official pretrained model from here.
📁 Dataset
Training Data:
Due to computational limitations, our supervised fine-tuning (SFT) dataset is constructed from public sources:
- LLaVA-Video-178K: 120K sampled instances
- LLaVA-OneVision: 300K sampled instances
You may use the full datasets or customize your own subsets.
For the expected data structure, please refer to ./data/demo.json or Qwen-VL-Series-Finetune.
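As a rough illustration of the conversation-style layout such a file typically follows (field names below are illustrative assumptions; ./data/demo.json remains the authoritative reference):

```python
import json

# Purely illustrative sample entry with hypothetical field names; check
# ./data/demo.json for the exact schema expected by the training pipeline.
sample = {
    "image": "images/000001.jpg",  # or e.g. "video": "videos/000001.mp4"
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in this picture?"},
        {"from": "gpt", "value": "A dog running along the beach."},
    ],
}
print(json.dumps([sample], indent=2, ensure_ascii=False))
```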
Evaluation Benchmarks:
- Image Tasks: MMBench, SQA, TextVQA, ChartQA, DocVQA, OCRBench
- Video Tasks: VideoMME, NeXT-QA, SEED-Bench-Video, MVBench
All benchmarks can be downloaded from their official sources; please follow the original instructions for setup.
Many original datasets are provided in .parquet format. However, we convert most of them into .json files with images stored separately (personal preference).
We also provide several example annotation files under ./data/XXX_benchmark to illustrate the expected data format and directory structure.
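If you want to follow the same convention, a conversion along these lines works. This is a minimal sketch: the column names `image`, `question`, and `answer`, as well as the HF-style image byte layout, are assumptions and will differ per benchmark.

```python
import io
import json
import os

import pandas as pd
from PIL import Image

# Minimal .parquet -> .json conversion sketch; adapt column names per benchmark.
df = pd.read_parquet("benchmark.parquet")
os.makedirs("images", exist_ok=True)

records = []
for i, row in df.iterrows():
    image_path = f"images/{i:06d}.jpg"
    # Many HF-style parquet files store images as {"bytes": ..., "path": ...}.
    Image.open(io.BytesIO(row["image"]["bytes"])).convert("RGB").save(image_path)
    records.append({
        "index": int(i),
        "question": row["question"],
        "answer": row["answer"],
        "image_path": image_path,
    })

with open("benchmark.json", "w") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)
```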
Custom Benchmarks:
We develop a simple, user-friendly pipeline that ensures inference is fully compatible with the training forward pass. To add a new benchmark, you can follow the implementation of existing ones:
- Implement the benchmark logic in `./src/evaluate/benchmarks/NEW_BENCH.py`:

  ```python
  class CustomDataset(object):
      modality = "image"  # or "video"

      def __init__(self, image_path="", anno_path="", pre_prompt="", post_prompt=""):
          # Load your annotations here
          self.data = []

      def __len__(self):
          return len(self.data)

      def __getitem__(self, idx):
          # 1. Prepare common fields
          res = {
              "index": idx,
              "prompt": "Your formatted prompt",
              "GT": "Ground truth answer"
          }

          # 2. Add media: supports image (PIL/path) or video (path)
          # For image benchmarks:
          res.update({
              "image": image,  # supports a PIL.Image object OR an image path
              "image_path": image_path
          })
          # OR for video benchmarks:
          # res.update({
          #     "video_path": video_path
          # })
          return res
  ```

- Implement the corresponding evaluation metrics in `./src/evaluate/benchmarks/metrics/eval_NEW_BENCH.py`.

- Update `./src/evaluate/benchmarks/benchmarks_config.py`.
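A quick way to sanity-check a new benchmark class before wiring it into the config. The module path and constructor arguments below follow the template above and are placeholders; adjust them to your actual `NEW_BENCH.py`.

```python
# Hypothetical sanity check for a new benchmark; adjust the module path and
# constructor arguments to match your NEW_BENCH.py.
from src.evaluate.benchmarks.NEW_BENCH import CustomDataset

ds = CustomDataset(image_path="/path/to/new_bench/images",
                   anno_path="/path/to/new_bench/annotations.json")
print(f"{len(ds)} samples, modality = {ds.modality}")

item = ds[0]
print(item["prompt"])
print(item["GT"])
```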
🏋️‍♀️ Training
Run Training
- `MODEL_PATH`: path to the pretrained model
- `DATA_ROOT`: root directory of your training data
- `DATA_JSON`: JSON file describing the dataset (examples in `./data/demo.json`)
```bash
bash scripts/run_sft.sh
# For debugging (single GPU/NPU, no deepspeed, supports breakpoint):
# bash scripts/run_sft.sh debug
```
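For a one-off run it can be convenient to pass these variables on the command line, assuming `scripts/run_sft.sh` reads them from the environment in the same style as `scripts/run_infer.sh`; otherwise, set them inside the script:

```bash
# Assumed env-var style invocation; paths are placeholders.
MODEL_PATH=/path/to/Qwen2.5-VL-3B-Instruct \
DATA_ROOT=/path/to/train_data \
DATA_JSON=./data/demo.json \
bash scripts/run_sft.sh
```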
🧪 Evaluation
- `MODEL_PATH`: path to the model checkpoint
- `BENCHMARKS`: list of benchmarks for evaluation
- `PPE_CONFIG`: configuration options for different compression settings
- ⚠️ Reminder: edit `DATASET_CONFIG` in `./src/evaluate/benchmarks_config.py` according to your local setup.
Run Inference
```bash
MODEL_PATH=/path/to/model bash scripts/run_infer.sh
# For debugging (single GPU/NPU, supports breakpoint):
# bash scripts/run_infer.sh debug
```
📊 Benchmarks
🖼️ Image Tasks
1. Experiments on Qwen2.5-VL-3B-Instruct
| Qwen2.5-VL-3B-Instruct | Method | MMBench (EN) | MMBench (CN) | SQA* | TextVQA | DocVQA | OCRBench | ChartQA | Red. Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Training-Free | Vanilla (report) | 79.10 | 78.10 | 76.14 | 79.30 | 93.90 | 797 | 84.00 | 0% |
| | Chat-UniVi | 81.50 | 80.06 | 74.35 | 37.60 | 19.58 | 307 | 18.72 | 55% |
| | Chat-UniVi + PPE | 82.28 (+0.78) | 81.43 (+1.37) | 74.58 (+0.23) | 73.78 (+36.18) | 66.16 (+46.58) | 598 (+291) | 67.08 (+48.36) | 55% |
| SFT | Dense | 85.89 | 86.07 | 79.39 | 79.50 | 89.44 | 761 | 79.96 | 0% |
| | Chat-UniVi | 84.92 | 83.71 | 77.48 | 57.66 | 52.48 | 535 | 49.60 | 55% |
| | Chat-UniVi + PPE | 84.73 (-0.19) | 84.87 (+1.16) | 78.34 (+0.86) | 77.14 (+19.48) | 76.79 (+24.31) | 691 (+156) | 74.52 (+24.92) | 55% |
* denotes reproduction results, as these benchmarks are not reported in the original paper.
2. Experiments on Qwen2.5-VL-7B-Instruct
We further extended our experiments to the 7B model. However, due to time and resource constraints, we trained it on only 1/5 of the data used for the 3B model.
| Qwen2.5-VL-7B-Instruct | Method | MMBench (EN) | MMBench (CN) | SQA* | TextVQA | DocVQA | OCRBench | ChartQA | Red. Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Training-Free | Vanilla (report) | 83.50 | 83.40 | 85.52 | 84.90 | 95.70 | 864 | 87.30 | 0% |
| | Chat-UniVi | 83.23 | 80.18 | 80.49 | 35.82 | 27.06 | 479 | 19.92 | 55% |
| | Chat-UniVi + PPE | 83.58 (+0.35) | 82.35 (+2.17) | 81.59 (+1.10) | 63.44 (+27.62) | 66.42 (+39.36) | 577 (+98) | 46.72 (+29.8) | 55% |
| SFT | Dense | 86.90 | 85.35 | 84.83 | 87.20 | 92.97 | 826 | 86.32 | 0% |
| | Chat-UniVi | 86.23 | 84.25 | 82.40 | 54.92 | 50.01 | 584 | 43.96 | 55% |
| | Chat-UniVi + PPE | 86.26 (+0.03) | 84.85 (+0.60) | 83.56 (+1.16) | 82.46 (+27.54) | 85.84 (+35.83) | 764 (+180) | 78.88 (+34.92) | 55% |
* denotes reproduction results, as these benchmarks are not reported in the original paper.
🎬 Video Tasks
| Qwen2.5-VL-3B-Instruct | Method | VideoMME (w/o subs) | VideoMME (w/ subs) | NeXT-QA (MC) | NeXT-QA (OE) | SEED-Bench-Video | MVBench | Avg. | Red. Ratio |
|---|---|---|---|---|---|---|---|---|---|
| SFT | Dense | 57.81 | 57.96 | 78.20 | 31.65 | 57.60 | 67.90 | 58.52 | 0% |
| | Chat-UniVi | 57.22 | 57.22 | 77.63 | 25.37 | 56.08 | 66.90 | 56.74 | 55% |
| | Chat-UniVi + PPE | 58.70 (+1.48) | 59.07 (+1.85) | 78.42 (+0.42) | 32.61 (+7.24) | 55.98 (-0.10) | 67.38 (+0.48) | 58.69 (+1.95) | 55% |
| | + PPE Cascade | 58.48 | 58.52 | 78.20 | 32.20 | 56.11 | 67.35 | 58.48 | 90% |
📌 Citation
If you find this work helpful, please consider citing us:
```bibtex
@article{huang2025ppe,
  title={PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models},
  author={Mouxiao Huang and Borui Jiang and Dehua Zheng and Hailin Hu and Kai Han and Xinghao Chen},
  journal={arXiv preprint arXiv:2510.22936},
  year={2025}
}
```
❓ FAQ
Q1: Why are some baseline comparisons missing from this repo?
A: For convenience, we ran comparisons directly against the official implementations of PACT, ToMe, VisionZip, etc.
Q2: Why are some benchmarks missing from this repo?
A: Due to internal compliance and the lengthy review process required for exporting code from our corporate environment, some benchmark implementations are currently unavailable. Even though these are based on open-source standards, the export process remains restrictive. However, we have designed the pipeline to be highly extensible. We encourage you to implement your own benchmarks using our straightforward template; it is designed for a seamless, plug-and-play experience.
Q3: Why is K=8 by default?
A: This version is adapted to Qwen2.5-VL, which originally uses 3D-MRoPE (mrope_section=[16, 24, 24]). K=8 works well for both video and image experiments. For experiments strictly aligned with the paper's image-only results, please manually switch to 2D-MRoPE.
Q4: What if we set K=32 when using PPE with mrope_section=[16, 24, 24]?
A: It falls back to a repeating [1(T), 1(H), 1(W), ...] pattern. Since 64 is not divisible by 3, this yields only 21 complete T/H/W triplets (64 = 3 × 21 + 1), leaving the remainder incomplete. Although not the intended implementation, the performance remains decent because the compressed token still captures multiple position cues rather than just one.
Q5: Is this the full implementation of the paper?
A: No. The currently released code is a cleaned and re-implemented version optimized for readability. Fully migrating and organizing every single experiment involves a significant amount of redundant manual labor. More importantly, the core idea of PPE is elegantly simple and easy to implement: compressed token RoPE embeddings should represent multiple original positions rather than a single point. Our goal is to provide this key insight to the community to foster further discussion and collaborative exploration.
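For readers who want a concrete starting point, here is a deliberately simplified, hypothetical sketch of that idea (not the repository's implementation): a compressed token keeps K position ids sampled from the tokens it absorbs, one per RoPE channel group, instead of a single collapsed position.

```python
import torch

# Conceptual sketch only, not the repo's implementation: give a compressed token
# K positional anchors drawn from the original tokens it replaces, so its RoPE
# channel groups encode multiple source positions instead of one.
def ppe_position_ids(source_positions: torch.Tensor, K: int = 8) -> torch.Tensor:
    """source_positions: (N,) position ids of the tokens merged into one compressed token.
    Returns (K,) position ids, one per RoPE channel group."""
    n = source_positions.numel()
    # Evenly spaced anchor indices over the merged span (clamped in case N < K).
    idx = torch.linspace(0, n - 1, K).round().long().clamp_(0, n - 1)
    return source_positions[idx]

merged = torch.arange(100, 132)          # e.g. 32 original positions merged into one token
print(ppe_position_ids(merged, K=8))     # 8 preserved positional cues for this token
```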
🤝 Contributing
We welcome contributions from the community! Here's how to get started:
- Fork this repository
- Create a new feature branch: `git checkout -b feature/your-feature-name`
- Make your changes and commit them
- Push your branch and open a pull request
📬 Contact
- 💬 For questions, suggestions, or bug reports, please open an issue on GitHub or email us.
⚖️ License
📄 This project is licensed under the Apache License 2.0.
🙏 Acknowledgements
We build upon the inspiring work of: