MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
February 3, 2026 · View on GitHub
Official implementation of MARC (Memory-Augmented RL Token Compression), accepted at ICLR 2026.
News
- [2026/02/02] Preliminary code release, including training and inference scripts
- [2026/01/22] Our paper is accepted at ICLR 2026!
Note: Training data and VMR code will be released in the future.
Overview
MARC is a novel framework for efficient video understanding that combines:
- Visual Memory Retriever (VMR): Segments videos into event-level fragments and retrieves query-relevant clips
- Compression Group Relative Policy Optimization (C-GRPO): An RL-based distillation strategy that compresses video tokens while preserving reasoning ability
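At the heart of GRPO-style training is a group-relative advantage: for each query the policy samples a group of responses, scores each with a reward, and normalizes each reward against the group's mean and standard deviation. A minimal sketch of that normalization step, with illustrative reward values (C-GRPO's compression-specific reward design is described in the paper, not reproduced here):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Normalize each sampled response's reward against its group."""
    mu, sigma = mean(rewards), stdev(rewards)
    # The small epsilon guards against a zero std when all rewards tie.
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# e.g. rewards for 4 responses sampled for the same video/query pair
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.5])
```

Responses scoring above the group mean get a positive advantage and are reinforced; those below get a negative one.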
Key Results
- 95% reduction in visual tokens (64 frames → 1-frame equivalent)
- 72% reduction in GPU memory usage
- 23.9% reduction in generation latency
- Near-identical accuracy to the 64-frame baseline (42.20 vs. 42.21 mean)
Setup
```bash
git clone https://github.com/Gimlettt/MARC
cd MARC

# Create and activate conda environment
conda create -n marc python=3.11
conda activate marc

# Install base dependencies
bash setup.sh

# Install additional required packages
pip install wandb==0.18.3
pip install tensorboardx
pip install qwen_vl_utils torchvision
pip install flash-attn --no-build-isolation
pip install nltk
pip install rouge_score
pip install deepspeed
```
Replace Transformers Source Files
After installing transformers, replace two files in the installed package with the modified versions that enable compression:
- Replace `<TRANSFORMERS_PATH>/models/qwen2_5_vl/modeling_qwen2_5_vl.py` with `qwen2_5_vl/modeling_qwen2_5_vl(compress).py`
- Replace `<TRANSFORMERS_PATH>/models/qwen2_5_vl/processing_qwen2_5_vl.py` with `qwen2_5_vl/processing_qwen2_5_vl(compress).py`
You can find your transformers installation path by running:
```bash
python -c "import transformers; import os; print(os.path.dirname(transformers.__file__))"
```
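The two replacements can also be scripted. A minimal sketch, assuming the repository root as the working directory; the helper name and the `.bak` backup convention are illustrative, not part of the repo:

```python
import os
import shutil

def install_compress_files(transformers_dir, repo_dir="."):
    """Copy the compression-enabled Qwen2.5-VL files over the stock
    transformers ones, backing up each original as <name>.py.bak."""
    replacements = [
        ("modeling_qwen2_5_vl.py", "modeling_qwen2_5_vl(compress).py"),
        ("processing_qwen2_5_vl.py", "processing_qwen2_5_vl(compress).py"),
    ]
    dest_dir = os.path.join(transformers_dir, "models", "qwen2_5_vl")
    for dest_name, src_name in replacements:
        dest = os.path.join(dest_dir, dest_name)
        shutil.copy2(dest, dest + ".bak")  # back up the stock file first
        shutil.copy2(os.path.join(repo_dir, "qwen2_5_vl", src_name), dest)
```

Re-running `pip install transformers` will overwrite the replaced files, so this step has to be repeated after any transformers upgrade.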
Inference
For a complete inference example, see `inference_script/inference_example.py`.
Benchmark Evaluation
To evaluate on benchmarks, use:
```bash
bash inference_script/eval_bench.sh
```
Training
C-GRPO Training
To train with Compression Group Relative Policy Optimization:
```bash
bash training_script/run_grpo_video.sh
```
Training script: `training_script/grpo.py`
Supervised Fine-Tuning (SFT)
For comparison with standard SFT:
```bash
bash training_script/run_sft_video.sh
```
Training script: `training_script/sft_video.py`
Results
Performance Comparison
| Method | VSI | VideoMMMU | MMVU | MVBench | TempCompass | VideoMME | Mean |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B (64f) | 32.93 | 35.33 | 48.64 | 44.77 | 38.05 | 53.55 | 42.21 |
| MARC-3B (1f) | 27.55 | 33.11 | 51.99 | 45.82 | 55.34 | 39.44 | 42.20 |
Efficiency Improvements
- Visual Tokens: 2589.93 → 122.69 (95% reduction)
- GPU Memory: 41.6 GB → 11.5 GB (72% reduction)
- Generation Latency: 0.46 s → 0.35 s (23.9% reduction)
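The percentages above follow directly from the reported before/after measurements; a quick sanity check:

```python
def reduction(before, after):
    """Percentage reduction from a before/after measurement pair."""
    return 100.0 * (before - after) / before

print(round(reduction(2589.93, 122.69), 1))  # visual tokens -> 95.3
print(round(reduction(41.6, 11.5), 1))       # GPU memory    -> 72.4
print(round(reduction(0.46, 0.35), 1))       # latency       -> 23.9
```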
Training Data
We use a subset of the Video-R1-260K dataset:
- 5K samples for C-GRPO training
- Includes both video and image data
- Covers multiple domains: Knowledge, Math, Chart, Spatial, OCR, General reasoning
- See training data distribution in the paper
Note: Training data will be released in the future.
Citation
If you find MARC useful for your research, please cite:
```bibtex
@article{wu2025marc,
  title={MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding},
  author={Wu, Peiran and Yu, Zhuorui and Liu, Yunze and Wu, Chi-Hao and Zhou, Enmin and Shen, Junxiao},
  journal={arXiv preprint arXiv:2510.07915},
  year={2025}
}
```
Acknowledgments
This project builds upon:
- Video-R1 for the base training framework
- Qwen2.5-VL for the base vision-language model
- TRL for the GRPO implementation
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Contact
For questions and feedback, please open an issue on GitHub.