MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
February 3, 2026 · View on GitHub
Official implementation of MARC (Memory-Augmented RL Token Compression), accepted at ICLR 2026.
News
- [2026/02/02] Preliminary code release, including training and inference scripts
- [2026/01/22] Our paper is accepted at ICLR 2026!
Note: Training data and VMR code will be released in the future.
Overview
MARC is a novel framework for efficient video understanding that combines:
- Visual Memory Retriever (VMR): Segments videos into event-level fragments and retrieves query-relevant clips
- Compression Group Relative Policy Optimization (C-GRPO): An RL-based distillation strategy that compresses video tokens while preserving reasoning ability
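At the heart of GRPO-style training is a group-relative advantage: for each query the policy samples a group of responses, scores each with a reward, and normalizes each reward against the group's mean and standard deviation. A minimal sketch of that normalization step, with illustrative reward values (C-GRPO's compression-specific reward design is described in the paper, not reproduced here):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Normalize each sampled response's reward against its group."""
    mu, sigma = mean(rewards), stdev(rewards)
    # The small epsilon guards against a zero std when all rewards tie.
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# e.g. rewards for 4 responses sampled for the same video/query pair
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.5])
```

Responses scoring above the group mean get a positive advantage and are reinforced; those below get a negative one.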
Key Results
- 95% reduction in visual tokens (64 frames → 1-frame equivalent)
- 72% reduction in GPU memory usage
- 23.9% reduction in generation latency
- Near-identical accuracy to the 64-frame baseline (42.20 vs. 42.21 mean)
Setup
```bash
git clone https://github.com/Gimlettt/MARC
cd MARC

# Create and activate conda environment
conda create -n marc python=3.11
conda activate marc

# Install base dependencies
bash setup.sh

# Install additional required packages
pip install wandb==0.18.3
pip install tensorboardx
pip install qwen_vl_utils torchvision
pip install flash-attn --no-build-isolation
pip install nltk
pip install rouge_score
pip install deepspeed
```
Replace Transformers Source Files
After installing transformers, replace two files in the installed package with the modified versions that enable compression:
- Replace `<TRANSFORMERS_PATH>/models/qwen2_5_vl/modeling_qwen2_5_vl.py` with `qwen2_5_vl/modeling_qwen2_5_vl(compress).py`
- Replace `<TRANSFORMERS_PATH>/models/qwen2_5_vl/processing_qwen2_5_vl.py` with `qwen2_5_vl/processing_qwen2_5_vl(compress).py`
You can find your transformers installation path by running:
```bash
python -c "import transformers; import os; print(os.path.dirname(transformers.__file__))"
```
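The two replacements can also be scripted. A minimal sketch, assuming the repository root as the working directory; the helper name and the `.bak` backup convention are illustrative, not part of the repo:

```python
import os
import shutil

def install_compress_files(transformers_dir, repo_dir="."):
    """Copy the compression-enabled Qwen2.5-VL files over the stock
    transformers ones, backing up each original as <name>.py.bak."""
    replacements = [
        ("modeling_qwen2_5_vl.py", "modeling_qwen2_5_vl(compress).py"),
        ("processing_qwen2_5_vl.py", "processing_qwen2_5_vl(compress).py"),
    ]
    dest_dir = os.path.join(transformers_dir, "models", "qwen2_5_vl")
    for dest_name, src_name in replacements:
        dest = os.path.join(dest_dir, dest_name)
        shutil.copy2(dest, dest + ".bak")  # back up the stock file first
        shutil.copy2(os.path.join(repo_dir, "qwen2_5_vl", src_name), dest)
```

Re-running `pip install transformers` will overwrite the replaced files, so this step has to be repeated after any transformers upgrade.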
Inference
For a complete inference example, see `inference_script/inference_example.py`.
Benchmark Evaluation
To evaluate on benchmarks, use:
```bash
bash inference_script/eval_bench.sh
```
Training
C-GRPO Training
To train with Compression Group Relative Policy Optimization:
```bash
bash training_script/run_grpo_video.sh
```
Training script: `training_script/grpo.py`
Supervised Fine-Tuning (SFT)
For comparison with standard SFT:
```bash
bash training_script/run_sft_video.sh
```
Training script: `training_script/sft_video.py`
Results
Performance Comparison
| Method | VSI | VideoMMMU | MMVU | MVBench | TempCompass | VideoMME | Mean |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B (64f) | 32.93 | 35.33 | 48.64 | 44.77 | 38.05 | 53.55 | 42.21 |
| MARC-3B (1f) | 27.55 | 33.11 | 51.99 | 45.82 | 55.34 | 39.44 | 42.20 |
Efficiency Improvements
- Visual Tokens: 2589.93 → 122.69 (95% reduction)
- GPU Memory: 41.6 GB → 11.5 GB (72% reduction)
- Generation Latency: 0.46 s → 0.35 s (23.9% reduction)
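The percentages above follow directly from the reported before/after measurements; a quick sanity check:

```python
def reduction(before, after):
    """Percentage reduction from a before/after measurement pair."""
    return 100.0 * (before - after) / before

print(round(reduction(2589.93, 122.69), 1))  # visual tokens -> 95.3
print(round(reduction(41.6, 11.5), 1))       # GPU memory    -> 72.4
print(round(reduction(0.46, 0.35), 1))       # latency       -> 23.9
```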
Training Data
We use a subset of the Video-R1-260K dataset:
- 5K samples for C-GRPO training
- Includes both video and image data
- Covers multiple domains: Knowledge, Math, Chart, Spatial, OCR, General reasoning
- See training data distribution in the paper
Note: Training data will be released in the future.
Citation
If you find MARC useful for your research, please cite:
```bibtex
@article{wu2025marc,
  title={MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding},
  author={Wu, Peiran and Yu, Zhuorui and Liu, Yunze and Wu, Chi-Hao and Zhou, Enmin and Shen, Junxiao},
  journal={arXiv preprint arXiv:2510.07915},
  year={2025}
}
```
Acknowledgments
This project builds upon:
- Video-R1 for the base training framework
- Qwen2.5-VL for the base vision-language model
- TRL for the GRPO implementation
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Contact
For questions and feedback, please open an issue on GitHub.