AdaReTaKe: Adaptive Redundancy Reduction for Long-Context Video-Language Understanding
April 21, 2026
Breaking the "Memory Wall" for MLLMs with Adaptive Video Compression (ACL 2025)
Authors
Xiao Wang1,2,‡, Qingyi Si2,‡, Jianlong Wu1*, Shiyu Zhu3, Li Cao2, Liqiang Nie1*
1 Harbin Institute of Technology, Shenzhen
2 Huawei Technologies Co., Ltd.
3 Shandong University
‡ Equal contribution
* Corresponding authors
Reproduce with a Coding Agent (One Prompt)
Have a coding agent (Claude Code, Cursor, etc.) reproduce all paper results end-to-end with a single prompt:
Read AGENTS.md and reproduce the AdaReTaKe paper results end-to-end.
AGENTS.md contains everything the agent needs: environment setup, dataset preparation, eval commands, expected scores, and common failure modes.
Overview
AdaReTaKe is an advanced video compression framework designed for Multimodal Large Language Models (MLLMs). By adaptively reducing uneven visual redundancy across timestamps and model layers, it:
- Extends context capacity from 256 to 2048 frames
- Theoretically minimizes compression loss via adaptive ratio allocation
- Outperforms SOTA by +2.3% (7B) and +2.8% (72B) on four benchmarks
Key Contributions
| Feature | Innovation |
|---|---|
| Adaptive Redundancy Reduction | Layer-wise + timestamp-wise compression for maximal context retention |
| Scalability | Validated on 7B to 72B MLLMs with consistent gains |
| Theoretical Guarantee | Compression ratio allocation minimizes the loss upper bound |
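To make the idea of adaptive ratio allocation concrete, here is a minimal sketch of splitting a fixed token budget across video chunks so that less redundant chunks keep more tokens. It is an illustration only, not the allocation rule from the paper: the per-chunk `redundancy` scores, the function name `allocate_budget`, and the proportional weighting with largest-remainder rounding are all assumptions made for this example.

```python
def allocate_budget(redundancy, total_budget):
    """Split `total_budget` tokens across chunks (hypothetical helper).

    `redundancy` holds one score in [0, 1] per chunk; a higher score means
    the chunk is more redundant and should keep fewer tokens.
    """
    # Weight each chunk by how informative it is (1 - redundancy).
    weights = [max(1.0 - r, 1e-6) for r in redundancy]
    total_weight = sum(weights)
    raw = [total_budget * w / total_weight for w in weights]

    # Round down, then hand leftover tokens to the largest remainders
    # so the budgets sum exactly to total_budget.
    budgets = [int(x) for x in raw]
    leftover = total_budget - sum(budgets)
    by_remainder = sorted(
        range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True
    )
    for i in by_remainder[:max(leftover, 0)]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    # Three chunks: very redundant, moderately redundant, mostly novel.
    print(allocate_budget([0.9, 0.5, 0.1], total_budget=100))
```

The same pattern could in principle be applied along either axis named in the table (timestamps or layers); how AdaReTaKe actually scores redundancy and derives its ratios is described in the paper, not here.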
Setup
Environment
# For GPU users
conda create -n retake python=3.11
conda activate retake
pip install -r requirements.txt
# For NPU users (e.g., Ascend)
conda env create -f environment_npu.yaml
# Additional dependencies
apt-get install ffmpeg # Required for full video processing
pip install flash-attn==2.6.3 --no-build-isolation
Quick Start
1. Configure Paths
Edit demo.py:
hf_qwen2vl7b_path = "your/local/path/to/Qwen2-VL-7B-Instruct"
# NPU users: config_path = 'configs/demo_npu.yaml'
2. (Optional) Convert LLaVA-Video Weights
python scripts/utils/convert_llava_video_weights_to_hf.py \
--text_model_id /path_to/Qwen2-7B-Instruct \
--vision_model_id /path_to/siglip-so400m-patch14-384 \
--output_hub_path /path_to/llava-video-qwen2-7b-hf \
--old_state_dict_id /path_to/LLaVAVideoQwen2_7B
3. Run Demo
python demo.py
Reproduce Results
Dataset Preparation
Evaluation Scripts
# Main results (paper configuration: temporal + AdaKV, 2048 frames)
bash main_results.sh
# Ablation study (1024 frames, 4 configs × 4 datasets)
bash ablation.sh
Results saved in ./results
Main Results (Qwen2.5-VL-7B, Paper Configuration)
| Benchmark | Frames | FPS | Score |
|---|---|---|---|
| MLVU (M-AVG) | 2048 | 2 | 75.2 |
| LongVideoBench | 2048 | 2 | 61.6 |
| LVBench | 2048 | 2 | 50.4 |
| Video-MME | 2048 | 4 | 64.8 |
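As a quick sanity check on what these settings imply, the frame cap and FPS together bound how much video can be covered at the listed sampling rate. The sketch below assumes frames are taken at the listed FPS until the cap is reached; that sampling behaviour is an assumption, since the evaluation scripts are not shown here, and `covered_minutes` is a name invented for this example.

```python
def covered_minutes(max_frames: int, fps: float) -> float:
    """Minutes of video covered when sampling at `fps` up to `max_frames`."""
    return max_frames / fps / 60.0


# (frame cap, FPS) pairs copied from the table above.
SETTINGS = {
    "MLVU": (2048, 2),
    "LongVideoBench": (2048, 2),
    "LVBench": (2048, 2),
    "Video-MME": (2048, 4),
}

if __name__ == "__main__":
    for bench, (frames, fps) in SETTINGS.items():
        print(f"{bench}: {covered_minutes(frames, fps):.1f} min at {fps} FPS")
```

Under this assumption, 2048 frames at 2 FPS cover about 17 minutes and at 4 FPS about 8.5 minutes. For longer videos the frames would have to be spread more sparsely or truncated.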
Ablation Study: Scaling to 1024 Frames
We conduct ablation experiments at 1024 frames (4× the 256-frame setting used in the paper) to study how each component behaves when scaling to more frames. Four configurations are compared:
| Config | Temporal | Layer Allocation | Description |
|---|---|---|---|
| no_both | ✗ | Even | Baseline |
| no_layer | ✓ | Even | Temporal adaptation only |
| no_temporal | ✗ | AdaKV | Layer allocation only |
| full | ✓ | AdaKV | Full method (paper) |
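The four configurations crossed with the four benchmarks give the 16 runs of the ablation. A minimal sketch of that grid follows. The config names and benchmark names come from this README, but the dictionary keys `temporal` and `layer_allocation` are hypothetical labels chosen for illustration and are not the actual options read by `ablation.sh`.

```python
from itertools import product

# Config names as used in this README; the keys inside each dict are
# hypothetical labels, not real script options.
CONFIGS = {
    "no_both": {"temporal": False, "layer_allocation": "even"},
    "no_layer": {"temporal": True, "layer_allocation": "even"},
    "no_temporal": {"temporal": False, "layer_allocation": "adakv"},
    "full": {"temporal": True, "layer_allocation": "adakv"},
}
DATASETS = ["LVBench", "LongVideoBench", "MLVU", "VideoMME"]


def ablation_grid():
    """Return one (config_name, dataset, settings) tuple per run."""
    return [
        (name, dataset, CONFIGS[name])
        for name, dataset in product(CONFIGS, DATASETS)
    ]


if __name__ == "__main__":
    for name, dataset, settings in ablation_grid():
        print(name, dataset, settings)
```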
Results (overall accuracy):
| Config | LVBench | LongVideoBench | MLVU | VideoMME | Avg |
|---|---|---|---|---|---|
| Baseline (no_both) | 49.19 | 61.40 | 75.63 | 66.67 | 63.22 |
| Temporal only (no_layer) | 49.97 | 61.48 | 75.94 | 66.63 | 63.51 |
| AdaKV only (no_temporal) | 48.55 | 61.65 | 75.50 | 66.52 | 63.06 |
| Full (full) | 48.48 | 62.22 | 75.41 | 66.19 | 63.08 |
Key observations at 1024-frame scale:
- Temporal adaptation remains beneficial: relative to the baseline, it improves LVBench (+0.78) and MLVU (+0.31) and is roughly neutral on LongVideoBench (+0.08) and VideoMME (-0.04). This supports the generalizability of the temporal adaptation mechanism.
- Layer allocation shows dataset-dependent behavior: AdaKV layer allocation benefits LongVideoBench (+0.82 for the full method over the baseline), where subtitle-rich prompts create distinct cross-modal attention patterns across layers. However, it hurts LVBench (-0.64, AdaKV only vs. baseline) and VideoMME (-0.48, full method vs. baseline). This divergence at higher frame counts warrants further investigation, potentially through more fine-grained layer-wise budget strategies or dataset-adaptive allocation.
- LongVideoBench is unique: its questions include full subtitle transcripts (~3000 tokens avg), creating a fundamentally different attention landscape compared to purely visual benchmarks.
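The deltas quoted in these observations can be recomputed directly from the results table. The sketch below hard-codes the table values; the helper names `delta` and `average` are introduced only for this example.

```python
# Overall accuracy per config and benchmark, copied from the table above.
SCORES = {
    "no_both": {"LVBench": 49.19, "LongVideoBench": 61.40, "MLVU": 75.63, "VideoMME": 66.67},
    "no_layer": {"LVBench": 49.97, "LongVideoBench": 61.48, "MLVU": 75.94, "VideoMME": 66.63},
    "no_temporal": {"LVBench": 48.55, "LongVideoBench": 61.65, "MLVU": 75.50, "VideoMME": 66.52},
    "full": {"LVBench": 48.48, "LongVideoBench": 62.22, "MLVU": 75.41, "VideoMME": 66.19},
}


def delta(config, benchmark, reference="no_both"):
    """Score difference of `config` against `reference` on one benchmark."""
    return round(SCORES[config][benchmark] - SCORES[reference][benchmark], 2)


def average(config):
    """Unweighted mean over the four benchmarks."""
    values = SCORES[config].values()
    return sum(values) / len(values)


if __name__ == "__main__":
    for config in SCORES:
        deltas = {b: delta(config, b) for b in SCORES[config]}
        print(f"{config}: avg={average(config):.2f} deltas_vs_baseline={deltas}")
```

Note that the quoted deltas use different comparisons: the LVBench figure of -0.64 is `no_temporal` against the baseline, while the LongVideoBench (+0.82) and VideoMME (-0.48) figures are `full` against the baseline.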
License
Pending final release. Research use only; commercial applications require explicit permission.