🌟 AdaReTaKe: Adaptive Redundancy Reduction for Long-Context Video-Language Understanding

April 21, 2026

Paper: Breaking the "Memory Wall" for MLLMs with Adaptive Video Compression (ACL 2025)


Authors

Xiao Wang1,2,‡, Qingyi Si2,‡, Jianlong Wu1,*, Shiyu Zhu3, Li Cao2, Liqiang Nie1,*

1 Harbin Institute of Technology, Shenzhen
2 Huawei Technologies Co., Ltd.
3 Shandong University
‡ Equal contribution; * Corresponding authors


🤖 Reproduce with a Coding Agent (One Prompt)

Have a coding agent (Claude Code, Cursor, etc.) reproduce all paper results end-to-end with a single prompt:

Read AGENTS.md and reproduce the AdaReTaKe paper results end-to-end.

AGENTS.md contains everything the agent needs: environment setup, dataset preparation, eval commands, expected scores, and common failure modes.


🔍 Overview

AdaReTaKe is a video compression framework for Multimodal Large Language Models (MLLMs). Visual redundancy is unevenly distributed across timestamps and model layers; by reducing it adaptively along both axes, AdaReTaKe:
✅ Extends context capacity from 256 to 2048 frames
✅ Theoretically minimizes compression loss via adaptive ratio allocation
✅ Outperforms prior state-of-the-art methods by 2.3% (7B) and 2.8% (72B) on four benchmarks

[Figure: AdaReTaKe framework]


🎯 Key Contributions

| Feature | Innovation |
|---|---|
| Adaptive Redundancy Reduction | Layer-wise + timestamp-wise compression for maximal context retention |
| Scalability | Validated on 7B to 72B MLLMs with consistent gains |
| Theoretical Guarantee | Compression ratio allocation minimizes the loss upper bound |
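
To make the idea of adaptive ratio allocation concrete, here is a minimal sketch of a proportional budget split. It is not the allocation rule from the paper: the per-chunk redundancy scores, the function name `allocate_budget`, and the `min_per_chunk` floor are all hypothetical and chosen for illustration only. The sketch gives each video chunk a share of a fixed retained-token budget in proportion to how non-redundant it is, and uses largest-remainder rounding so the integer shares add up exactly.

```python
def allocate_budget(redundancy, total_budget, min_per_chunk=1):
    """Split a retained-token budget across chunks (illustrative heuristic only).

    redundancy: hypothetical per-chunk scores in [0, 1]; higher = more redundant.
    total_budget: total number of tokens that may be kept across all chunks.
    min_per_chunk: floor so that no chunk is dropped entirely (an assumption).
    """
    n = len(redundancy)
    if n == 0:
        return []
    spare = total_budget - min_per_chunk * n
    if spare < 0:
        raise ValueError("total_budget is too small for the per-chunk floor")

    # Less redundant chunks get a larger weight.
    weights = [max(0.0, 1.0 - r) for r in redundancy]
    total_weight = sum(weights)
    if total_weight == 0:
        # Every chunk is fully redundant: fall back to an even split.
        weights = [1.0] * n
        total_weight = float(n)

    # Proportional share of the spare budget, rounded down.
    raw = [spare * w / total_weight for w in weights]
    budget = [min_per_chunk + int(x) for x in raw]

    # Largest-remainder rounding: hand leftover tokens to the chunks
    # whose fractional part was largest.
    leftover = total_budget - sum(budget)
    by_remainder = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in by_remainder[:leftover]:
        budget[i] += 1
    return budget


if __name__ == "__main__":
    # Three chunks: nearly static, moderately redundant, highly dynamic.
    print(allocate_budget([0.9, 0.5, 0.1], total_budget=150))
```

In this toy setting the nearly static chunk keeps the fewest tokens and the dynamic chunk keeps the most, while the total stays fixed at the memory budget.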

🛠️ Setup

Environment

# For GPU users
conda create -n retake python=3.11
conda activate retake  # activate the new environment so pip installs into it
pip install -r requirements.txt

# For NPU users (e.g., Ascend)
conda env create -f environment_npu.yaml

# Additional dependencies
apt-get install ffmpeg  # Required for full video processing (may need sudo)
pip install flash-attn==2.6.3 --no-build-isolation
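
Before running the demo, it can help to confirm that the pieces installed above are visible from Python. The following is a small sanity-check sketch, not part of the repository: the function name `check_environment` and the report keys are hypothetical. It only checks the Python version, whether an `ffmpeg` binary is on `PATH`, and whether the `flash_attn` module (the import name of the `flash-attn` package) can be found.

```python
import importlib.util
import shutil
import sys


def check_environment(expected_python=(3, 11)):
    """Return a dict of boolean checks for the setup steps (hypothetical helper)."""
    return {
        # The setup commands create the environment with python=3.11.
        "python_version_matches": sys.version_info[:2] == expected_python,
        # ffmpeg is installed as a system binary, so look it up on PATH.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        # find_spec returns None when a top-level module is not installed.
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

A `MISSING` line points at the setup step to revisit; the script does not verify that the GPU or NPU itself is usable.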

🚦 Quick Start

1️⃣ Configure Paths

Edit demo.py:

hf_qwen2vl7b_path = "your/local/path/to/Qwen2-VL-7B-Instruct"  
# NPU users: config_path = 'configs/demo_npu.yaml'

2️⃣ (Optional) Convert LLaVA-Video Weights

python scripts/utils/convert_llava_video_weights_to_hf.py \
  --text_model_id /path_to/Qwen2-7B-Instruct \
  --vision_model_id /path_to/siglip-so400m-patch14-384 \
  --output_hub_path /path_to/llava-video-qwen2-7b-hf \
  --old_state_dict_id /path_to/LLaVAVideoQwen2_7B

3️⃣ Run Demo

python demo.py

📈 Reproduce Results

Dataset Preparation

Evaluation Scripts

# Main results (paper configuration: temporal + AdaKV, 2048 frames)
bash main_results.sh

# Ablation study (1024 frames, 4 configs × 4 datasets)
bash ablation.sh

Results are saved in ./results.

Main Results (Qwen2.5-VL-7B, Paper Configuration)

| Benchmark | Frames | FPS | Score |
|---|---|---|---|
| MLVU (M-AVG) | 2048 | 2 | 75.2 |
| LongVideoBench | 2048 | 2 | 61.6 |
| LVBench | 2048 | 2 | 50.4 |
| Video-MME | 2048 | 4 | 64.8 |
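
As a rough guide to what these settings mean for video length, the sketch below converts a frame count and a frame rate into the duration they span. It assumes that the FPS column is the frame sampling rate and that the Frames column is an upper limit on sampled frames; the README does not spell this out, and the helper names `covered_seconds` and `format_duration` are hypothetical.

```python
def covered_seconds(max_frames, fps):
    """Seconds of video spanned by max_frames frames sampled at fps (assumed semantics)."""
    return max_frames / fps


def format_duration(seconds):
    """Format a duration in seconds as minutes and seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes} min {secs:02d} s"


if __name__ == "__main__":
    for benchmark, frames, fps in [
        ("MLVU", 2048, 2),
        ("LongVideoBench", 2048, 2),
        ("LVBench", 2048, 2),
        ("Video-MME", 2048, 4),
    ]:
        print(benchmark, format_duration(covered_seconds(frames, fps)))
```

Under these assumptions, 2048 frames at 2 FPS span about 17 minutes of video and 2048 frames at 4 FPS span about 8.5 minutes; how longer videos are handled is not stated here.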

Ablation Study: Scaling to 1024 Frames

We conduct ablation experiments at 1024 frames (4× the 256-frame setting used in the paper) to study how each component behaves when scaling to more frames. Four configurations are compared:

| Config | Temporal | Layer Allocation | Description |
|---|---|---|---|
| no_both | ✗ | Even | Baseline |
| no_layer | ✓ | Even | Temporal adaptation only |
| no_temporal | ✗ | AdaKV | Layer allocation only |
| full | ✓ | AdaKV | Full method (paper) |

Results (overall accuracy):

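| Config | LVBench | LongVideoBench | MLVU | VideoMME | Avg |
|---|---|---|---|---|---|
| Baseline (no_both) | 49.19 | 61.40 | 75.63 | 66.67 | 63.22 |
| Temporal only (no_layer) | 49.97 | 61.48 | 75.94 | 66.63 | 63.51 |
| AdaKV only (no_temporal) | 48.55 | 61.65 | 75.50 | 66.52 | 63.06 |
| Full (full) | 48.48 | 62.22 | 75.41 | 66.19 | 63.08 |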

Key observations at 1024-frame scale:

  • Temporal adaptation helps or is neutral relative to the baseline: on its own (no_layer) it improves LVBench (+0.78) and MLVU (+0.31) and leaves LongVideoBench (+0.08) and VideoMME (−0.04) essentially unchanged. This suggests that the temporal adaptation mechanism generalizes to higher frame counts.
  • Layer allocation shows dataset-dependent behavior: AdaKV layer allocation benefits LongVideoBench (+0.82 over the baseline when combined with temporal adaptation), where subtitle-rich prompts create distinct cross-modal attention patterns across layers. However, it hurts LVBench (−0.64 for AdaKV only) and VideoMME (−0.48 for the full method), both relative to the baseline. This divergence at higher frame counts warrants further investigation, potentially through more fine-grained layer-wise budget strategies or dataset-adaptive allocation.
  • LongVideoBench is a special case: its questions include full subtitle transcripts (about 3000 tokens on average), creating a fundamentally different attention landscape from purely visual benchmarks.
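
The averages and deltas quoted in this section can be recomputed from the ablation table. The sketch below does so with `decimal.Decimal` to avoid binary floating-point rounding surprises. The scores are copied from the table; the helper names `average` and `delta` are hypothetical, and rounding half up is an assumption that happens to reproduce the Avg column.

```python
from decimal import Decimal, ROUND_HALF_UP

# Overall accuracy per config and benchmark, copied from the ablation table.
RESULTS = {
    "no_both": {"LVBench": "49.19", "LongVideoBench": "61.40", "MLVU": "75.63", "VideoMME": "66.67"},
    "no_layer": {"LVBench": "49.97", "LongVideoBench": "61.48", "MLVU": "75.94", "VideoMME": "66.63"},
    "no_temporal": {"LVBench": "48.55", "LongVideoBench": "61.65", "MLVU": "75.50", "VideoMME": "66.52"},
    "full": {"LVBench": "48.48", "LongVideoBench": "62.22", "MLVU": "75.41", "VideoMME": "66.19"},
}


def average(config):
    """Mean over the four benchmarks, rounded half up to two decimals (assumed rounding)."""
    scores = [Decimal(s) for s in RESULTS[config].values()]
    mean = sum(scores) / len(scores)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def delta(config, benchmark, baseline="no_both"):
    """Score difference of a config against the baseline on one benchmark."""
    return Decimal(RESULTS[config][benchmark]) - Decimal(RESULTS[baseline][benchmark])


if __name__ == "__main__":
    for config in RESULTS:
        print(f"{config:12s} avg = {average(config)}")
    print("temporal only, LVBench:", delta("no_layer", "LVBench"))
    print("full, LongVideoBench:", delta("full", "LongVideoBench"))
```

Running it shows which pair of configurations each quoted delta compares: the LVBench drop comes from the AdaKV-only row, while the VideoMME drop comes from the full method, both measured against the baseline.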

📄 License

Pending final release. ⚠️ Research use only. Commercial applications require explicit permission.