
May 14, 2025

SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation

¹Southeast University, ²King’s College London, ³The Alan Turing Institute

If you find our project helpful, please give us a star ⭐ on GitHub to stay updated.

Overview

SCOPE is a simple yet effective framework designed to tackle the KV cache bottleneck in large language models (LLMs) during long-context generation. While existing methods primarily focus on the prefill phase, SCOPE introduces stage-level KV cache compression, addressing both prefill and decoding phases separately—an essential improvement for long-output reasoning tasks.

SCOPE is especially useful for LLM applications that require efficient, scalable generation with long outputs.
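To make the idea of stage-level compression concrete, here is a minimal toy sketch. It is not SCOPE's implementation: the function names, the importance scores, and the budgets are all hypothetical, and real methods score tokens from attention weights rather than from a fixed list. It only contrasts a single shared budget with separate budgets for prefill and decoding positions.

```python
def top_k(indices, scores, k):
    """Return the k indices with the highest scores, in ascending position order."""
    ranked = sorted(indices, key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def compress_unified(scores, budget):
    """All cached positions compete for one shared budget."""
    return top_k(range(len(scores)), scores, budget)


def compress_stagewise(scores, n_prefill, prefill_budget, decode_budget):
    """Prefill and decoding positions are compressed against separate budgets."""
    prefill = top_k(range(n_prefill), scores, prefill_budget)
    decode = top_k(range(n_prefill, len(scores)), scores, decode_budget)
    return prefill + decode


# Made-up importance scores: 4 prefill positions followed by 4 decoded positions.
scores = [0.9, 0.8, 0.7, 0.6, 0.1, 0.3, 0.2, 0.4]

unified = compress_unified(scores, budget=4)
stagewise = compress_stagewise(scores, n_prefill=4, prefill_budget=2, decode_budget=2)

print(unified)    # [0, 1, 2, 3]: decoded positions are all evicted
print(stagewise)  # [0, 1, 5, 7]: each stage keeps its own top entries
```

With a shared budget, high-scoring prefill positions can crowd out everything generated later; separate budgets guarantee that each stage retains some of its own entries.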

Comparison of Three Paradigms

Overview of Three Decoding Strategies

Key Observations

Excessive compression during the prefill phase, which requires the full context, impairs comprehension of the reasoning task.
Heavy hitters deviate in reasoning tasks with long outputs.
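One way to read the second observation is that the set of highest-attention positions changes as generation proceeds. The sketch below is an illustration only, assuming that heavy hitters are the top-k positions by attention weight; the attention values are made up, and `top_k_indices` and `overlap` are hypothetical helpers rather than functions from this repository.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores (the 'heavy hitters'), as a set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def overlap(a, b):
    """Jaccard overlap between two index sets; 1.0 means identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy attention weights over 6 cached positions at two decoding steps.
early = [0.40, 0.30, 0.10, 0.10, 0.05, 0.05]
late = [0.05, 0.10, 0.05, 0.10, 0.30, 0.40]

hh_early = top_k_indices(early, 2)
hh_late = top_k_indices(late, 2)
drift = 1.0 - overlap(hh_early, hh_late)

print(sorted(hh_early), sorted(hh_late), drift)  # [0, 1] [4, 5] 1.0
```

A drift near 1.0 means the positions selected early on are no longer the ones receiving attention later, so a cache pruned once from early statistics would have discarded the wrong entries.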

Excessive compression

Deviation of heavy hitters

We provide a notebook, vis_topk_index_attn.ipynb, to reproduce the deviation of heavy hitters result (run on 1× A100 (80GB) GPU).

Visualization

Attention heatmaps for layer 13 of a simplified GSM8K+ sample in LongGenBench:

We provide a notebook, vis_attn_map.ipynb, to reproduce the visualization result (run on 1× A100 (80GB) GPU). Attention maps for the different layers are stored in ./attention_map.

Requirements

torch==2.4.0
transformers==4.44.2
flash_attn==2.5.8
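Before running the notebooks or scripts, it can help to confirm that the installed versions match these pins. The following is a small sketch using only the standard library; `check_pins` is a hypothetical helper, not part of this repository, and treating a local build tag such as `+cu121` as a match is an assumption made here for convenience.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the Requirements list above.
PINNED = {"torch": "2.4.0", "transformers": "4.44.2", "flash_attn": "2.5.8"}


def check_pins(pins=PINNED):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # the package is not installed at all
        # Assumption: a local build tag ("2.4.0+cu121") still counts as "2.4.0".
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins().items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is present at the expected version.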

Environment Setup

conda create -n SCOPE
conda activate SCOPE
pip install -r requirements.txt

LongGenBench

Dataset Construction

Our dataset construction method is based on the original LongGenBench repository. We provide scripts for building the LongGenBench dataset as follows:

LongGenBench-4K

Dataset	Script
GSM8K+	`create_gsm8k_30.sh`
MMLU+	`create_mmlu_30.sh`
CSQA+	`create_csqa_40.sh`

LongGenBench-8K

Dataset	Script
GSM8K++	`create_gsm8k_60.sh`
MMLU++	`create_mmlu_60.sh`
CSQA++	`create_csqa_80.sh`

Example Usage

To generate the GSM8K+ dataset, run:

bash scripts/scripts_longgenbench/create_gsm8k_30.sh

Inference in LongGenBench

export CUDA_VISIBLE_DEVICES=$1

method=$2 # Supports ALLKV, PyramidKV, PyramidInfer, SnapKV, H2O, StreamingLLM
max_capacity_prompts=$3
attn_implementation=$4 # Supports "flash_attention_2", "sdpa", "eager"
source_path=$5
model_path=$6
decoding_metric=$7 # Supports None, h2o, and the SCOPE strategies: slide, adaptive, discontinuous
decoding_window_size=$8
save_dir=$9 # Directory where results are saved
K=${10} # e.g. 30 or 60; braces are required for positional parameters above $9
T=${11}

# Note: decoding_recent_size is not assigned above; set it before this command runs.
python3 run_longgenbench.py \
    --method ${method} \
    --model_path ${model_path} \
    --max_capacity_prompts ${max_capacity_prompts} \
    --attn_implementation ${attn_implementation} \
    --save_dir ${save_dir} \
    --use_cache True \
    --K ${K} \
    --decoding_window_size ${decoding_window_size} \
    --decoding_recent_size ${decoding_recent_size} \
    --decoding_metric ${decoding_metric} \
    --max_num_examples ${T}
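For launching runs from Python rather than from a shell wrapper, the same flags can be assembled into an argument list. This is a sketch under assumptions: `build_command` is a hypothetical helper, the flag names are taken from the script above, and every value in the example call (model path, capacities, window sizes, example count) is a placeholder rather than a recommended setting.

```python
import shlex


def build_command(method, model_path, max_capacity_prompts, attn_implementation,
                  save_dir, K, decoding_window_size, decoding_recent_size,
                  decoding_metric, max_num_examples):
    """Assemble the run_longgenbench.py argument list from the flags shown above."""
    flags = {
        "--method": method,
        "--model_path": model_path,
        "--max_capacity_prompts": max_capacity_prompts,
        "--attn_implementation": attn_implementation,
        "--save_dir": save_dir,
        "--use_cache": True,
        "--K": K,
        "--decoding_window_size": decoding_window_size,
        "--decoding_recent_size": decoding_recent_size,
        "--decoding_metric": decoding_metric,
        "--max_num_examples": max_num_examples,
    }
    cmd = ["python3", "run_longgenbench.py"]
    for flag, value in flags.items():
        cmd += [flag, str(value)]  # every value is passed as a string
    return cmd


# Placeholder values for illustration only.
cmd = build_command(
    method="SnapKV", model_path="path/to/model", max_capacity_prompts=2048,
    attn_implementation="flash_attention_2", save_dir="results", K=30,
    decoding_window_size=512, decoding_recent_size=256,
    decoding_metric="slide", max_num_examples=10,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list instead of a single string avoids shell quoting problems with paths that contain spaces.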

Evaluating Accuracy

results_dir=$1

python3 eval_gen.py \
    --results_dir ${results_dir}

Fix the offset bug
Improve the README (expand documentation, add examples, and ensure clarity)
Reorganize the code for a better user experience

Citation

If you find our work valuable, we would appreciate your citation: 🎈

@article{wu2024scope,
  title={SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation},
  author={Wu, Jialong and Wang, Zhenglin and Zhang, Linhai and Lai, Yilong and He, Yulan and Zhou, Deyu},
  journal={arXiv preprint arXiv:2412.13649},
  year={2024}
}

Acknowledgements

Thanks to SnapKV and PyramidKV (KVCache-Factory) for providing open-source code to support the expansion of this project. 🎁
Special thanks to LOOK-M for the beautifully designed README template, which we referenced. 🎨
Shoutout to @Lueci4er on GitHub for valuable suggestions on code details, which we adopted. 🛠️