VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference

May 27, 2026

Showcase

[Side-by-side demo videos: Full Attention vs. VecAttention, for video understanding and video generation]

Abstract

Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, in this paper, we propose VecAttention, a novel framework of vector-wise sparse attention that achieves superior accuracy-efficiency trade-offs for video models. We observe that video attention maps exhibit a strong vertical-vector sparse pattern, and further demonstrate that this vertical-vector pattern offers consistently better accuracy-sparsity trade-offs compared with existing coarse-grained sparse patterns. Based on this observation, VecAttention dynamically selects and processes only informative vertical vectors through a lightweight important-vector selection that minimizes memory access overhead and an optimized kernel of vector sparse attention. Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, and VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65x speedup over full attention and a 1.83x speedup over state-of-the-art sparse attention methods, with comparable accuracy to full attention.

[Figure: VecAttention overview]

[Figure: Attention runtime breakdown across methods]

[Figure: TilingSelect time and HBM access breakdown]

Installation

Prerequisites

  • Linux with CUDA-capable NVIDIA GPU
  • Docker (for containerized setup)
  • uv — Python package manager
  • Git >= 2.13 (for submodule support)

0. Clone the repository

VecAttention uses git submodules for vllm-flash-attention and the DiTEvalKit third-party kernels. Always clone with --recurse-submodules:

git clone --recurse-submodules https://github.com/anminliu/VecAttention.git
cd VecAttention

If you already cloned without submodules, initialize them afterwards:

git submodule update --init --recursive

1. Build and start the Docker container (on the host)

make setup-host
# Optional: pass an HTTP proxy
make setup-host PROXY=http://your-proxy:port

This builds a Docker image tagged vecattention and starts a container named vecattention with full GPU access. The parent directory of the repo is mounted at /workspace so the repo appears at /workspace/VecAttention inside the container. Override defaults with DOCKER_IMAGE, DOCKER_CONTAINER, or HOST_WORKSPACE.

2. Set up the base Python environment (inside the container)

make setup-container

This creates a uv virtual environment (.venv/) and installs the core Python stack:

Package     Version
Python      3.10
PyTorch     2.7.0 (CUDA 12.8)
Triton      git 6e390f3f
FlashInfer  0.3.1.post1

3. Build the custom vLLM Flash Attention backend

VecAttention's sparse attention kernel requires our vllm-flash-attention submodule (a modified version of vllm-project/flash-attention). Make sure submodules were initialized recursively (see step 0), then build it once:

make fainstall
# If you need a clean rebuild:
make faclean && make fainstall

4. Environment for evaluation

# VLM Evaluation (Video-MME, LongVideoBench, VCRBench)
make vlminit
# DiT Evaluation (HunyuanVideo, Wan)
make ditinit

Note: The vlm and dit groups conflict and cannot be installed in the same environment at the same time. Run either make vlminit or make ditinit, not both.

If your nvcc is older than CUDA 12.8, export BLOCK_SPARSE_ATTN_CUDA_ARCHS to your GPU compute capability before running make vlminit/ditinit (e.g. export BLOCK_SPARSE_ATTN_CUDA_ARCHS=80 for A100, 89 for RTX 4090/L40, 90 for H100).
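
To work out the value for a GPU that is not listed, the compute capability can be turned into the two-digit form used in the examples above. This is a minimal sketch: arch_from_capability is a hypothetical helper, not part of the repository, and it assumes the variable takes the major and minor digits concatenated, as the 80/89/90 examples suggest.

```python
def arch_from_capability(major: int, minor: int) -> str:
    """Format a CUDA compute capability such as (8, 9) as an arch string such as "89".

    Hypothetical helper. Assumes BLOCK_SPARSE_ATTN_CUDA_ARCHS expects the major
    and minor digits concatenated, as in the A100 / RTX 4090 / H100 examples.
    """
    if major < 0 or not 0 <= minor <= 9:
        raise ValueError(f"unexpected compute capability: {major}.{minor}")
    return f"{major}{minor}"


# Compute capabilities behind the examples given in the text.
print(arch_from_capability(8, 0))  # A100
print(arch_from_capability(8, 9))  # RTX 4090 / L40
print(arch_from_capability(9, 0))  # H100
```

With PyTorch installed, torch.cuda.get_device_capability() returns the (major, minor) pair for the current device, so arch_from_capability(*torch.cuda.get_device_capability()) yields the value to export.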

Model weights: All scripts default to looking for model weights under /workspace/models/ (the standard Docker mount point). Override globally by setting the MODEL_DIR environment variable, e.g. export MODEL_DIR=/path/to/your/models. Individual scripts also accept --model or --model_id arguments. Similarly, DATASET_DIR controls the dataset root (default: /workspace/datasets/).
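
The lookup described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the repository's code: resolve_dir is a hypothetical helper, and the precedence (command-line value, then environment variable, then default) is assumed rather than confirmed. Only the variable names and default paths come from the text.

```python
import os
from pathlib import Path
from typing import Optional

# Defaults stated in the text; the lookup logic below is an assumption.
DEFAULT_MODEL_DIR = "/workspace/models/"
DEFAULT_DATASET_DIR = "/workspace/datasets/"


def resolve_dir(env_var: str, default: str, cli_value: Optional[str] = None) -> Path:
    """Return cli_value if given, else the environment variable, else the default."""
    if cli_value:
        return Path(cli_value)
    return Path(os.environ.get(env_var, default))


model_root = resolve_dir("MODEL_DIR", DEFAULT_MODEL_DIR)
dataset_root = resolve_dir("DATASET_DIR", DEFAULT_DATASET_DIR)
print(f"models:   {model_root}")
print(f"datasets: {dataset_root}")
```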


Quick Use

import torch
from spattn.src.VecAttention import VecAttention_prefill

bsz, heads, seq_len, dim = 1, 32, 32768, 128
q = torch.randn(bsz, heads, seq_len, dim, dtype=torch.bfloat16, device="cuda")
k = torch.randn(bsz, heads, seq_len, dim, dtype=torch.bfloat16, device="cuda")
v = torch.randn(bsz, heads, seq_len, dim, dtype=torch.bfloat16, device="cuda")

out = VecAttention_prefill(
    q, k, v,
    threshold=0.9,        # MinP gap threshold (higher -> more sparse)
    q_pooling_size=64,    # Q block pooling size (Pq)
    k_local_size=16,      # K local block size (Bk)
    group_k_block=16,     # K-tile group size (Gk)
    causal=True,
    chunk_size=64 * 1024,
)
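
For intuition about what such a call computes, here is a toy NumPy reference for vector-wise sparse attention on a single head. It is a sketch, not the repository's kernel. It reads a "vertical vector" as one key column spanning a block of q_pooling_size queries, and it assumes a min-p style selection rule (keep a column if its pooled probability is at least min_p times the largest), which is consistent with "higher -> more sparse" but is not confirmed by the source. It is non-causal and does not model k_local_size, group_k_block, or chunk_size.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def full_attention(q, k, v):
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax(q @ k.T * scale) @ v


def vector_sparse_attention(q, k, v, q_pooling_size=4, min_p=0.5):
    """Toy single-head, non-causal reference (NOT the repository's kernel).

    For each block of q_pooling_size queries: score every key against the
    mean-pooled query, keep the key columns whose probability is at least
    min_p times the largest one (an ASSUMED rule), then run exact attention
    over the kept columns only. Returns (output, sparsity), where sparsity is
    the skipped fraction of the score matrix.
    """
    seq_len, dim = q.shape
    scale = 1.0 / np.sqrt(dim)
    out = np.zeros_like(v)
    kept_entries = 0
    for start in range(0, seq_len, q_pooling_size):
        q_block = q[start:start + q_pooling_size]
        pooled = q_block.mean(axis=0)                       # one pooled query per block
        probs = softmax(pooled @ k.T * scale)               # cheap importance estimate
        cols = np.nonzero(probs >= min_p * probs.max())[0]  # selected key columns
        scores = q_block @ k[cols].T * scale
        out[start:start + q_pooling_size] = softmax(scores) @ v[cols]
        kept_entries += q_block.shape[0] * cols.size
    return out, 1.0 - kept_entries / (seq_len * seq_len)


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 16)) for _ in range(3))
dense = full_attention(q, k, v)
for min_p in (0.0, 0.3, 0.9):
    out, sparsity = vector_sparse_attention(q, k, v, min_p=min_p)
    err = np.abs(out - dense).max()
    print(f"min_p={min_p:.1f}  sparsity={sparsity:.2f}  max_abs_err={err:.3f}")
```

With min_p=0.0 every column is kept, so the result matches dense attention. Raising min_p drops more columns and trades accuracy for sparsity.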

Demo

The demo/ directory contains three scripts for quickly trying VecAttention:

Vision NIAH demo

Loads a Qwen2.5-VL model, patches its attention layers with VecAttention, and runs a vision needle-in-a-haystack task to verify both correctness and speedup. Requires a long haystack video (≥1h recommended) on disk — pass its path via --haystack_movie_path.

source .venv/bin/activate

# Run with VecAttention
python demo/vision_demo.py --haystack_movie_path /path/to/movie.mp4 --nframe 180 --metric vecattention --threshold 0.87

# Run with full attention (baseline comparison)
python demo/vision_demo.py --haystack_movie_path /path/to/movie.mp4 --nframe 180 --metric full

VLM evaluation demo

Quick pipeline check for video understanding evaluation (requires make vlminit):

bash demo/run_vlm_bench_demo.sh [GPU_IDS] [THRESHOLD] [NUM_SAMPLES]
# Example:
bash demo/run_vlm_bench_demo.sh 0 0.8 4

DiT evaluation demo

Quick pipeline check for video generation evaluation (requires make ditinit):

bash demo/run_dit_bench_demo.sh [GPU_IDS] [BACKEND] [INFER_STEP]
# Example:
bash demo/run_dit_bench_demo.sh 0 wan 2

Evaluation

On VideoMME, at settings that match full-attention accuracy, VecAttention reaches higher effective sparsity and faster attention computation than existing coarse-grained methods, while keeping the overhead of selecting important regions low.

Video Understanding — VLMEvalKit

Setup: make vlminit

Supported models: Qwen2.5-VL-7B (qwenvl), InternVL-3.5-8B (internvl)

Supported benchmarks: VideoMME, LongVideoBench, VCRBench

Note: VLM evaluation uses a uniform threshold across all heads and does not require per-head dynamic programming (DP) threshold tuning. Simply pass a threshold value directly.

Single threshold run:

cd eval/VLMEvalKit

# VecAttention on all benchmarks with Qwen2.5-VL
bash run_single_th.sh vecattention 0 all results qwenvl 0.8

# Full attention baseline
bash run_single_th.sh full 0 all results qwenvl

Multi-threshold sweep (for Pareto curve):

cd eval/VLMEvalKit
bash run_multi_th.sh vecattention 0,1,2,3 all results qwenvl

Results

Performance on video understanding tasks (all values are percentages):

InternVL-3.5-8B (64 frames, ~17K tokens)

Method           Avg. Sparsity  VideoMME  LongVideoBench  VCRBench  Avg. Acc.
Full Attention   0.0            65.7      59.4            32.9      52.7
FlexPrefill      76.5           52.3      59.0            30.0      47.1
XAttention       78.1           56.0      59.9            32.5      49.5
AnchorAttention  78.6           57.4      59.4            31.3      49.4
VecAttention     78.6           60.6      59.0            33.8      51.1

Qwen2.5-VL-7B-Instruct (1 FPS, ~26K tokens)

Method           Avg. Sparsity  VideoMME  LongVideoBench  VCRBench  Avg. Acc.
Full Attention   0.0            63.9      59.9            25.8      49.9
FlexPrefill      73.6           62.0      56.7            22.5      47.1
XAttention       73.6           63.0      58.5            20.0      47.2
AnchorAttention  74.6           64.4      60.8            22.9      49.4
VecAttention     78.5           64.8      59.4            25.4      49.9
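
One way to read these tables, or the output of a multi-threshold sweep, is as points on an accuracy-sparsity plane, where a method is on the Pareto front if no other method is at least as good on both axes and strictly better on one. The sketch below applies that definition to the averaged InternVL-3.5-8B numbers above. pareto_front is a hypothetical helper written for illustration and is not part of the evaluation scripts.

```python
def pareto_front(points):
    """Return the names of points not dominated in (sparsity, accuracy).

    Higher is better on both axes. A point is dominated if some other point is
    at least as good on both axes and strictly better on at least one.
    """
    front = []
    for name, sparsity, acc in points:
        dominated = any(
            s2 >= sparsity and a2 >= acc and (s2 > sparsity or a2 > acc)
            for _, s2, a2 in points
        )
        if not dominated:
            front.append(name)
    return front


# (method, avg. sparsity %, avg. accuracy %) copied from the InternVL-3.5-8B table.
internvl = [
    ("Full Attention", 0.0, 52.7),
    ("FlexPrefill", 76.5, 47.1),
    ("XAttention", 78.1, 49.5),
    ("AnchorAttention", 78.6, 49.4),
    ("VecAttention", 78.6, 51.1),
]
print(pareto_front(internvl))  # ['Full Attention', 'VecAttention']
```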

Video Generation — DiTEvalKit

Setup: make ditinit

Supported backends: HunyuanVideo-T2V-13B (hyvideo), Wan2.1-T2V-14B (wan)

DiT evaluation supports two modes for VecAttention: a uniform threshold (vecattention_wo_DP) that requires no setup, and per-head DP-tuned thresholds (vecattention) that yield better accuracy-sparsity trade-offs. The paper results use DP-tuned thresholds.

Without DP:

# VecAttention with uniform threshold on Wan
bash eval/DiTEvalKit/run_single_th.sh 0 vecattention_wo_DP wan 0.001

# Dense baseline
bash eval/DiTEvalKit/run_single_th.sh 0 dense wan

# XAttention baseline
bash eval/DiTEvalKit/run_single_th.sh 0 xattn hyvideo 0.9

With DP (per-head threshold tuning):

To reproduce the paper's DiT results, you need to run DP threshold tuning for each model. The process has three steps:

Step 1 — Dump QK matrices from a few reference prompts:

# For hyvideo (use --backend wan --dump-step 25 for Wan)
bash spattn/threshold/dump_qk_layers.sh \
    --backend hyvideo --dump-step 5 --gpus 0,1,2 --prompt-ids 0,1,2

This saves QK activations to spattn/threshold/QK_Cache/<Model>/.

Step 2 — Tune per-head thresholds via dynamic programming:

bash spattn/threshold/tune_cmd.sh 0,1,2 --model HunyuanVideo
# For Wan:
# bash spattn/threshold/tune_cmd.sh 0,1,2 --model Wan

tune_cmd.sh defaults to --prompt_to_list 0 1 2. If you dumped a different set of prompts in Step 1, override it via --extra-common-args "--prompt_to_list <ids>" (e.g. --extra-common-args "--prompt_to_list 0" when you only dumped prompt 0).

This produces per-head threshold files under spattn/threshold/tune_cache/<Model>/. You can optionally clean up the QK cache after tuning to free disk space:

rm -rf spattn/threshold/QK_Cache/<Model>/
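
For intuition about what per-head tuning by dynamic programming can look like, here is a small sketch. The actual objective and algorithm used by tune_cmd.sh are not described here, so everything below is an assumption: each head is given a few candidate thresholds with a measured sparsity and error, and the DP picks one candidate per head so that total sparsity meets a target at the lowest total error. tune_per_head and the candidate numbers are hypothetical.

```python
def tune_per_head(options, target_units):
    """Pick one candidate per head, minimising total error under a sparsity target.

    options[h] is a list of (threshold, sparsity_units, error) candidates for
    head h, where sparsity_units is an integer-quantised sparsity (e.g. percent).
    Returns (total_error, [threshold per head]), or None if the target cannot
    be reached. ASSUMED formulation, not the repository's tuning code.
    """
    # best[u] = (lowest total error, chosen thresholds) with exactly u units so far
    best = {0: (0.0, [])}
    for head_options in options:
        nxt = {}
        for units, (err, picks) in best.items():
            for threshold, sparsity_units, error in head_options:
                total = units + sparsity_units
                cand = (err + error, picks + [threshold])
                if total not in nxt or cand[0] < nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    feasible = [state for units, state in best.items() if units >= target_units]
    return min(feasible, key=lambda state: state[0]) if feasible else None


# Made-up candidates for two heads: (threshold, sparsity in percent, error).
options = [
    [(0.5, 40, 0.10), (0.9, 80, 0.50)],  # head 0 is sensitive to sparsity
    [(0.5, 50, 0.05), (0.9, 90, 0.10)],  # head 1 tolerates sparsity well
]
# Target: 65% average sparsity over 2 heads = 130 units in total.
print(tune_per_head(options, target_units=130))
```

In this toy case the DP gives the tolerant head the aggressive threshold and the sensitive head the mild one, which is the kind of trade a uniform threshold cannot make.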

Step 3 — Run evaluation with DP thresholds:

# The 'vecattention' method (without _wo_DP) automatically loads threshold files
# The 4th argument is the target sparsity (matches threshold filename)
bash eval/DiTEvalKit/run_single_th.sh 0 vecattention wan 0.550

If you want classifier-free guidance (CFG): CFG runs separate positive and negative attention passes, each needing its own threshold file. Run Step 1 twice — once with --dump-subdir positive and once with --dump-subdir negative — then repeat Step 2 twice with the matching --subdir. The vecattention method will auto-load both at evaluation time. To skip CFG and reuse a single threshold for both passes, set GUIDANCE_SCALE=0 when invoking the eval scripts.

Multi-threshold Pareto sweep:

bash eval/DiTEvalKit/run_multi_th.sh 0,1 "vecattention,dense,xattn,svg" "hyvideo,wan"

Calculate metrics (after inference completes):

Single-directory mode — compare one sparse-attention output directory against a dense reference:

python eval/DiTEvalKit/calc_metrics.py \
    --ref_dir path/to/dense \
    --out_dir path/to/vecattention

Simple batch mode — auto-discover and evaluate every output directory under a method, using the dense run as the reference:

python eval/DiTEvalKit/calc_metrics.py --method vecattention
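
Of the reported metrics, PSNR is the simplest to state: it is 10 * log10(MAX^2 / MSE) between the reference and the output. The sketch below shows that standard formula on flat sequences of pixel values. How calc_metrics.py aggregates over frames and videos, and which value range it uses, is not shown here, so the psnr helper and max_value=255 are assumptions.

```python
import math


def psnr(reference, output, max_value=255.0):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(reference) != len(output) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


ref = [0, 64, 128, 255]
out = [1, 65, 129, 254]  # every pixel off by one, so MSE = 1
print(round(psnr(ref, out), 2))  # 48.13
```

Higher PSNR and SSIM and lower LPIPS indicate output closer to the dense reference.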

Results

Performance on video generation tasks (50 inference steps, 6% warm-up ratio):

Wan 2.1-T2V-14B (720P, ~76K tokens)

Method        Sparsity (%)  PSNR  SSIM   LPIPS
XAttention    54.6          19.7  0.658  0.348
SVG           52.2          18.7  0.639  0.381
VecAttention  52.3          19.7  0.668  0.339

HunyuanVideo-T2V-13B (720P, ~119K tokens)

Method        Sparsity (%)  PSNR  SSIM   LPIPS
XAttention    60.8          21.2  0.734  0.348
SVG           60.1          21.8  0.769  0.326
VecAttention  62.1          22.8  0.779  0.330

Efficiency

Attention layer latency benchmark (requires Qwen2.5-VL-7B-Instruct weights):

source .venv/bin/activate
python spattn/benchmarks/bench_attn_layer.py

Compares full attention vs VecAttention (and optionally XAttention, FlexPrefill, AnchorAttention) across sequence lengths [8K, 16K, 32K, 64K, 128K]. Edit the methods_to_compare list in __main__ to enable additional methods.
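
As a rough guide to what to expect across these sequence lengths, the matmul cost of attention grows quadratically with sequence length, and sparsity removes a proportional share of it. The sketch below is back-of-the-envelope arithmetic, not output of the benchmark. attention_flops is a hypothetical helper, the head count and head dimension are taken from the Quick Use example, and the 78.5% sparsity is the Qwen2.5-VL average from the results table.

```python
def attention_flops(seq_len, head_dim, num_heads=1, sparsity=0.0):
    """Approximate matmul FLOPs for non-causal attention: QK^T plus PV.

    Each of the two matmuls costs 2 * seq_len^2 * head_dim FLOPs per head.
    sparsity is the fraction of the score matrix that is skipped.
    """
    dense = 4 * seq_len * seq_len * head_dim * num_heads
    return dense * (1.0 - sparsity)


for n_k in (8, 16, 32, 64, 128):
    n = n_k * 1024
    dense = attention_flops(n, 128, num_heads=32)
    sparse = attention_flops(n, 128, num_heads=32, sparsity=0.785)
    print(f"{n_k:>3}K  dense={dense / 1e12:8.2f} TFLOPs  sparse={sparse / 1e12:8.2f} TFLOPs")
```

The ratio between the two columns is an upper bound on the speedup. Measured latency also includes the important-vector selection step and memory traffic, so the benchmark will report less.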

Kernel autotuning: We ship pretuned Triton kernel configs under spattn/src/kernels/cache/ (for both vlm and dit environments). To retune on your own GPU for potentially better performance, run:

python spattn/src/kernels/vecattention_kernels.py

Citation

@inproceedings{vecattention2026,
  title     = {{VecAttention}: Vector-wise Sparse Attention for Accelerating Long Context Inference},
  author    = {Anmin Liu and Ruixuan Yang and Huiqiang Jiang and Bin Lin and Minmin Sun and Yong Li and Chen Zhang and Tao Xie},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Acknowledgements

Thanks to: