TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

April 22, 2026

Paper | Project Page | License | Python 3.10+

Compress KV cache by 10.7x and boost throughput by 2.5x on long reasoning tasks, matching Full Attention accuracy on AIME25 (40.8 vs 40.8).

Weian Mao1*, Xi Lin3*, Wei Huang2*, Yuxin Xie1, Tianfu Fu1, Bohan Zhuang3, Song Han1,2, Yukang Chen2

1MIT, 2NVIDIA, 3ZJU    *Equal contribution

https://github.com/user-attachments/assets/768e59bb-897e-41bf-81b8-e7376aa72056

News

  • [2026-04-21] SGLang backend support added — TriAttention now runs on SGLang in addition to vLLM. See SGLang Integration.
  • [2026-04-14] Community DGX Spark (GB10/sm-121) enablement by @dscain — vLLM support merged, non-vLLM path in progress.
  • [2026-04-12] TriAttention now supports AR video generation with KV cache compression. See LongLive README.
  • [2026-04-11] Community C/ggml port for llama.cpp (HIP/ROCm) by @domvox — enables TriAttention on AMD GPUs via llama.cpp, with ~6.8× KV reduction when composed with TurboQuant. See triattention-ggml.
  • [2026-04-09] Experimental MLX and TurboQuant support for Apple Silicon (M1/M2/M3/M4) — thanks to @DeadByDawn101 (RavenX AI) for proposing and contributing this feature.

Highlights

  • 2.5x throughput on AIME25 long reasoning while matching Full Attention accuracy (40.8 vs 40.8)
  • 10.7x KV memory reduction with trigonometric frequency-domain compression
  • OpenClaw compatible — enables local deployment on 24GB RTX 4090

TriAttention achieves 2.5x higher throughput and 10.7x KV memory reduction on AIME25 while matching Full Attention accuracy.
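To give a feel for what a reduction of this size means in bytes, here is a back-of-the-envelope sketch. The model shape (36 layers, 8 KV heads, head dimension 128), the bfloat16 storage, and the 32768-token full-attention context are assumptions made for illustration; the 3072-token budget is the AIME25 setting listed under Throughput. The paper remains the authority on how the headline figure was measured.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold keys and values for `tokens` positions of one sequence."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens  # 2 = K and V

# Assumed Qwen3-8B shape; check the model's config.json before relying on it.
SHAPE = dict(layers=36, kv_heads=8, head_dim=128)

full = kv_cache_bytes(32768, **SHAPE)       # assumed full-attention context length
compressed = kv_cache_bytes(3072, **SHAPE)  # AIME25 budget from the throughput table

print(f"full:       {full / 2**30:.2f} GiB")
print(f"compressed: {compressed / 2**30:.2f} GiB")
print(f"reduction:  {full / compressed:.1f}x")
```

Under these assumptions the per-sequence cache shrinks from about 4.5 GiB to about 0.42 GiB, a ratio that rounds to 10.7x.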

How It Works

Pre-RoPE Q/K vectors in long reasoning models concentrate around fixed centers that determine distance preferences via a trigonometric series. TriAttention scores keys using these centers and norms instead of requiring representative query selection, enabling accurate KV cache compression without the overhead of existing attention-based methods.
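The "trigonometric series" follows from how RoPE works: once a query and a key are rotated by their positions, their dot product depends only on the distance between them, as a sum of cosines with one term per frequency band. The sketch below checks that standard identity numerically. It is not TriAttention's scoring kernel; the vectors are made-up values, `q_center` merely stands in for a query center, and the function names are hypothetical.

```python
import cmath
import math

def rope(vec, pos, thetas):
    """Apply RoPE: rotate each 2-D pair (stored as one complex number) by pos * theta."""
    return [v * cmath.exp(1j * pos * t) for v, t in zip(vec, thetas)]

def logit(q, k):
    """Real-valued dot product of two vectors stored as complex pairs."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))

def trig_series(q, k, distance, thetas):
    """The same logit written as sum_f |q_f| |k_f| cos(theta_f * distance + phase_f)."""
    total = 0.0
    for a, b, t in zip(q, k, thetas):
        phase = cmath.phase(a * b.conjugate())
        total += abs(a) * abs(b) * math.cos(t * distance + phase)
    return total

# Illustrative pre-RoPE vectors (made-up values, three frequency bands).
q_center = [1.0 + 0.5j, -0.3 + 0.8j, 0.6 - 0.2j]
key = [0.4 - 0.7j, 0.9 + 0.1j, -0.5 + 0.5j]
thetas = [10000 ** (-i / len(key)) for i in range(len(key))]

query_pos, key_pos = 120, 75
direct = logit(rope(q_center, query_pos, thetas), rope(key, key_pos, thetas))
series = trig_series(q_center, key, query_pos - key_pos, thetas)
print(abs(direct - series) < 1e-9)
```

Because the series depends only on the pre-RoPE vectors and the distance, a key can be scored against a fixed center without picking representative queries at run time, which is the property the paragraph above describes.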

Documentation

Deploy with OpenClaw

TriAttention's vLLM server exposes an OpenAI-compatible API, which means you can use it directly as a custom provider in OpenClaw.

Quick Setup

  1. Follow the Installation instructions, then start a vLLM server with the recommended settings below.
  2. In OpenClaw, add a custom provider pointing to your vLLM server (e.g. http://localhost:8000/v1).

For manual configuration or troubleshooting, see the OpenClaw Manual Configuration Guide.

Interactive chat workloads differ from offline benchmarks — conversations are long-running and prefill chunks can trigger compression at unexpected points. We recommend the following adjustments:

# Required: path to precomputed frequency statistics
export TRIATTN_RUNTIME_SPARSE_STATS_PATH=triattention/vllm/stats/qwen3_32b_int4_stats.pt

# Use a larger KV budget for multi-turn chat (default: 2048)
export TRIATTN_RUNTIME_KV_BUDGET=12000

vllm serve <model_path> \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --enforce-eager \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --max-num-batched-tokens 1024

Key differences from the default server mode:

  • --no-enable-prefix-caching: Prefix caching is currently incompatible with KV compression; disable it to avoid incorrect cache hits on compressed entries.
  • --max-num-batched-tokens 1024: Limits the prefill chunk size. Large chunks can overshoot the KV budget in a single step before compression has a chance to trigger, leading to OOM.
  • TRIATTN_RUNTIME_KV_BUDGET=12000: Chat sessions accumulate context across many turns; a larger budget (e.g. 12k) keeps more history available and avoids aggressive eviction.
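The interaction between chunk size and KV budget can be illustrated with a toy model. This is a simplified sketch, not the plugin's scheduler: it assumes tokens arrive in fixed-size steps and that the cache is pruned back to the budget only after a step completes. The function name and its pruning rule are hypothetical.

```python
def peak_cache_tokens(total_tokens, step_tokens, budget=2048, divide_length=128):
    """Toy model of cache growth.

    Tokens arrive `step_tokens` at a time. After each step the cache is pruned
    back to `budget` if at least `divide_length` new tokens have accumulated
    since the last pruning. Returns the largest cache size observed.
    """
    cached = since_prune = peak = 0
    remaining = total_tokens
    while remaining > 0:
        step = min(step_tokens, remaining)
        remaining -= step
        cached += step
        since_prune += step
        peak = max(peak, cached)  # the whole step must fit before pruning runs
        if since_prune >= divide_length and cached > budget:
            cached = budget
            since_prune = 0
    return peak

# A 32768-token prompt with a 12000-token budget, under the toy model's assumptions.
print(peak_cache_tokens(32768, step_tokens=8192, budget=12000))
print(peak_cache_tokens(32768, step_tokens=1024, budget=12000))
```

In this model the peak is roughly the budget plus one chunk (20192 tokens with 8192-token chunks versus 13024 with 1024-token chunks), which is why a smaller --max-num-batched-tokens leaves more memory headroom.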

Installation

git clone https://github.com/WeianMao/triattention.git
cd triattention
pip install -e .
pip install flash-attn --no-build-isolation  # recommended (takes about 105 minutes on DGX Spark / GB10)

Quick Start

python scripts/cli.py run-one \
    --model Qwen3-8B \
    --dataset aime24 \
    --method triattention \
    --budget 2048

Datasets

Benchmark datasets (AIME 2024, AIME 2025, MATH-500) are automatically downloaded from HuggingFace on first run -- no manual data preparation is needed. The evaluation scripts handle downloading, caching, and formatting transparently.

Supported Models

| Model | HuggingFace ID | Status |
| --- | --- | --- |
| Qwen3-8B | Qwen/Qwen3-8B | Verified |
| DeepSeek-R1-Distill-Llama-8B | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Verified |
| DeepSeek-R1-Distill-Qwen-7B | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | Verified |

Results

AIME24 / AIME25 accuracy (KV budget = 2048; 512 for DS-Llama-8B)

| Method | Qwen3-8B | DS-Llama-8B | DS-Qwen-7B | GPT-OSS-20B |
| --- | --- | --- | --- | --- |
| Full Attention | 57.1 / 40.8 | 50.4 / 31.4 | 43.8 / 34.2 | 69.2 / 60.0 |
| SnapKV | 34.6 / 20.0 | 5.0 / 6.7 | 34.6 / 25.0 | 48.3 / 36.7 |
| R-KV | 25.4 / 17.5 | 25.8 / 11.2 | 34.6 / 23.3 | 49.6 / 39.2 |
| TriAttention | 42.1 / 32.9 | 33.8 / 19.6 | 42.5 / 30.0 | 59.2 / 49.2 |

Throughput (Qwen3-8B, tokens/sec)

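| Benchmark | TriAttn Budget | Full Acc | TriAttn Acc | Full Throughput | TriAttn Throughput | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| MATH-500 | 1024 | 69.6 | 68.4 | 222.8 | 1405.2 | 6.3x |
| AIME24 | 4096 | 57.1 | 54.6 | 222.8 | 413.9 | 1.9x |
| AIME25 | 3072 | 40.8 | 40.8 | 222.8 | 563.5 | 2.5x |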
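The Speedup column is the ratio of the two throughput columns. The short sketch below recomputes it from the numbers in the table and also shows the accuracy change at each budget; the `summarize` helper is just an illustrative name.

```python
ROWS = [
    # benchmark, budget, full acc, TriAttn acc, full tok/s, TriAttn tok/s
    ("MATH-500", 1024, 69.6, 68.4, 222.8, 1405.2),
    ("AIME24", 4096, 57.1, 54.6, 222.8, 413.9),
    ("AIME25", 3072, 40.8, 40.8, 222.8, 563.5),
]

def summarize(row):
    """Return (benchmark, speedup, accuracy change vs. Full Attention)."""
    name, _budget, full_acc, tri_acc, full_tps, tri_tps = row
    return name, round(tri_tps / full_tps, 1), round(tri_acc - full_acc, 1)

for row in ROWS:
    name, speedup, delta = summarize(row)
    print(f"{name}: {speedup}x speedup, accuracy change {delta:+.1f}")
```

The trade-off is visible in the output: the smallest budget gives the largest speedup with a 1.2-point accuracy drop on MATH-500, while the AIME25 setting matches Full Attention exactly.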

See docs/results.md for complete results including MATH-500 accuracy table, accuracy vs. budget curves, and DFS memory retention analysis.

vLLM Integration

TriAttention includes a vLLM plugin that enables transparent KV cache compression for production deployment. After installation, vLLM automatically discovers and activates the plugin -- no code changes required.

Server Mode (OpenAI-Compatible API)

# Set compression parameters
export TRIATTN_RUNTIME_KV_BUDGET=2048
export TRIATTN_RUNTIME_SPARSE_STATS_PATH=triattention/vllm/stats/qwen3_32b_int4_stats.pt

# Launch vLLM server -- TriAttention activates automatically. Set `ENABLE_TRIATTENTION=0` to disable.
vllm serve <model_path> \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --enforce-eager \
    --trust-remote-code \
    --no-enable-prefix-caching

# Use the standard OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<model_path>", "messages": [{"role": "user", "content": "Solve: ..."}]}'
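The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server from the example above is listening on localhost:8000, and the helper names `build_chat_request` and `chat` are illustrative, not part of TriAttention or vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same server the curl example targets

def build_chat_request(model, prompt, max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(model, prompt):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# With the server running:
# print(chat("<model_path>", "Solve: ..."))
```

Any OpenAI-compatible client library should work the same way once it is pointed at the server's base URL.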

vLLM with DGX Spark / GB10

To enable vLLM on DGX Spark / GB10, run these installation steps instead:

uv venv
. .venv/bin/activate
uv pip install --index-url https://download.pytorch.org/whl/cu130 torch torchvision torchaudio
uv pip install \
  https://github.com/vllm-project/vllm/releases/download/v0.19.0/vllm-0.19.0-cp38-abi3-manylinux_2_31_aarch64.whl \
  --extra-index-url https://download.pytorch.org/whl/cu130 \
  --extra-index-url https://pypi.org/simple \
  --index-strategy unsafe-best-match
uv pip install -e .

export TRITON_CACHE_DIR=~/.cache/.triton-cache
mkdir -p "$TRITON_CACHE_DIR"

PY_SITE=$(.venv/bin/python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")  # Or adjust as needed to your environment
export LD_LIBRARY_PATH="$PY_SITE/torch/lib:$PY_SITE/nvidia/cu13/lib:/usr/local/cuda/targets/sbsa-linux/lib:${LD_LIBRARY_PATH:-}"

vllm serve Qwen/Qwen3-8B \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --enforce-eager \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.7

Verify the first vLLM log line is [TriAttention] Runtime (V2) plugin activated: patch_scheduler=True patch_worker=True.

curl http://127.0.0.1:8000/v1/models
curl http://127.0.0.1:8000/v1/completions -H 'Content-Type: application/json' -d '{"model":"Qwen/Qwen3-8B","prompt":"hello","max_tokens":16}'

Python API

from triattention.vllm.runtime.integration_monkeypatch import (
    install_vllm_integration_monkeypatches,
)

# Install patches before creating the LLM instance
install_vllm_integration_monkeypatches(patch_scheduler=True, patch_worker=True)

# Standard vLLM API -- compression happens transparently
from vllm import LLM, SamplingParams

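llm = LLM(
    model="<model_path>",
    dtype="bfloat16",
    max_model_len=32768,
    enforce_eager=True,
    trust_remote_code=True,
    enable_prefix_caching=False,  # as in server mode: prefix caching is incompatible with KV compression
)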

outputs = llm.generate(["Your prompt here"], SamplingParams(temperature=0.6, top_p=0.95))
print(outputs[0].outputs[0].text)

Configuration Reference

| Environment Variable | Default | Description |
| --- | --- | --- |
| TRIATTN_RUNTIME_KV_BUDGET | 2048 | Maximum tokens retained in KV cache per request |
| TRIATTN_RUNTIME_DIVIDE_LENGTH | 128 | Compression trigger interval (every N new tokens) |
| TRIATTN_RUNTIME_WINDOW_SIZE | 128 | Recent tokens always preserved |
| TRIATTN_RUNTIME_PRUNING_MODE | per_head | Token selection strategy (per_head or per_layer_per_head) |
| TRIATTN_RUNTIME_SPARSE_STATS_PATH | -- | Path to precomputed frequency statistics .pt file |
| TRIATTN_RUNTIME_PROTECT_PREFILL | false | Protect initial prompt tokens from eviction |
| TRIATTN_RUNTIME_ENABLE_EXPERIMENTAL_KV_COMPACTION | true | Enable in-place KV cache compaction |
| TRIATTN_RUNTIME_ENABLE_EXPERIMENTAL_BLOCK_RECLAIM | true | Enable freed block reclamation |
| ENABLE_TRIATTENTION | true | Master switch to enable/disable the plugin |
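Before launching a server it can help to print the settings the environment will produce. The sketch below is a hypothetical pre-flight helper, not part of the plugin: the defaults are copied from the table above, while the boolean parsing rules and the two sanity checks are assumptions about sensible values rather than the plugin's documented behavior.

```python
import os

# Defaults copied from the configuration table; parsing rules are assumptions.
DEFAULTS = {
    "TRIATTN_RUNTIME_KV_BUDGET": 2048,
    "TRIATTN_RUNTIME_DIVIDE_LENGTH": 128,
    "TRIATTN_RUNTIME_WINDOW_SIZE": 128,
    "TRIATTN_RUNTIME_PRUNING_MODE": "per_head",
    "TRIATTN_RUNTIME_PROTECT_PREFILL": False,
    "ENABLE_TRIATTENTION": True,
}

def effective_settings(env=None):
    """Merge environment overrides into the documented defaults."""
    env = os.environ if env is None else env
    settings = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        if raw is None:
            settings[name] = default
        elif isinstance(default, bool):  # checked before int: bool is an int subclass
            settings[name] = raw.strip().lower() in {"1", "true", "yes", "on"}
        elif isinstance(default, int):
            settings[name] = int(raw)
        else:
            settings[name] = raw
    settings["TRIATTN_RUNTIME_SPARSE_STATS_PATH"] = env.get("TRIATTN_RUNTIME_SPARSE_STATS_PATH")
    return settings

def sanity_check(settings):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if not settings["TRIATTN_RUNTIME_SPARSE_STATS_PATH"]:
        problems.append("TRIATTN_RUNTIME_SPARSE_STATS_PATH is not set")
    if settings["TRIATTN_RUNTIME_WINDOW_SIZE"] >= settings["TRIATTN_RUNTIME_KV_BUDGET"]:
        problems.append("window size should be smaller than the KV budget")
    return problems

for problem in sanity_check(effective_settings()):
    print("warning:", problem)
```

Passing a plain dictionary instead of the real environment makes the helper easy to try out, for example `effective_settings({"TRIATTN_RUNTIME_KV_BUDGET": "12000"})`.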

Precomputed Statistics

TriAttention requires precomputed Q/K frequency statistics for scoring. We provide pre-calibrated stats for supported models in triattention/vllm/stats/. See the Calibration Guide for generating stats for custom models.

Roadmap

  • vLLM integration
  • SGLang integration
  • Ollama integration
  • Support for more model architectures

Community Implementations

Independent ports and integrations maintained by the community:

| Project | Stack | Maintainer | Notes |
| --- | --- | --- | --- |
| triattention-ggml | C/ggml, llama.cpp (HIP/ROCm) | @domvox | AMD GPU support; composes with TurboQuant (~6.8× KV reduction). Includes pre-built calibration stats for Qwen3 family. |

Note: Community projects are independently maintained and not officially supported. Please direct questions and issues to each project's own issue tracker.

Citation

@article{mao2026triattention,
    title={TriAttention: Efficient Long Reasoning with Trigonometric KV Compression},
    author={Weian Mao and Xi Lin and Wei Huang and Yuxin Xie and Tianfu Fu and Bohan Zhuang and Song Han and Yukang Chen},
    year={2026},
    eprint={2604.04921},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Acknowledgements

We thank the following projects for their contributions and inspiration: R-KV | SnapKV

@DeadByDawn101 (RavenX AI) — MLX port for Apple Silicon

@kishan5111 — GPT-OSS-120B model integration

@dscain — DGX Spark (GB10) enablement for vLLM and non-vLLM paths

License

This project is licensed under the Apache License 2.0. See LICENSE for details.