MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

April 16, 2026

MaMe (Matrix-based token Merging) is a training-free, parameter-free token compression method for Vision Transformers. MaRe (Matrix-based token Restoration) is its inverse operation; together they form a MaMe + MaRe pipeline for generation tasks. All operations are expressed as dense matrix multiplications and element-wise operations, so the pipeline is fully differentiable and GPU-friendly and needs no discrete sorting, top-k selection, or clustering.
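As a rough illustration of what "merging and restoration as matrix multiplications" can look like, the sketch below builds a merge matrix `M` and a restore matrix `U` from a fixed token-to-group assignment. The assignment, the helper name `merge_restore_matrices`, and the mean-pooling choice are assumptions made for this example only; how MaMe actually derives its matrices from token similarity is described in the paper.

```python
import numpy as np

def merge_restore_matrices(assign, num_groups):
    """Build a (k, n) merge matrix and an (n, k) restore matrix.

    assign[i] is the (hypothetical) group index of token i; every group
    must contain at least one token.
    """
    assign = np.asarray(assign)
    # One-hot membership: U[i, g] = 1 if token i belongs to group g.
    U = np.eye(num_groups)[assign]          # (n, k)
    # Dividing each row by its group size makes M average the group.
    M = U.T / U.sum(axis=0)[:, None]        # (k, n)
    return M, U

tokens = np.array([[1.0, 0.0],
                   [3.0, 0.0],
                   [0.0, 2.0],
                   [0.0, 4.0],
                   [5.0, 5.0]])
assign = [0, 0, 1, 1, 2]                    # illustrative grouping

M, U = merge_restore_matrices(assign, 3)
merged = M @ tokens      # 5 tokens -> 3 tokens (group means)
restored = U @ merged    # 3 tokens -> 5 tokens (each token gets its group mean)
print(merged)
print(restored.shape)
```

In this toy setup `M @ U` is the identity, so merging a restored sequence returns the merged tokens unchanged, while `U @ M` replaces every token by the mean of its group.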


Overview

| Property | MaMe / MaRe |
| --- | --- |
| Training required | ✗ (plug-and-play) |
| Extra parameters | ✗ |
| Differentiable | ✓ |
| Irregular indexing / gather-scatter | ✗ |
| Adaptive compression ratio | ✓ (input-driven, threshold τ) |
| Token restoration | ✓ (MaRe) |

ViT Image Classification

MaMe is applied to the first 8 layers of pre-trained ViT models without any fine-tuning. More results (fine-tuning, zero-shot classification, video) are available in the paper.

Training-Free Results (ImageNet-1K, A100, batch 1024, FP16)

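| Model | Method | FLOPs (G) | Throughput (img/s) | Top-1 (%) |
| --- | --- | --- | --- | --- |
| ViT-S (DeiT) | Baseline | 4.6 | 5039 | 79.82 |
| | EViT | 2.3 | 8950 | 73.83 |
| | ToMe | 2.3 | 8874 | 77.99 |
| | DiffRate | 2.3 | 8875 | 78.75 |
| | MaMe | 2.3 | 9015 | 78.61 |
| ViT-B (DeiT) | Baseline | 17.6 | 2130 | 81.83 |
| | EViT | 8.7 | 4230 | 74.61 |
| | ToMe | 8.8 | 4023 | 77.84 |
| | DiffRate | 8.7 | 4124 | 78.98 |
| | MaMe | 8.7 | 4117 | 79.80 |
| ViT-B (MAE) | Baseline | 17.6 | 2130 | 83.72 |
| | MaMe | 8.7 | 5418 | 79.83 |
| ViT-L (MAE) | Baseline | 61.6 | 758 | 85.95 |
| | MaMe | 31.0 | 2764 | 84.81 |
| ViT-H (MAE) | Baseline | 167.4 | 299 | 86.88 |
| | MaMe | 93.2 | 908 | 85.51 |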
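To make the speed/accuracy trade-off easier to compare, the short script below turns the ViT-B (DeiT) rows of the results above into throughput speedups and Top-1 drops relative to the baseline. The numbers are copied from the table; the helper name `summarize` is just for this example.

```python
# (throughput in img/s, top-1 in %) from the ViT-B (DeiT) rows.
baseline = (2130, 81.83)
methods = {
    "EViT": (4230, 74.61),
    "ToMe": (4023, 77.84),
    "DiffRate": (4124, 78.98),
    "MaMe": (4117, 79.80),
}

def summarize(baseline, methods):
    """Return {method: (speedup over baseline, top-1 drop in points)}."""
    base_tp, base_acc = baseline
    return {
        name: (round(tp / base_tp, 2), round(base_acc - acc, 2))
        for name, (tp, acc) in methods.items()
    }

summary = summarize(baseline, methods)
for name, (speedup, drop) in summary.items():
    print(f"{name:9s} speedup {speedup:.2f}x, top-1 drop {drop:.2f} pts")
```

The same function can be pointed at any other model block in the table by swapping in its baseline and method rows.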

Visualisation

The figure below shows the token progression through the first 8 blocks of AugReg ViT-B/16 with MaMe. Each colour square represents a distinct token.

Token merging progression on a bird standing on wood (single-target scene)


COCO Caption (LLaVA-1.5-7B)

MaMe is applied to the visual encoder of LLaVA-1.5-7B, reducing the number of visual tokens fed into the language model. Evaluated with VLMEvalKit on COCO Caption.

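| Method | Latency (s) | Bleu-1 | Bleu-2 | Bleu-3 | Bleu-4 | ROUGE_L | CIDEr |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-7B | 3.12 | 20.72 | 13.28 | 8.08 | 4.93 | 20.94 | 0.71 |
| + ToMe (r=8) | 2.30 | 20.15 | 12.90 | 7.90 | 4.89 | 21.67 | 1.60 |
| + MaMe (τ=0.8) | 2.61 | 20.10 | 12.87 | 7.87 | 4.83 | 21.69 | 2.71 |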

MaMe achieves a CIDEr score of 2.71, about 3.8× the baseline (0.71) and 69% higher than ToMe (1.60), while reducing latency by 16% relative to the baseline (3.12 s to 2.61 s). The gains are attributed to MaMe's high-pass effect: by merging redundant low-frequency tokens and preserving distinctive detail tokens, it gives the language model a more focused visual representation, which leads to more spatially precise and complete descriptions.
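The ratios quoted above follow directly from the CIDEr and latency columns of the table; a quick check, with variable names chosen only for this example:

```python
# CIDEr and latency (seconds) copied from the COCO Caption table.
cider = {"baseline": 0.71, "ToMe": 1.60, "MaMe": 2.71}
latency = {"baseline": 3.12, "ToMe": 2.30, "MaMe": 2.61}

ratio_vs_baseline = cider["MaMe"] / cider["baseline"]      # multiplicative gain
gain_vs_tome = cider["MaMe"] / cider["ToMe"] - 1           # relative gain
latency_cut = 1 - latency["MaMe"] / latency["baseline"]    # relative reduction

print(f"CIDEr vs baseline: {ratio_vs_baseline:.1f}x")
print(f"CIDEr vs ToMe:     +{gain_vs_tome:.0%}")
print(f"Latency reduction: {latency_cut:.0%}")
```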

Qualitative Example

COCO Caption evaluation image — plate with fish and broccoli


Image Generation

For diffusion models, MaMe is inserted inside each transformer block of the U-Net/DiT to shorten the self-attention sequence, and MaRe follows to restore the full spatial resolution before the residual connection. The MaMe+MaRe pipeline reduces per-step latency, while the high-pass effect of MaMe enhances high-frequency texture details in the generated image. Subjectively, images generated with MaMe+MaRe also tend to look more artistic.
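The placement described above (merge, attend on the shorter sequence, restore, then add the residual) can be sketched in a few lines of NumPy. Everything here is a simplified assumption for illustration: the attention is single-head with identity projections, and the merge matrix `M` and restore matrix `U` simply average and copy fixed pairs of neighbouring tokens rather than being derived from similarity as in MaMe/MaRe.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    """Single-head attention with identity projections (illustration only)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def block_with_merge(x, M, U):
    """x: (n, d) tokens; M: (k, n) merge matrix; U: (n, k) restore matrix."""
    merged = M @ x                      # shorten the sequence from n to k
    attended = self_attention(merged)   # attention runs on k tokens only
    restored = U @ attended             # back to n tokens
    return x + restored                 # residual at full resolution

# Hypothetical fixed pairing: tokens (0,1), (2,3), (4,5) are merged.
M = np.kron(np.eye(3), np.array([[0.5, 0.5]]))   # (3, 6), averages each pair
U = np.kron(np.eye(3), np.ones((2, 1)))          # (6, 3), copies back to both

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
out = block_with_merge(x, M, U)
print(out.shape)
```

Because the attention score matrix is k × k instead of n × n, its cost shrinks quadratically with the merge ratio, while the output keeps the original n-token layout expected by the rest of the network.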

Baseline: Stable Diffusion v2.1. Compared against ToMe-SD.

The combined images below show (left → right): SD v2.1 baseline | ToMe-SD | MaMe+MaRe. MaMe+MaRe recovers finer hair strands and fur textures that ToMe-SD loses.

[SD baseline | ToMe-SD | MaMe+MaRe] — Siberian tiger fur detail

Moreover, we can control the clarity and sharpness of the generated image by adjusting the similarity threshold.

threshold from low to high
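To give a feel for why a similarity threshold changes how aggressively tokens are merged, the sketch below counts how many token pairs have cosine similarity at or above a threshold `tau`. This is only an assumed proxy for illustration (the function `mergeable_pairs` and the toy tokens are made up here); it is not the merging rule used by MaMe.

```python
import numpy as np

def mergeable_pairs(tokens, tau):
    """Count unordered token pairs with cosine similarity >= tau."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T                  # dense (n, n) similarity matrix
    mask = np.triu(sim >= tau, k=1)      # keep each unordered pair once
    return int(mask.sum())

tokens = np.array([[1.0, 0.0],
                   [0.9, 0.1],
                   [0.7, 0.7],
                   [0.0, 1.0]])

counts = {tau: mergeable_pairs(tokens, tau) for tau in (0.5, 0.75, 0.95)}
print(counts)
```

Raising `tau` can only keep or shrink the set of qualifying pairs, so a higher threshold means fewer merge candidates and a longer retained sequence for the same input.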


Acknowledgements

This codebase borrows some code from ToMe and ToMe-SD. Thanks to the authors for their excellent work.


Citation

@inproceedings{mame2026,
  title     = {MaMe: Matrix-Based Token Merging},
  author    = {Simin Huo and Ning Li},
  booktitle = {Computer Vision and Pattern Recognition Conference, Findings Track},
  year      = {2026}
}