MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

April 16, 2026

MaMe (Matrix-based token Merging) is a training-free, parameter-free token compression method for Vision Transformers. MaRe (Matrix-based token Restoration) is its inverse operation; together they form a MaMe + MaRe pipeline for generation tasks, where the full token sequence must be recovered. All operations are expressed as dense matrix multiplications and element-wise operations, so the pipeline is fully differentiable and GPU-friendly and needs no discrete sorting, top-k selection, or clustering.
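To illustrate what "expressed as dense matrix multiplications" can look like, here is a minimal sketch in plain Python. It assumes a hand-written assignment of four tokens to two groups. How MaMe actually derives its merge matrix from the input is described in the paper and is not reproduced here. The names `matmul`, `merge`, and `restore` are illustrative and are not taken from the released code.

```python
# Illustrative only: merging and restoring tokens with plain matrix products.
# The group assignment is hand-written; it is NOT MaMe's actual merge rule.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Four tokens with two features each; tokens 0/1 and 2/3 are near-duplicates.
tokens = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]

# Merge matrix (2 x 4): each row averages the tokens of one group.
merge = [[0.5, 0.5, 0.0, 0.0],
         [0.0, 0.0, 0.5, 0.5]]

# Restore matrix (4 x 2): each token copies its group's merged token.
restore = [[1.0, 0.0],
           [1.0, 0.0],
           [0.0, 1.0],
           [0.0, 1.0]]

merged = matmul(merge, tokens)      # 2 tokens instead of 4
restored = matmul(restore, merged)  # back to 4 tokens, each a group mean
```

In this toy case the round trip is lossy: every restored token equals the mean of its group, so differences between tokens inside a group are gone. No sorting or index gathering is involved, only two matrix products.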

Overview

Property	MaMe / MaRe
Training required	✗ (plug-and-play)
Extra parameters	✗
Differentiable	✓
Irregular indexing / gather-scatter	✗
Adaptive compression ratio	✓ (input-driven, threshold τ)
Token restoration	✓ (MaRe)

ViT Image Classification

MaMe is applied to the first 8 layers of pre-trained ViT models without any fine-tuning. More results (fine-tuning, zero-shot classification, video) are available in the paper.

Training-Free Results (ImageNet-1K, A100, batch 1024, FP16)

Model	Method	FLOPs (G)	Throughput (img/s)	Top-1 (%)
ViT-S (DeiT)	Baseline	4.6	5039	79.82
	EViT	2.3	8950	73.83
	ToMe	2.3	8874	77.99
	DiffRate	2.3	8875	78.75
	MaMe	2.3	9015	78.61
ViT-B (DeiT)	Baseline	17.6	2130	81.83
	EViT	8.7	4230	74.61
	ToMe	8.8	4023	77.84
	DiffRate	8.7	4124	78.98
	MaMe	8.7	4117	79.80
ViT-B (MAE)	Baseline	17.6	2130	83.72
	MaMe	8.7	5418	79.83
ViT-L (MAE)	Baseline	61.6	758	85.95
	MaMe	31.0	2764	84.81
ViT-H (MAE)	Baseline	167.4	299	86.88
	MaMe	93.2	908	85.51
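As a reading aid, the throughput and accuracy columns can be turned into relative numbers. The short script below does this for the two DeiT models, using only values copied from the table; the dictionary layout and variable names are illustrative.

```python
# Speed-up and Top-1 change for MaMe relative to each DeiT baseline.
# Each tuple is (throughput in img/s, Top-1 in %), copied from the table.
rows = {
    "ViT-S (DeiT)": {"baseline": (5039, 79.82), "mame": (9015, 78.61)},
    "ViT-B (DeiT)": {"baseline": (2130, 81.83), "mame": (4117, 79.80)},
}

summary = {}
for model, r in rows.items():
    base_tput, base_acc = r["baseline"]
    mame_tput, mame_acc = r["mame"]
    speedup = mame_tput / base_tput    # > 1 means faster than the baseline
    acc_delta = mame_acc - base_acc    # negative means lower Top-1
    summary[model] = (speedup, acc_delta)
    print(f"{model}: {speedup:.2f}x throughput, {acc_delta:+.2f} points Top-1")
```

For these rows the speed-up is roughly 1.79x for ViT-S and 1.93x for ViT-B, at a cost of about 1.2 and 2.0 Top-1 points respectively.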

Visualisation

The figure below shows the token progression through the first 8 blocks of AugReg ViT-B/16 with MaMe. Each coloured square represents a distinct token.

Token merging progression on a bird standing on a piece of wood (single-target scene)

COCO Caption (LLaVA-1.5-7B)

MaMe is applied to the visual encoder of LLaVA-1.5-7B, reducing the number of visual tokens fed into the language model. Evaluated with VLMEvalKit on COCO Caption.

Method	Latency (s)	Bleu-1	Bleu-2	Bleu-3	Bleu-4	ROUGE_L	CIDEr
LLaVA-1.5-7B	3.12	20.72	13.28	8.08	4.93	20.94	0.71
+ ToMe (r=8)	2.30	20.15	12.90	7.90	4.89	21.67	1.60
+ MaMe (τ=0.8)	2.61	20.10	12.87	7.87	4.83	21.69	2.71

MaMe achieves a CIDEr score of 2.71, about 3.8× the baseline (0.71) and 69% higher than ToMe (1.60), while reducing latency by 16% relative to the baseline (3.12 s to 2.61 s). ToMe is faster still (2.30 s), and BLEU scores dip slightly for both merging methods. We attribute the CIDEr gain to MaMe's high-pass effect: by merging redundant low-frequency tokens and preserving distinctive detail tokens, the language model receives a more focused visual representation, leading to more spatially precise and complete descriptions.
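The relative figures quoted above follow directly from the table. The snippet below recomputes them from the latency and CIDEr columns; the variable names are illustrative.

```python
# Recompute the relative CIDEr and latency figures from the table values.
baseline = {"latency": 3.12, "cider": 0.71}
tome = {"latency": 2.30, "cider": 1.60}
mame = {"latency": 2.61, "cider": 2.71}

# CIDEr of MaMe as a multiple of the baseline, and its gain over ToMe.
cider_vs_baseline = mame["cider"] / baseline["cider"]
cider_gain_vs_tome = mame["cider"] / tome["cider"] - 1

# Fraction of baseline latency saved by each method.
latency_cut_mame = 1 - mame["latency"] / baseline["latency"]
latency_cut_tome = 1 - tome["latency"] / baseline["latency"]

print(f"CIDEr vs baseline: {cider_vs_baseline:.1f}x")
print(f"CIDEr gain vs ToMe: {cider_gain_vs_tome:.0%}")
print(f"Latency cut, MaMe: {latency_cut_mame:.0%}, ToMe: {latency_cut_tome:.0%}")
```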

Qualitative Example

COCO Caption evaluation image — plate with fish and broccoli

Image Generation

For diffusion models, MaMe is inserted inside each transformer block of the U-Net/DiT to shorten the self-attention sequence, and MaRe follows to restore the full spatial resolution before the residual connection. The MaMe+MaRe pipeline reduces per-step latency, while the high-pass effect of MaMe enhances high-frequency texture details in the generated image. Subjectively, the generated images also tend to look more artistic.
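The placement described above (merge, then self-attention on the shorter sequence, then restore, then the residual connection) can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the merge and restore matrices are hand-written, the attention layer has no learned projections, and layer normalisation is omitted. It is not the Stable Diffusion or MaMe+MaRe source code, and all function names are hypothetical.

```python
# Sketch: where merge and restore sit inside one transformer block.
# Assumptions: hand-written merge/restore matrices, projection-free attention,
# no layer norm. Not the actual MaMe+MaRe implementation.
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x):
    """Softmax self-attention without learned projections (stand-in layer)."""
    scale = math.sqrt(len(x[0]))
    scores = matmul(x, [list(col) for col in zip(*x)])  # x @ x^T
    weights = []
    for row in scores:
        exps = [math.exp(s / scale) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, x)

def block(x, merge, restore, attention):
    """Return x + restore @ attention(merge @ x)."""
    short = matmul(merge, x)                # k tokens, k < n
    short_out = attention(short)            # attention runs on the short sequence
    full_out = matmul(restore, short_out)   # back to n tokens
    return [[xi + fi for xi, fi in zip(xr, fr)] for xr, fr in zip(x, full_out)]

x = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
merge = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
restore = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

seen_lengths = []
def counting_attention(t):
    seen_lengths.append(len(t))  # record the sequence length attention sees
    return self_attention(t)

out = block(x, merge, restore, counting_attention)
```

Here the attention layer sees 2 tokens instead of 4, so its score matrix is 2 x 2 rather than 4 x 4, while the residual sum is still taken over all 4 original tokens.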

Baseline: Stable Diffusion v2.1. Compared against ToMe-SD.

The combined images below show (left → right): SD v2.1 baseline | ToMe-SD | MaMe+MaRe. MaMe+MaRe recovers finer hair strands and fur textures that ToMe-SD loses.

[SD baseline | ToMe-SD | MaMe+MaRe] — Siberian tiger fur detail

Moreover, we can control the clarity and sharpness of the generated image by adjusting the similarity threshold.

Generated images as the similarity threshold increases from low to high
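To give a feel for why a similarity threshold acts as a control knob, the sketch below counts how many token pairs would qualify as "similar" at different thresholds. It assumes that the threshold is compared against pairwise cosine similarity, which is an assumption made for illustration; the exact rule MaMe uses is given in the paper. The toy tokens and the helper names `cosine` and `count_similar_pairs` are hypothetical.

```python
# Sketch: how many token pairs exceed a cosine-similarity threshold tau.
# Assumption: the threshold is applied to pairwise cosine similarity.
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def count_similar_pairs(tokens, tau):
    """Count unordered token pairs whose cosine similarity is at least tau."""
    n = len(tokens)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if cosine(tokens[i], tokens[j]) >= tau
    )

tokens = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.4], [0.0, 1.0]]
counts = {tau: count_similar_pairs(tokens, tau) for tau in (0.5, 0.8, 0.95)}
print(counts)  # {0.5: 4, 0.8: 3, 0.95: 1}
```

Under this assumption, raising the threshold leaves fewer candidate pairs, and the number of candidates depends on the input tokens themselves, not on a fixed per-layer budget.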

Acknowledgements

This codebase borrows some code from ToMe and ToMe-SD. Thanks to the authors for their excellent work.

Citation

@inproceedings{mame2026,
  title     = {MaMe: Matrix-Based Token Merging},
  author    = {Simin Huo and Ning Li},
  booktitle = {Computer Vision and Pattern Recognition Conference, Findings Track},
  year      = {2026}
}