PolaFormer: Polarity-aware Linear Attention for Vision Transformers [ICLR 2025]


If you like our work, please support us with a star ⭐!

🚀 Welcome to the repo of PolaFormer!

This repo contains the official PyTorch code and pre-trained models for PolaFormer.


🔥 News

  • [2026.06] 🔥 If you are interested in linear attention, you might also want to check out our new work, NalaFormer, which has been accepted by ICML 2026. [paper] [code]

  • [2025.10] 🔥 The next version of the model, PolaFormer++, has been released. We warmly welcome the community to use and explore it!

  • [2025.04] 🔥 A Triton implementation of PolaFormer has been released, thanks to the fbi_la library.

  • [2025.01] 🔥 Our paper has been accepted by the International Conference on Learning Representations (ICLR) 2025.

Introduction

Motivation

Linear attention has emerged as a promising alternative to softmax-based attention, leveraging kernelized feature maps to reduce complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ in sequence length. However, the non-negative constraint on feature maps and the relaxed exponential function used in approximation lead to significant information loss compared to the original query-key dot products, resulting in less discriminative attention maps with higher entropy. To address both the missing interactions driven by negative values in query-key pairs and the high entropy, we propose PolaFormer, which achieves a superior balance between expressive capability and efficiency.
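The complexity reduction comes from reordering the matrix product: with a non-negative feature map `phi`, computing `(phi(Q) phi(K)^T) V` and `phi(Q) (phi(K)^T V)` gives the same result, but the second form never builds the N x N score matrix. The dependency-free sketch below illustrates this. The feature map (ReLU plus a small constant) and all function names are assumptions made for illustration; they are not taken from the PolaFormer code.

```python
# Illustrative sketch only: kernel attention computed in two equivalent orders.
# `phi` is an assumed feature map, not the one used by PolaFormer.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]

def phi(m):
    """Non-negative feature map applied element-wise (ReLU plus a small constant)."""
    return [[max(x, 0.0) + 1e-3 for x in row] for row in m]

def quadratic_attention(q, k, v):
    """Build the full N x N score matrix first: cost grows as N^2."""
    scores = matmul(phi(q), transpose(phi(k)))           # N x N
    out = matmul(scores, v)                              # N x d
    return [[x / sum(s) for x in o] for o, s in zip(out, scores)]

def linear_attention(q, k, v):
    """Aggregate keys and values first: cost grows linearly in N."""
    fq, fk = phi(q), phi(k)
    kv = matmul(transpose(fk), v)                        # d x d summary of keys and values
    k_sum = [sum(col) for col in zip(*fk)]               # length-d normaliser
    out = matmul(fq, kv)                                 # N x d
    norm = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / n for x in o] for o, n in zip(out, norm)]

q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
k = [[1.0, 0.1], [-0.4, 0.9], [0.7, -0.6]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
full = quadratic_attention(q, k, v)
fast = linear_attention(q, k, v)
```

Both functions return the same values up to floating-point error. Note that `phi` discards the negative parts of `q` and `k`, which is the information loss discussed above.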

Method

In this paper, we propose a polarity-aware linear attention mechanism that explicitly models both same-signed and opposite-signed query-key interactions, ensuring comprehensive coverage of relational information. Furthermore, to restore the spiky properties of attention maps, we prove that a class of element-wise functions (those with positive first and second derivatives) can reduce the entropy of the attention distribution. Finally, we employ a learnable power function for rescaling, allowing strong and weak attention signals to be effectively separated.

Notably, we introduce two learnable polarity-aware coefficient matrices, applied via element-wise multiplication, which are expected to learn the complementary relationship between same-signed and opposite-signed values.
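The two ideas above, the polarity split and the entropy-reducing rescaling, can be illustrated on toy vectors. The sketch below is a minimal illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, the exponent is a fixed number rather than a learned parameter, the rescaling is applied to a toy weight vector, and the learnable coefficient matrices are omitted.

```python
import math

def split_polarity(x):
    """Return non-negative parts (x_pos, x_neg) such that x = x_pos - x_neg."""
    pos = [max(t, 0.0) for t in x]
    neg = [max(-t, 0.0) for t in x]
    return pos, neg

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(s * t for s, t in zip(a, b))

def polarity_terms(q, k):
    """Split <q, k> into its same-signed and opposite-signed contributions."""
    q_pos, q_neg = split_polarity(q)
    k_pos, k_neg = split_polarity(k)
    same = dot(q_pos, k_pos) + dot(q_neg, k_neg)
    opposite = dot(q_pos, k_neg) + dot(q_neg, k_pos)
    return same, opposite  # <q, k> == same - opposite

def entropy(weights):
    """Shannon entropy of non-negative weights after normalisation."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

def power_rescale(weights, p):
    """Element-wise x**p; for p > 1 both derivatives are positive on x > 0."""
    return [w ** p for w in weights]

q = [1.0, -2.0, 0.5]
k = [-0.5, -1.0, 2.0]
same, opposite = polarity_terms(q, k)

weights = [0.1, 0.2, 0.3, 0.4]
before = entropy(weights)
after = entropy(power_rescale(weights, 3.0))
```

Here `same - opposite` recovers the original dot product `dot(q, k)`, so a non-negative feature map that keeps only one polarity loses the `opposite` term. On the toy weights, `after` is smaller than `before`, which shows how a power with exponent greater than one sharpens a distribution.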

Results

  • Comparison of different models on ImageNet-1K.

  • Performance on Long Range Arena benchmark.

| Model | Text | ListOps | Retrieval | Pathfinder | Image | Average |
| --- | --- | --- | --- | --- | --- | --- |
| $\text{PolaFormer}_{\alpha=3}$ | 73.06 | 37.35 | 80.50 | 70.53 | 42.15 | 60.72 |
| $\text{PolaFormer}_{\alpha=5}$ | 72.33 | 38.76 | 80.37 | 68.98 | 41.91 | 60.47 |
| $\text{PolaFormer}_{\alpha=7}$ | 71.93 | 37.60 | 81.47 | 69.09 | 42.77 | 60.57 |

Dependencies

  • Python 3.9
  • PyTorch == 1.11.0
  • torchvision == 0.12.0
  • numpy
  • timm == 0.4.12
  • einops
  • yacs
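As a convenience, the pinned versions above can be checked against the active environment before training. The sketch below is a hypothetical helper, not part of this repo. It uses only the standard library (`importlib.metadata`) and assumes that PyTorch is installed under the distribution name `torch`.

```python
from importlib import metadata

# Pins copied from the dependency list above; None means no version is pinned.
REQUIRED = {
    "torch": "1.11.0",
    "torchvision": "0.12.0",
    "numpy": None,
    "timm": "0.4.12",
    "einops": None,
    "yacs": None,
}

def check_environment(required=REQUIRED):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for name, pinned in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        # Local build tags such as "+cu113" are ignored when comparing.
        if pinned is not None and installed.split("+")[0] != pinned:
            problems.append(f"{name}: found {installed}, expected {pinned}")
    return problems

if __name__ == "__main__":
    for line in check_environment() or ["all listed dependencies look fine"]:
        print(line)
```

The Python version itself (3.9) is not checked here; `sys.version_info` could be compared in the same way.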

Data preparation

The ImageNet dataset should be prepared as follows:

$ tree data
imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...
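Before launching a long run, it can help to confirm that a local copy matches the layout above. The following sketch is a hypothetical standalone helper, not a script shipped with this repo. The function names and the set of accepted file extensions are assumptions.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}  # assumed set of accepted extensions

def summarize_split(split_dir):
    """Map each class folder name to the number of image files it contains."""
    split_dir = Path(split_dir)
    return {
        class_dir.name: sum(
            1 for f in class_dir.iterdir() if f.suffix.lower() in IMAGE_SUFFIXES
        )
        for class_dir in sorted(split_dir.iterdir())
        if class_dir.is_dir()
    }

def check_imagenet_layout(root):
    """Return a list of problems found under root/train and root/val."""
    root = Path(root)
    problems = []
    summaries = {}
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split_dir}")
            continue
        summaries[split] = summarize_split(split_dir)
        if not summaries[split]:
            problems.append(f"no class folders in {split_dir}")
        for name, count in summaries[split].items():
            if count == 0:
                problems.append(f"no images in {split_dir / name}")
    # Both splits are expected to share the same class folder names.
    if len(summaries) == 2 and set(summaries["train"]) != set(summaries["val"]):
        problems.append("train and val contain different class folders")
    return problems
```

Calling `check_imagenet_layout("<imagenet-path>")` returns an empty list when both splits exist, every class folder holds at least one image, and the class names agree between `train` and `val`.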

Pretrained Models

Based on different model architectures, we provide several pretrained models, as listed below.

| model | Reso | acc@1 | config |
| --- | --- | --- | --- |
| Pola-PVT-T | $224^2$ | 78.8 (+3.7) | config |
| Pola-PVT-S | $224^2$ | 81.9 (+2.1) | config |
| Pola-Swin-T | $224^2$ | 82.6 (+1.4) | config |
| Pola-Swin-S | $224^2$ | 83.6 (+0.6) | config |
| Pola-Swin-B | $224^2$ | 83.8 (+0.3) | config |

Evaluate one model on ImageNet:

python -m torch.distributed.launch --nproc_per_node=8 main.py --cfg <path-to-config-file> --data-path <imagenet-path> --output <output-path> --eval --resume <path-to-pretrained-weights>

Train Models from Scratch

  • To train our model on ImageNet from scratch, see pretrain.sh and run:
bash pretrain.sh

Acknowledgements

This code is built on top of Swin Transformer and FLatten Transformer.

Citation

If you find this repo helpful, please consider citing us.

@inproceedings{
meng2025polaformer,
title={PolaFormer: Polarity-aware Linear Attention for Vision Transformers},
author={Weikang Meng and Yadan Luo and Xin Li and Dongmei Jiang and Zheng Zhang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=kN6MFmKUSK}
}

Contact

If you have any questions, please feel free to contact the authors.

Weikang Meng: zacharymengwk@gmail.com