Vision-RWKV

February 18, 2025

The official implementation of "Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures".

News🚀🚀🚀

2025/02/18: A new version of the CUDA code has been added in the cuda_new folder to eliminate the hardcoding of T_MAX.
2025/02/11: 🎊🎊 Vision-RWKV has been accepted to ICLR 2025!
2024/04/14: We now support RWKV6 for the classification task, with higher performance!
2024/03/04: We release the code and models of Vision-RWKV.

Highlights

High-Resolution Efficiency: Processes high-resolution images smoothly with a global receptive field.
Scalability: Can be pre-trained on large-scale datasets and scales up stably.
Superior Performance: Achieves better performance than ViTs in classification tasks. In dense prediction tasks, it surpasses window-based ViTs and is comparable to global-attention ViTs, with lower FLOPs and higher speed.
Efficient Alternative: Can serve as an alternative backbone to ViT across a wide range of vision tasks.

Overview

Schedule

Support RWKV6 as VRWKV6
Release VRWKV-L
Release VRWKV-T/S/B

Model Zoo

Pretrained Models

Model	Size	Pretrain	Download
VRWKV-L	192	ImageNet-22K	ckpt

Image Classification (ImageNet-1K)

Model	Size	#Param	#FLOPs	Top-1 Acc	Download
VRWKV-T	224	6.2M	1.2G	75.1	ckpt | cfg
VRWKV-S	224	23.8M	4.6G	80.1	ckpt | cfg
VRWKV-B	224	93.7M	18.2G	82.0	ckpt | cfg
VRWKV-L	384	334.9M	189.5G	86.0	ckpt | cfg
VRWKV6-T	224	7.6M	1.6G	76.6	ckpt | cfg
VRWKV6-S	224	27.7M	5.6G	81.1	ckpt | cfg
VRWKV6-B	224	104.9M	20.9G	82.6	ckpt | cfg

VRWKV-L is pretrained on ImageNet-22K and then fine-tuned on ImageNet-1K.
We train VRWKV-L with the InternImage codebase for faster training.
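
To make the trade-off between VRWKV and VRWKV6 easier to read off the ImageNet-1K table above, the following is a small illustrative sketch (not part of the official codebase) that computes the Top-1 gain and the relative parameter and FLOPs cost of VRWKV6 at each size. The numbers are copied from the table; the names `RESULTS` and `compare_v6` are hypothetical helpers introduced only for this example.

```python
# Top-1 accuracy (%), parameters (M), and FLOPs (G), copied from the
# ImageNet-1K table above. Only sizes with both variants are listed.
RESULTS = {
    "VRWKV-T": (75.1, 6.2, 1.2),
    "VRWKV-S": (80.1, 23.8, 4.6),
    "VRWKV-B": (82.0, 93.7, 18.2),
    "VRWKV6-T": (76.6, 7.6, 1.6),
    "VRWKV6-S": (81.1, 27.7, 5.6),
    "VRWKV6-B": (82.6, 104.9, 20.9),
}


def compare_v6(results):
    """Return the accuracy gain and cost ratios of VRWKV6 over VRWKV per size."""
    rows = []
    for size in ("T", "S", "B"):
        acc, params, flops = results[f"VRWKV-{size}"]
        acc6, params6, flops6 = results[f"VRWKV6-{size}"]
        rows.append(
            {
                "size": size,
                # Absolute Top-1 improvement in percentage points.
                "acc_gain": round(acc6 - acc, 1),
                # How many times more parameters / FLOPs VRWKV6 uses.
                "param_ratio": round(params6 / params, 2),
                "flops_ratio": round(flops6 / flops, 2),
            }
        )
    return rows


if __name__ == "__main__":
    for row in compare_v6(RESULTS):
        print(
            f"{row['size']}: +{row['acc_gain']} Top-1, "
            f"{row['param_ratio']}x params, {row['flops_ratio']}x FLOPs"
        )
```

Per the table, the accuracy gain from RWKV6 shrinks as the model grows, and so does its relative FLOPs overhead.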

Object Detection with Mask-RCNN head (COCO)

Model	#Param	#FLOPs	box AP	mask AP	Download
VRWKV-T	8.4M	67.9G	41.7	38.0	ckpt | cfg
VRWKV-S	29.3M	189.9G	44.8	40.2	ckpt | cfg
VRWKV-B	106.6M	599.0G	46.8	41.7	ckpt | cfg
VRWKV-L	351.9M	1730.6G	50.6	44.9	ckpt | cfg

We report the #Param and #FLOPs of the backbone in this table.

Semantic Segmentation with UperNet head (ADE20K)

Model	#Param	#FLOPs	mIoU	Download
VRWKV-T	8.4M	16.6G	43.3	ckpt | cfg
VRWKV-S	29.3M	46.3G	47.2	ckpt | cfg
VRWKV-B	106.6M	146.0G	49.2	ckpt | cfg
VRWKV-L	351.9M	421.9G	53.5	ckpt | cfg

We report the #Param and #FLOPs of the backbone in this table.

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{duan2024vrwkv,
  title={Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures},
  author={Duan, Yuchen and Wang, Weiyun and Chen, Zhe and Zhu, Xizhou and Lu, Lewei and Lu, Tong and Qiao, Yu and Li, Hongsheng and Dai, Jifeng and Wang, Wenhai},
  journal={arXiv preprint arXiv:2403.02308},
  year={2024}
}

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Acknowledgement

Vision-RWKV is built with reference to the code of the following projects: RWKV, MMPretrain, MMDetection, MMSegmentation, ViT-Adapter, InternImage. Thanks for their awesome work!