VisualRWKV: A Visual Language Model Based on RWKV

March 6, 2025


📖 Paper | 🤗 Model | 🐰 Demo

VisualRWKV is a visual language model based on the RWKV language model, enabling RWKV to handle various visual tasks.

Key Papers:

🚀 News and Updates

  • 2025.02.10 🔥 VisualRWKV-7.00 checkpoints released! [weights]
  • 2025.01.11 🔥 VisualRWKV-7.00 code released! [code]
  • 2024.06.25 🔥 VisualRWKV-6.0 checkpoints released! [weights]
  • 2024.05.11 🔥 VisualRWKV-6.0 code released! [code]
  • 2024.03.25 🔥 VisualRWKV-5.0 released!

📊 VisualRWKV v7.0 Metrics

The following table compares VisualRWKV v7.0 with its predecessor, VisualRWKV v6, on several benchmark datasets.

| Model Name | VQAv2 (test-dev) | ScienceQA (IMG) | TextVQA | GQA (acc) | Vision Encoder |
|---|---|---|---|---|---|
| v0700+0b1 | 75.22 | 50.62 | 37.90 | 59.92 | SigLIP+dinov2+Sam |
| v0700+0b4 | 77.85 | 54.98 | 41.05 | 62.30 | SigLIP+dinov2+Sam |
| v0700+1b5 | 79.84 | 59.74 | 49.49 | 63.20 | SigLIP+dinov2+Sam |
| VisualRWKV - v6 1.6B | 73.66 | 57.02 | 48.70 | 58.23 | SigLIP+dinov2+Sam |
| VisualRWKV - v6 3B | 71.52 | 65.34 | 48.68 | 59.56 | CLIP |
| VisualRWKV - v6 7B | 75.82 | 68.22 | 51.01 | 64.27 | CLIP |
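
To read the table as deltas, the short Python sketch below compares the v0700+1b5 row with the similarly sized v6 1.6B row. The scores are copied from the table; the dictionary layout and the `delta` helper are illustrative names and are not part of the repository.

```python
# Scores copied from the metrics table, in the order listed in BENCHMARKS.
BENCHMARKS = ["VQAv2", "ScienceQA", "TextVQA", "GQA"]
SCORES = {
    "v0700+1b5": [79.84, 59.74, 49.49, 63.20],
    "VisualRWKV - v6 1.6B": [73.66, 57.02, 48.70, 58.23],
}


def delta(new: str, old: str) -> dict:
    """Per-benchmark score difference (new minus old), rounded to 2 decimals."""
    return {
        name: round(a - b, 2)
        for name, a, b in zip(BENCHMARKS, SCORES[new], SCORES[old])
    }


if __name__ == "__main__":
    for name, diff in delta("v0700+1b5", "VisualRWKV - v6 1.6B").items():
        print(f"{name}: {diff:+.2f}")
```

On these numbers the v7 1B5 model is ahead of v6 1.6B on all four benchmarks, with the largest gap on VQAv2 and the smallest on TextVQA.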

๐Ÿ—๏ธ Architecture

VisualRWKV Architecture

🦄 Model Zoo

VisualRWKV weights, checkpoints, and related results can be found in the Model Zoo.


💻 Installation

1. Clone the repository

Clone the repo and navigate to the v7.00 folder inside it. Version 7.00 is the stable release.

git clone https://github.com/howard-hou/VisualRWKV.git
cd VisualRWKV/VisualRWKV-v7/v7.00

2. Install dependencies

Create a conda environment and install the necessary packages.

conda create -n visualrwkv python=3.10 -y
conda activate visualrwkv
pip install --upgrade pip  # Enable PEP 660 support

# Install dependencies:
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install pytorch-lightning==1.9.5 deepspeed==0.7.0 wandb ninja

# For best performance, use the following:
pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu126
pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade
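
After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the Python standard library; the `REQUIRED` list mirrors the packages named in the commands above, and `installed_versions` is a hypothetical helper, not something shipped with VisualRWKV.

```python
from importlib import metadata
from typing import Optional

# Packages named in the install commands above.
REQUIRED = ["torch", "pytorch-lightning", "deepspeed", "wandb", "ninja"]


def installed_versions(packages: list) -> dict:
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        version: Optional[str]
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = None
        versions[name] = version
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it inside the activated `visualrwkv` environment; a `NOT INSTALLED` entry suggests the corresponding `pip install` step did not complete.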

📚 Pre-training and Fine-tuning

The latest stable version is VisualRWKV-v7/v7.00. Run the code from the VisualRWKV-v7/v7.00 directory.

VisualRWKV training consists of two stages:

  1. Pre-training: Using a pre-training dataset to train a projection layer that connects a frozen pretrained vision encoder to the frozen RWKV model.
  2. Fine-tuning: Using visual instruction data to teach the model to follow visual instructions.
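
The two stages can be summarized as a plan of which components are updated and which stay frozen. The sketch below is illustrative only: `StagePlan`, `make_plan`, and the component names are hypothetical, and the set of trainable weights in stage 2 is an assumption, since the description above only states what is trained in stage 1.

```python
from dataclasses import dataclass

# Hypothetical component names for the three parts of the model.
COMPONENTS = frozenset({"vision_encoder", "projection", "rwkv"})


@dataclass(frozen=True)
class StagePlan:
    name: str
    dataset: str
    trainable: frozenset  # components whose weights are updated
    frozen: frozenset     # components kept fixed


def make_plan(name: str, dataset: str, trainable: set) -> StagePlan:
    """Build a plan; every component not listed as trainable is frozen."""
    unknown = set(trainable) - COMPONENTS
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return StagePlan(name, dataset, frozenset(trainable), COMPONENTS - trainable)


# Stage 1 follows the description above: only the projection layer is trained.
PRETRAIN = make_plan("pre-training", "LLaVA-Pretrain", {"projection"})

# ASSUMPTION: the text does not say which weights stage 2 updates. Training the
# projection and RWKV while keeping the vision encoder frozen is a common
# LLaVA-style choice, used here only for illustration.
FINETUNE = make_plan("fine-tuning", "visual instruction data", {"projection", "rwkv"})

if __name__ == "__main__":
    for plan in (PRETRAIN, FINETUNE):
        print(plan.name, "trains", sorted(plan.trainable), "freezes", sorted(plan.frozen))
```

For the actual settings used in each stage, the pretrain and fine-tune scripts in the repository are the authoritative source.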

🔥 Pre-training

Download LLaVA-Pretrain Dataset

Download the LLaVA-Pretrain dataset.

Download RWKV Checkpoints for Pre-training

If you want to pretrain the model yourself, download the following RWKV checkpoints.

| VisualRWKV Version | RWKV 0B1 | RWKV 0B4 | RWKV 1B5 | RWKV 3B | RWKV 7B |
|---|---|---|---|---|---|
| VisualRWKV-v6 | - | - | RWKV-x060-World-1B6 | RWKV-x060-World-3B | RWKV-x060-World-7B |
| VisualRWKV-v700 | RWKV-x070-World-0B1 | RWKV-x070-World-0B4 | RWKV-x070-World-1B5 | - | - |
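
When scripting downloads, the table can be encoded as a small lookup. This is a sketch only: the checkpoint names and column labels are copied from the table, while the `CHECKPOINTS` layout and the `base_checkpoint` helper are hypothetical and not part of the repository.

```python
# Checkpoint names copied from the table above; None marks a "-" cell.
CHECKPOINTS = {
    "VisualRWKV-v6": {
        "0B1": None,
        "0B4": None,
        "1B5": "RWKV-x060-World-1B6",
        "3B": "RWKV-x060-World-3B",
        "7B": "RWKV-x060-World-7B",
    },
    "VisualRWKV-v700": {
        "0B1": "RWKV-x070-World-0B1",
        "0B4": "RWKV-x070-World-0B4",
        "1B5": "RWKV-x070-World-1B5",
        "3B": None,
        "7B": None,
    },
}


def base_checkpoint(version: str, size: str) -> str:
    """Return the RWKV checkpoint name for a version and size column."""
    try:
        name = CHECKPOINTS[version][size]
    except KeyError:
        raise KeyError(f"unknown version/size: {version!r}, {size!r}") from None
    if name is None:
        raise LookupError(f"no {size} checkpoint is listed for {version}")
    return name


if __name__ == "__main__":
    print(base_checkpoint("VisualRWKV-v700", "1B5"))
```

Note that the 1B5 column for VisualRWKV-v6 maps to a checkpoint named 1B6, as in the table.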

Pre-training Command

To pre-train the VisualRWKV-v7.0 model, please refer to the pretrain script, which gives an example using 4 GPUs with a 1B5 RWKV model.


🔧 Visual Instruction Tuning

Prepare Data

Refer to the LLaVA project for visual instruction data.

Fine-tuning Command

To fine-tune the VisualRWKV-v7.0 model, please refer to the fine-tune script.