VisualRWKV: A Visual Language Model Based on RWKV
March 6, 2025
Paper | Model | Demo
VisualRWKV is a visual language model based on the RWKV language model, enabling RWKV to handle various visual tasks.
Key Papers:
- VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models
- Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
News and Updates
- 2025.02.10 VisualRWKV-7.00 checkpoints released! [weights]
- 2025.01.11 VisualRWKV-7.00 code released! [code]
- 2024.06.25 VisualRWKV-6.0 checkpoints released! [weights]
- 2024.05.11 VisualRWKV-6.0 code released! [code]
- 2024.03.25 VisualRWKV-5.0 released!
VisualRWKV v7.0 Metrics
The table below compares VisualRWKV v7.0 with its predecessor, VisualRWKV v6, on four benchmarks: VQAv2, ScienceQA, TextVQA, and GQA.
| Model Name | VQAv2 (test-dev) | ScienceQA (IMG) | TextVQA | GQA (acc) | Vision Encoder |
|---|---|---|---|---|---|
| v0700+0b1 | 75.22 | 50.62 | 37.90 | 59.92 | SigLIP+DINOv2+SAM |
| v0700+0b4 | 77.85 | 54.98 | 41.05 | 62.30 | SigLIP+DINOv2+SAM |
| v0700+1b5 | 79.84 | 59.74 | 49.49 | 63.20 | SigLIP+DINOv2+SAM |
| VisualRWKV-v6 1.6B | 73.66 | 57.02 | 48.70 | 58.23 | SigLIP+DINOv2+SAM |
| VisualRWKV-v6 3B | 71.52 | 65.34 | 48.68 | 59.56 | CLIP |
| VisualRWKV-v6 7B | 75.82 | 68.22 | 51.01 | 64.27 | CLIP |
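As a reading aid, the gap between the 1.5B v7.0 model and the similarly sized v6 1.6B model can be computed directly from the table. The snippet below is only a sketch: the dictionary names are illustrative, and the scores are copied from the two rows above.

```python
# Scores copied from the table above (benchmark -> score).
v7_1b5 = {"VQAv2": 79.84, "ScienceQA": 59.74, "TextVQA": 49.49, "GQA": 63.20}
v6_1b6 = {"VQAv2": 73.66, "ScienceQA": 57.02, "TextVQA": 48.70, "GQA": 58.23}

# Per-benchmark difference (v7.0 minus v6), rounded to two decimals.
gains = {name: round(v7_1b5[name] - v6_1b6[name], 2) for name in v7_1b5}

for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}")
```

Both rows list the same vision encoder combination, which makes this the most direct v6-to-v7 comparison available in the table.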
Architecture
Model Zoo
VisualRWKV weights, checkpoints, and related results can be found in the Model Zoo.
Installation
1. Clone the repository
Clone the repo and navigate to the VisualRWKV folder. Version 7.00 is the stable release.
git clone https://github.com/howard-hou/VisualRWKV.git
cd VisualRWKV/VisualRWKV-v7/v7.00
2. Install dependencies
Create a conda environment and install the necessary packages.
conda create -n visualrwkv python=3.10 -y
conda activate visualrwkv
pip install --upgrade pip # Enable PEP 660 support
# Option 1: install pinned dependencies
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install pytorch-lightning==1.9.5 deepspeed==0.7.0 wandb ninja
# Option 2: for best performance, install the latest PyTorch and DeepSpeed instead
pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu126
pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade
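After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical and not part of VisualRWKV, and the package list is taken from the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip commands above.
REQUIRED = ["torch", "pytorch-lightning", "deepspeed", "wandb", "ninja"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Run it inside the activated `visualrwkv` conda environment. Any package reported as not installed points to a pip step that failed or was skipped.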
Pre-training and Fine-tuning
The latest stable version is VisualRWKV-v7/v7.00. Run all training commands from the VisualRWKV-v7/v7.00 directory.
VisualRWKV training consists of two stages:
- Pre-training: trains a projection layer that maps the output of a frozen pretrained vision encoder into the frozen RWKV model, using a pretraining dataset.
- Fine-tuning: uses visual instruction data to teach the model to follow visual instructions.
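The two stages differ mainly in which components receive gradient updates. The sketch below records that as plain data. It is illustrative only: `StageConfig` and `trainable_components` are hypothetical names, not part of the VisualRWKV code. The pre-training row follows the description above; the fine-tuning row is an assumption (a LLaVA-style recipe that tunes both the projection and RWKV) and should be checked against the fine-tune script.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StageConfig:
    """Which components receive gradient updates in a training stage."""
    train_vision_encoder: bool
    train_projection: bool
    train_rwkv: bool


STAGES = {
    # As described above: only the projection layer is trained.
    "pretrain": StageConfig(
        train_vision_encoder=False, train_projection=True, train_rwkv=False
    ),
    # ASSUMPTION (LLaVA-style recipe, not confirmed by this README):
    # the projection layer and RWKV are both tuned.
    "finetune": StageConfig(
        train_vision_encoder=False, train_projection=True, train_rwkv=True
    ),
}


def trainable_components(stage):
    """List the component names that are trained in the given stage."""
    flags = asdict(STAGES[stage])
    return [name[len("train_"):] for name, enabled in flags.items() if enabled]


print(trainable_components("pretrain"))
print(trainable_components("finetune"))
```

In an actual training loop, flags like these would typically decide which parameters have `requires_grad` enabled.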
Pre-training
Download LLaVA-Pretrain Dataset
Download the LLaVA-Pretrain dataset.
Download RWKV Checkpoints for Pre-training
If you want to pretrain the model yourself, download the following RWKV checkpoints.
| VisualRWKV Version | RWKV 0B1 | RWKV 0B4 | RWKV 1B5 | RWKV 3B | RWKV 7B |
|---|---|---|---|---|---|
| VisualRWKV-v6 | - | - | RWKV-x060-World-1B6 | RWKV-x060-World-3B | RWKV-x060-World-7B |
| VisualRWKV-v700 | RWKV-x070-World-0B1 | RWKV-x070-World-0B4 | RWKV-x070-World-1B5 | - | - |
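For scripting, the table can be expressed as a small lookup. This is a sketch, not part of the repository: `CHECKPOINTS` and `rwkv_checkpoint` are hypothetical names, and the checkpoint names are copied from the table, with `None` standing in for a "-" cell.

```python
# Checkpoint names copied from the table above; None marks a "-" cell.
CHECKPOINTS = {
    "VisualRWKV-v6": {
        "0B1": None,
        "0B4": None,
        "1B5": "RWKV-x060-World-1B6",
        "3B": "RWKV-x060-World-3B",
        "7B": "RWKV-x060-World-7B",
    },
    "VisualRWKV-v700": {
        "0B1": "RWKV-x070-World-0B1",
        "0B4": "RWKV-x070-World-0B4",
        "1B5": "RWKV-x070-World-1B5",
        "3B": None,
        "7B": None,
    },
}


def rwkv_checkpoint(version, size):
    """Return the RWKV checkpoint name for a table row and column."""
    name = CHECKPOINTS[version][size]
    if name is None:
        raise ValueError(f"No {size} RWKV checkpoint is listed for {version}")
    return name


print(rwkv_checkpoint("VisualRWKV-v700", "1B5"))
```

Note that the v6 entry in the 1B5 column is the 1B6 checkpoint, as listed in the table.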
Pre-training Command
To pretrain the VisualRWKV-v7.0 model, refer to the pretrain script, which shows an example using 4 GPUs with a 1B5 RWKV model.
Visual Instruction Tuning
Prepare Data
Refer to the LLaVA project for visual instruction data.
Fine-tuning Command
To fine-tune the VisualRWKV-v7.0 model, refer to the fine-tune script.