rwkv.cpp

January 12, 2025 · View on GitHub

This is a port of BlinkDL/RWKV-LM to ggerganov/ggml.

Besides the usual FP32, it supports FP16 and quantized INT4, INT5 and INT8 inference. The project is focused on CPU inference, but cuBLAS is also supported.

This project provides a C library rwkv.h and a convenient Python wrapper for it.

RWKV is a large language model architecture. In contrast to Transformers, whose attention cost grows as O(n^2) in context length, RWKV needs only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly at large context lengths.
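To make the linear-time property concrete, here is a minimal single-channel sketch of the RWKV v4 WKV recurrence (the real kernels are vectorized over channels and subtract a running maximum for numerical stability, which this sketch omits); w is a learned per-channel decay and u a per-channel "bonus" for the current token:

```python
import math

def wkv_step(k, v, state, w=0.5, u=0.1):
    """One RWKV v4 WKV step for a single channel.
    state = (a, b): running weighted sums of values and of weights.
    Only this fixed-size state is carried between tokens, so the cost
    per token is O(1) regardless of context length."""
    a, b = state
    e_uk = math.exp(u + k)
    out = (a + e_uk * v) / (b + e_uk)  # wkv_t
    e_k = math.exp(k)
    decay = math.exp(-w)
    new_state = (decay * (a + e_k * v), decay * (b + e_k))
    return out, new_state

# Process a token stream with constant memory:
state = (0.0, 0.0)
outputs = []
for k, v in [(0.2, 1.0), (-0.3, 2.0), (0.1, -1.0)]:
    y, state = wkv_step(k, v, state)
    outputs.append(y)
```

Each output is a decaying, key-weighted average of past values, which is why only the pair (a, b) ever needs to be kept around.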

This project supports RWKV v4, v5, v6 and the latest v7 architectures.

Loading LoRA checkpoints in Blealtan's format is supported through the merge_lora_into_ggml.py script.

Quality and performance

If you use rwkv.cpp for anything serious, please test all available formats for perplexity and latency on a representative dataset, and decide which trade-off is best for you.

The table below is for reference only. Measurements were made on a 4C/8T x86 CPU with AVX2, using 4 threads. The models are RWKV v4 Pile 169M and RWKV v4 Pile 1.5B.

Format  Perplexity (169M)  Latency, ms (1.5B)  File size, GB (1.5B)
Q4_0    17.507             76                  1.53
Q4_1    17.187             72                  1.68
Q5_0    16.194             78                  1.60
Q5_1    15.851             81                  1.68
Q8_0    15.652             89                  2.13
FP16    15.623             117                 2.82
FP32    15.623             198                 5.64
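For readers unfamiliar with the perplexity column: perplexity is the exponential of the average negative log-likelihood the model assigns to each actual next token, so lower is better. A tiny illustrative computation:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood).
    token_log_probs: natural-log probabilities the model assigned
    to each actual next token in the evaluation text."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 1/16 to every token has perplexity 16,
# i.e. it is as uncertain as a uniform choice among 16 tokens:
ppl = perplexity([math.log(1 / 16)] * 100)
```

This is why the FP16 and FP32 rows share the same perplexity: the predictions are numerically almost identical, so only latency and file size differ.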

With cuBLAS

Measurements were made on Intel i7 13700K & NVIDIA 3060 Ti 8 GB. The model is RWKV-4-Pile-169M, 12 layers were offloaded to GPU.

Latency per token, in ms, is shown.

Format  1 thread  2 threads  4 threads  8 threads  24 threads
Q4_0    7.9       6.2        6.9        8.6        20
Q4_1    7.8       6.7        6.9        8.6        21
Q5_1    8.1       6.7        6.9        9.0        22

Format  1 thread  2 threads  4 threads  8 threads  24 threads
Q4_0    59        51         50         54         94
Q4_1    59        51         49         54         94
Q5_1    77        69         67         72         101

Note: since cuBLAS is supported only for ggml_mul_mat(), some CPU resources are still needed to execute the remaining operations.

With hipBLAS

Measurements were made on CPU AMD Ryzen 9 5900X & GPU AMD Radeon RX 7900 XTX. The model is RWKV-novel-4-World-7B-20230810-ctx128k, 32 layers were offloaded to GPU.

Latency per token, in ms, is shown.

Format  1 thread  2 threads  4 threads  8 threads  24 threads
f16     94        91         94         106        944
Q4_0    83        77         75         110        1692
Q4_1    85        80         85         93         1691
Q5_1    83        78         83         90         1115

Note: as with cuBLAS, hipBLAS supports only ggml_mul_mat(), so some CPU resources are still needed to execute the remaining operations.

How to use

1. Clone the repo

Requirements: git.

git clone --recursive https://github.com/saharNooby/rwkv.cpp.git
cd rwkv.cpp

2. Get the rwkv.cpp library

Option 2.1. Download a pre-compiled library

Windows / Linux / MacOS

Check out Releases, download the appropriate ZIP for your OS and CPU, and extract the rwkv library file into the repository directory.

On Windows: to check whether your CPU supports AVX2 or AVX-512, use CPU-Z.

Option 2.2. Build the library yourself

This option is recommended for maximum performance, because the library will be built specifically for your CPU and OS.

Windows

Requirements: CMake or CMake from anaconda, Build Tools for Visual Studio 2019.

cmake .
cmake --build . --config Release

If everything went OK, bin\Release\rwkv.dll file should appear.

Windows + cuBLAS

Refer to docs/cuBLAS_on_Windows.md for a comprehensive guide.

Windows + hipBLAS

Refer to docs/hipBLAS_on_Windows.md for a comprehensive guide.

Linux / MacOS

Requirements: CMake (Linux: sudo apt install cmake, MacOS: brew install cmake, Anaconda: cmake package).

cmake .
cmake --build . --config Release

Anaconda & M1 users: please verify that CMAKE_SYSTEM_PROCESSOR is arm64 after running cmake .; if it detects x86_64, edit CMakeLists.txt under the # Compile flags comment and add set(CMAKE_SYSTEM_PROCESSOR "arm64").

If everything went OK, librwkv.so (Linux) or librwkv.dylib (MacOS) file should appear in the base repo folder.

Linux / MacOS + cuBLAS
cmake . -DRWKV_CUBLAS=ON
cmake --build . --config Release

If everything went OK, librwkv.so (Linux) or librwkv.dylib (MacOS) file should appear in the base repo folder.

3. Get an RWKV model

Requirements: Python 3.x with PyTorch.

First, download a model from Hugging Face like this one.

Second, convert it into rwkv.cpp format using the following commands:

# Windows
python python\convert_pytorch_to_ggml.py C:\RWKV-4-Pile-169M-20220807-8023.pth C:\rwkv.cpp-169M.bin FP16

# Linux / MacOS
python python/convert_pytorch_to_ggml.py ~/Downloads/RWKV-4-Pile-169M-20220807-8023.pth ~/Downloads/rwkv.cpp-169M.bin FP16

Optionally, quantize the model into one of quantized formats from the table above:

# Windows
python python\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q5_1.bin Q5_1

# Linux / MacOS
python python/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q5_1.bin Q5_1
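To give an intuition for what formats like Q5_1 do, here is an illustrative sketch of block quantization in the ggml style: weights are split into blocks of 32, and each block keeps only a scale, a minimum, and one small integer per weight. (The real ggml code packs the 5-bit values into bytes and stores the scale and minimum as FP16; this sketch keeps plain Python numbers to show the idea.)

```python
def quantize_block_q5_1(xs):
    """Sketch of Q5_1-style block quantization: 32 weights are reduced
    to a scale d, a minimum m, and one 5-bit integer (0..31) each."""
    assert len(xs) == 32
    lo, hi = min(xs), max(xs)
    d = (hi - lo) / 31 or 1.0  # 31 = 2**5 - 1 quantization levels
    qs = [round((x - lo) / d) for x in xs]
    return d, lo, qs

def dequantize_block_q5_1(d, m, qs):
    """Reverse mapping: each weight is approximated as q * d + m."""
    return [q * d + m for q in qs]

block = [i / 10 - 1.5 for i in range(32)]  # 32 example weights
d, m, qs = quantize_block_q5_1(block)
restored = dequantize_block_q5_1(d, m, qs)
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

The rounding error per weight is at most half a quantization step (d / 2), which is the quality/size trade-off the perplexity table above measures.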

4. Run the model

Using the command line

Requirements: Python 3.x with numpy. If using Pile or Raven models, tokenizers is also required.

To generate some text, run:

# Windows
python python\generate_completions.py C:\rwkv.cpp-169M-Q5_1.bin

# Linux / MacOS
python python/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q5_1.bin

To chat with a bot, run:

# Windows
python python\chat_with_bot.py C:\rwkv.cpp-169M-Q5_1.bin

# Linux / MacOS
python python/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q5_1.bin

Edit generate_completions.py or chat_with_bot.py to change prompts and sampling settings.
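The sampling settings in those scripts typically include a temperature and a top-p (nucleus) threshold. As a rough illustration of what these knobs do, here is a minimal sampler over raw logits; this is a sketch, not the scripts' actual code:

```python
import math
import random

def sample_logits(logits, temperature=0.8, top_p=0.5, rng=random.Random(42)):
    """Temperature scaling + nucleus (top-p) sampling over raw logits.
    Lower temperature sharpens the distribution; lower top_p restricts
    sampling to the most probable tokens."""
    scaled = [l / temperature for l in logits]
    mx = max(scaled)                       # subtract max for stability
    exps = [math.exp(l - mx) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Draw from the kept set, renormalized to its total mass.
    r = rng.random() * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

With a strongly peaked distribution and a small top_p, this almost always returns the argmax token; raising temperature or top_p makes the output more varied.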

Using in your own code

The short and simple script inference_example.py demonstrates the use of rwkv.cpp in Python.

To use rwkv.cpp in C/C++, include the header rwkv.h.

To use rwkv.cpp in any other language, see Bindings section below. If your language is missing, you can try to bind to the C API using the tooling provided by your language.
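As a starting point for such a binding, here is a sketch of loading the shared library and declaring a few entry points with Python's ctypes. The function names and signatures below are assumptions based on rwkv.h at the time of writing; verify them against the header you actually built, and prefer the bundled Python wrapper for real use:

```python
import ctypes

def load_rwkv(lib_path):
    """Bind a few rwkv.cpp entry points via ctypes.
    lib_path: path to librwkv.so / librwkv.dylib / rwkv.dll.
    Signatures are assumptions; check them against your rwkv.h."""
    lib = ctypes.CDLL(lib_path)
    lib.rwkv_init_from_file.restype = ctypes.c_void_p
    lib.rwkv_init_from_file.argtypes = [ctypes.c_char_p, ctypes.c_uint32]
    lib.rwkv_eval.restype = ctypes.c_bool
    lib.rwkv_eval.argtypes = [
        ctypes.c_void_p,                 # context
        ctypes.c_uint32,                 # token id
        ctypes.POINTER(ctypes.c_float),  # state_in (NULL on first call)
        ctypes.POINTER(ctypes.c_float),  # state_out
        ctypes.POINTER(ctypes.c_float),  # logits_out
    ]
    lib.rwkv_free.restype = None
    lib.rwkv_free.argtypes = [ctypes.c_void_p]
    return lib
```

The same declare-then-call pattern applies to most languages with a C FFI (Rust's bindgen, Go's cgo, Node's ffi libraries, and so on).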

Bindings

These projects wrap rwkv.cpp for easier use in other languages/frameworks.

Compatibility

ggml moves fast, and can occasionally break compatibility with older file formats.

rwkv.cpp will try its best to explain why a model file can't be loaded and what next steps are available to the user.

For reference only, here is a list of the latest versions of rwkv.cpp that supported older formats. No support will be provided for these versions.

See also docs/FILE_FORMAT.md for version numbers of rwkv.cpp model files and their changelog.

Contributing

Please follow the code style described in docs/CODE_STYLE.md.