June 8, 2026

RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models

Paper | Project Page | Model | Data

News

  • 2025.3.16 The RAP dataset is now available. Access it here. 🔥🔥
  • 2025.2.27 RAP has been accepted to CVPR 2025! 🎉🎉
  • 2024.11.24 Released code and model weights.

Personalize Your Multimodal Large Language Model via Retrieval Augmented Generation.

RAP-MLLM
Once you introduce user-specific concepts to RAP-MLLM, it can remember them and achieve excellent performance in a variety of personalized multimodal generation tasks.

Visit our Project Page for more demonstrations.

🛠️ Install

  1. Clone the repo into a local folder.
git clone https://github.com/Hoar012/RAP-MLLM.git

cd RAP-MLLM
  2. Install packages.
conda create -n rap python=3.10 -y
conda activate rap
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

pip install -r requirements.txt

🤗 Models

Pretrained model weights are available on Hugging Face.

RAP-LLaVA: RAP-LLaVA-13b; RAP-Phi3-V: RAP-Phi3-mini

🖥️ Demo

Build Your Personal Database:

Each concept record in the database can be structured with the following format:

{
    "concept_dict": {
        "<concept>": {
            "name": "concept_name",
            "image": "image_path",
            "info": "",
            "category": ""
        }
    },
    "path_to_concept": {
        "image_path": "<concept>"
    }
}

We provide an example of the database in example_database.
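
A record in this format can also be assembled programmatically. The following is a minimal Python sketch, not part of the repository: the helper `add_concept`, the concept key `<my_dog>`, and the image path are hypothetical placeholders. It assumes that every image path in `path_to_concept` should map back to the key of its record in `concept_dict`, as the template above suggests. The README does not say which file name the database loader expects, so the sketch only prints the JSON.

```python
import json


def add_concept(database, key, name, image_path, info="", category=""):
    """Add one concept record plus its reverse image-path lookup entry."""
    database["concept_dict"][key] = {
        "name": name,
        "image": image_path,
        "info": info,
        "category": category,
    }
    # Keep the reverse mapping in sync with the record's "image" field.
    database["path_to_concept"][image_path] = key
    return database


# Start from an empty database that follows the documented layout.
database = {"concept_dict": {}, "path_to_concept": {}}

# Hypothetical concept, used purely for illustration.
add_concept(
    database,
    key="<my_dog>",
    name="my_dog",
    image_path="images/my_dog.png",
    info="A small brown dog.",
    category="dog",
)

# Serialize with the standard library; json.dumps never emits trailing commas.
database_json = json.dumps(database, indent=4)
print(database_json)
```

Building the two mappings through a single helper keeps them consistent, which is easy to get wrong when the file is edited by hand.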

CLI Demo:

python cli.py --model-path Hoar012/RAP-LLaVA-13b --image-file /path/to/test_image --retrieval --database example_database --topK 1
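
The README does not describe how retrieval is implemented, so the following Python sketch only illustrates the general idea behind a top-K lookup: rank the stored concepts by similarity to a query and keep the K best. Everything here is an assumption made for illustration: the function names, the toy three-dimensional vectors, and the use of cosine similarity. It is not the code that `cli.py` runs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query, concept_embeddings, k=1):
    """Return the keys of the k concepts most similar to the query, best first."""
    ranked = sorted(
        concept_embeddings,
        key=lambda key: cosine_similarity(query, concept_embeddings[key]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings for three hypothetical concepts.
toy_embeddings = {
    "<my_dog>": [0.9, 0.1, 0.0],
    "<my_mug>": [0.0, 0.2, 0.9],
    "<my_cat>": [0.7, 0.6, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]

top_one = retrieve_top_k(query_embedding, toy_embeddings, k=1)
top_two = retrieve_top_k(query_embedding, toy_embeddings, k=2)
print(top_one)
print(top_two)
```

Under this reading, a larger K returns more candidate concepts per image, which matters when an image contains several personalized concepts.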

💾 Data

Please check Data for more details.

🚀 Training

We provide the training scripts with DeepSpeed below. Try training on your own dataset!

| Model  | RAP-LLaVA | RAP-Phi3-V | LLaVA-LoRA |
| ------ | --------- | ---------- | ---------- |
| Script | script    | script     | script     |

📊 Evaluation

Prepare Data

Please download the test data used in the paper from the repositories of MyVLM and Yo'LLaVA.

We also provide the images for multi-concept evaluation in this Google Drive link.

In addition, we provide the full database used for question answering at this Google Drive link.

Evaluation on Image Captioning

python eval/caption.py --eval-file /path/to/eval_file --model-path Hoar012/RAP-LLaVA-13b --retrieval --database /path/to/database --topK 2

The eval-file records the image paths to be evaluated and their corresponding target concepts, formatted as follows:

{
    "/path/to/image": [
        "target_concept"
    ]
}
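
An eval-file with this shape can be generated from a plain mapping of image paths to target concepts. The snippet below is a small Python sketch under stated assumptions: the helper `build_eval_records`, the image paths, and the concept names are hypothetical. The README does not specify whether a target concept is written as its name or as its database key, so the values are placeholders.

```python
import json


def build_eval_records(image_to_concepts):
    """Normalize a mapping so every image path points to a list of target concepts."""
    records = {}
    for image_path, concepts in image_to_concepts.items():
        # Accept a single concept given as a string as well as a list of concepts.
        if isinstance(concepts, str):
            concepts = [concepts]
        records[image_path] = list(concepts)
    return records


# Hypothetical images and target concepts, for illustration only.
eval_records = build_eval_records(
    {
        "/path/to/image_1.png": "my_dog",
        "/path/to/image_2.png": ["my_dog", "my_cat"],
    }
)

# json.dumps produces strict JSON without trailing commas.
eval_json = json.dumps(eval_records, indent=4)
print(eval_json)
```

Saving `eval_json` to a file would give a path that could be passed to `--eval-file`.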

Evaluation on Question Answering

python eval/VQA.py --eval-file eval/yollava-visual-qa.json --model-path Hoar012/RAP-LLaVA-13b --retrieval --database /path/to/database --topK 1

Replace /path/to/output_file with the path to your output file, then run the following command to obtain the accuracy:

python eval/eval_qa.py --output_path /path/to/output_file

Evaluation on Visual Recognition

python eval/recognition.py --eval-file eval/recognition_test.json --model-path Hoar012/RAP-LLaVA-13b --retrieval --database /path/to/database --topK 1

BibTeX

@InProceedings{Hao_2025_CVPR,
    author    = {Hao, Haoran and Han, Jiaming and Li, Changsheng and Li, Yu-Feng and Yue, Xiangyu},
    title     = {RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {14538-14548}
}

Acknowledgement

LLaVA, MyVLM, Yo'LLaVA