README.md

March 20, 2025

background
🌐 Project Page | 📖 arXiv Paper | 📜 Documentation | 📊 Dataset | 🤗 Hugging Face | 🏆 Leaderboard

Truthfulness Safety Robustness Fairness Privacy


MultiTrust is a comprehensive benchmark designed to assess and enhance the trustworthiness of MLLMs across five key dimensions: truthfulness, safety, robustness, fairness, and privacy. It integrates a rigorous evaluation strategy involving 32 diverse tasks to expose new trustworthiness challenges.

framework

🚀 News

🛠️ Installation

The environment in this version has been updated to support more recent models. To replicate the experimental results presented in the paper more precisely, switch to the branch v0.1.0.

  • Option A: UV install

    uv venv --python 3.9
    source .venv/bin/activate
    
    uv pip install setuptools
    uv pip install torch==2.3.0
    uv pip sync --no-build-isolation env/requirements.txt
    
  • Option B: Docker

    • How to install docker

      # Our docker version:
      #     Client: Docker Engine - Community
      #     Version:           27.0.0-rc.1
      #     API version:       1.46
      #     Go version:        go1.21.11
      #     OS/Arch:           linux/amd64
      
      distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
      curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
      curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
      
      sudo apt-get update
      sudo apt-get install -y nvidia-container-toolkit
      
      sudo systemctl restart docker
      sudo usermod -aG docker [your_username_here]
      
    • Get our image:

      • B.1: Pull image from DockerHub

        docker pull jankinfstmrvv/multitrust:latest
        
      • B.2: Build from scratch

        # Note:
        # [data], used in the `docker run` command below, is the absolute path to the data directory.
        
        docker build --network=host -t multitrust:latest -f env/Dockerfile .
        
    • Start a container:

      docker run -it \
          --name multitrust \
          --gpus all \
          --privileged=true \
          --shm-size=10gb \
          -v $HOME/.cache/huggingface:/root/.cache/huggingface \
          -v $HOME/.cache/torch:/root/.cache/torch \
          -v [data]:/root/MMTrustEval/data \
          -w /root/MMTrustEval \
          -d multitrust:latest /bin/bash
      
      # entering the container
      docker exec -it multitrust /bin/bash
      
  • Several tasks require the use of commercial APIs for auxiliary testing. Therefore, if you want to test all tasks, please add the corresponding model API keys in env/apikey.yml.

:envelope: Dataset

License

  • The codebase is licensed under the CC BY-SA 4.0 license.

  • MultiTrust is only used for academic research. Commercial use in any form is prohibited.

  • If there is any infringement in MultiTrust, please directly raise an issue, and we will remove it immediately.

Data Preparation

Refer here for detailed instructions.

📚 Docs

Our documentation presents interface definitions for the different modules and tutorials on how to extend them. It is available online at: https://thu-ml.github.io/MMTrustEval/

Run the following command to view the docs locally.

mkdocs serve -f env/mkdocs.yml -a 0.0.0.0:8000

📈 Reproduce Results in Our Paper

Running scripts under scripts/run can generate the model outputs of specific tasks and corresponding primary evaluation results in either a global or sample-wise manner.

📌 To Make Inference

# Description: Run scripts require a model_id to run inference tasks.
# Usage: bash scripts/run/*/*.sh <model_id>

scripts/run
├── fairness_scripts
│   ├── f1-stereo-generation.sh
│   ├── f2-stereo-agreement.sh
│   ├── f3-stereo-classification.sh
│   ├── f3-stereo-topic-classification.sh
│   ├── f4-stereo-query.sh
│   ├── f5-vision-preference.sh
│   ├── f6-profession-pred.sh
│   └── f7-subjective-preference.sh
├── privacy_scripts
│   ├── p1-vispriv-recognition.sh
│   ├── p2-vqa-recognition-vispr.sh
│   ├── p3-infoflow.sh
│   ├── p4-pii-query.sh
│   ├── p5-visual-leakage.sh
│   └── p6-pii-leakage-in-conversation.sh
├── robustness_scripts
│   ├── r1-ood-artistic.sh
│   ├── r2-ood-sensor.sh
│   ├── r3-ood-text.sh
│   ├── r4-adversarial-untarget.sh
│   ├── r5-adversarial-target.sh
│   └── r6-adversarial-text.sh
├── safety_scripts
│   ├── s1-nsfw-image-description.sh
│   ├── s2-risk-identification.sh
│   ├── s3-toxic-content-generation.sh
│   ├── s4-typographic-jailbreaking.sh
│   ├── s5-multimodal-jailbreaking.sh
│   └── s6-crossmodal-jailbreaking.sh
└── truthfulness_scripts
    ├── t1-basic.sh
    ├── t2-advanced.sh
    ├── t3-instruction-enhancement.sh
    ├── t4-visual-assistance.sh
    ├── t5-text-misleading.sh
    ├── t6-visual-confusion.sh
    └── t7-visual-misleading.sh

📌 To Evaluate Results

After inference, the scripts under scripts/score compute the statistical results from the model outputs and show the results reported in the paper.

# Description: Run scripts require a model_id to calculate statistical results.
# Usage: python scripts/score/*/*.py --model_id <model_id>

scripts/score
├── fairness
│   ├── f1-stereo-generation.py
│   ├── f2-stereo-agreement.py
│   ├── f3-stereo-classification.py
│   ├── f3-stereo-topic-classification.py
│   ├── f4-stereo-query.py
│   ├── f5-vision-preference.py
│   ├── f6-profession-pred.py
│   └── f7-subjective-preference.py
├── privacy
│   ├── p1-vispriv-recognition.py
│   ├── p2-vqa-recognition-vispr.py
│   ├── p3-infoflow.py
│   ├── p4-pii-query.py
│   ├── p5-visual-leakage.py
│   └── p6-pii-leakage-in-conversation.py
├── robustness
│   ├── r1-ood_artistic.py
│   ├── r2-ood_sensor.py
│   ├── r3-ood_text.py
│   ├── r4-adversarial_untarget.py
│   ├── r5-adversarial_target.py
│   └── r6-adversarial_text.py
├── safefy
│   ├── s1-nsfw-image-description.py
│   ├── s2-risk-identification.py
│   ├── s3-toxic-content-generation.py
│   ├── s4-typographic-jailbreaking.py
│   ├── s5-multimodal-jailbreaking.py
│   └── s6-crossmodal-jailbreaking.py
└── truthfulness
    ├── t1-basic.py
    ├── t2-advanced.py
    ├── t3-instruction-enhancement.py
    ├── t4-visual-assistance.py
    ├── t5-text-misleading.py
    ├── t6-visual-confusion.py
    └── t7-visual-misleading.py
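The two usage patterns above (`bash scripts/run/*/*.sh <model_id>` for inference and `python scripts/score/*/*.py --model_id <model_id>` for scoring) can be batched with a small helper. The following is only a sketch and not part of the toolbox: the function name `build_commands`, its `dimension` filter, and the placeholder `my-model-id` are assumptions made for illustration. It only builds the command lists, so nothing is executed until you pass a command to `subprocess.run`.

```python
from pathlib import Path
from typing import List, Optional


def build_commands(repo_root: str, model_id: str, stage: str,
                   dimension: Optional[str] = None) -> List[List[str]]:
    """Return one command (argv list) per script for the given stage.

    stage="run"   -> bash scripts/run/<dir>/<task>.sh <model_id>
    stage="score" -> python scripts/score/<dir>/<task>.py --model_id <model_id>
    `dimension` optionally keeps only directories whose name starts with it.
    """
    if stage == "run":
        pattern, prefix, suffix = "scripts/run/*/*.sh", ["bash"], [model_id]
    elif stage == "score":
        pattern, prefix, suffix = "scripts/score/*/*.py", ["python"], ["--model_id", model_id]
    else:
        raise ValueError(f"unknown stage: {stage!r}")

    commands = []
    for script in sorted(Path(repo_root).glob(pattern)):
        if dimension and not script.parent.name.startswith(dimension):
            continue
        # Paths are kept relative to the repository root, matching the usage lines.
        rel = script.relative_to(repo_root).as_posix()
        commands.append(prefix + [rel] + suffix)
    return commands


if __name__ == "__main__":
    # "my-model-id" is a placeholder; substitute a model_id the toolbox supports.
    for cmd in build_commands(".", "my-model-id", "run", dimension="privacy"):
        print(" ".join(cmd))  # use subprocess.run(cmd, check=True) to execute
```

Run it from the repository root so the relative script paths resolve. The `dimension` filter matches directory names exactly as they appear in the listings above.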

📌 Task List

All 32 tasks are listed below. ○: rule-based evaluation (e.g., keyword matching); ●: automatic evaluation by GPT-4 or other classifiers; ◐: mixed evaluation.

| ID | Task Name | Metrics | Task Type | Eval |
| --- | --- | --- | --- | --- |
| T.1 | Basic World Understanding | Accuracy (↑) | Dis.&Gen. | ◐ |
| T.2 | Advanced Cognitive Inference | Accuracy (↑) | Dis. | ○ |
| T.3 | VQA under Instruction Enhancement | Accuracy (↑) | Gen. | ● |
| T.4 | QA under Visual Assistance | Accuracy (↑) | Gen. | ● |
| T.5 | Text Misleading VQA | Accuracy (↑) | Gen. | ● |
| T.6 | Visual Confusion VQA | Accuracy (↑) | Gen. | ○ |
| T.7 | Visual Misleading QA | Accuracy (↑) | Dis. | ● |
| S.1 | Risk Identification | Accuracy (↑) | Dis.&Gen. | ◐ |
| S.2 | Image Description | Toxicity Score (↓), RtA (↑) | Gen. | ● |
| S.3 | Toxicity Content Generation | Toxicity Score (↓), RtA (↑) | Gen. | ◐ |
| S.4 | Plain Typographic Jailbreaking | ASR (↓), RtA (↑) | Gen. | ◐ |
| S.5 | Optimized Multimodal Jailbreaking | ASR (↓), RtA (↑) | Gen. | ◐ |
| S.6 | Cross-modal Influence on Jailbreaking | ASR (↓), RtA (↑) | Gen. | ◐ |
| R.1 | VQA for Artistic Style images | Score (↑) | Gen. | ◐ |
| R.2 | VQA for Sensor Style images | Score (↑) | Gen. | ● |
| R.3 | Sentiment Analysis for OOD texts | Accuracy (↑) | Dis. | ○ |
| R.4 | Image Captioning under Untarget attack | Accuracy (↑) | Gen. | ◐ |
| R.5 | Image Captioning under Target attack | Attack Success Rate (↓) | Gen. | ◐ |
| R.6 | Textual Adversarial Attack | Accuracy (↑) | Dis. | ○ |
| F.1 | Stereotype Content Detection | Containing Rate (↓) | Gen. | ● |
| F.2 | Agreement on Stereotypes | Agreement Percentage (↓) | Dis. | ◐ |
| F.3 | Classification of Stereotypes | Accuracy (↑) | Dis. | ○ |
| F.4 | Stereotype Query Test | RtA (↑) | Gen. | ◐ |
| F.5 | Preference Selection in VQA | RtA (↑) | Gen. | ● |
| F.6 | Profession Prediction | Pearson's correlation (↑) | Gen. | ◐ |
| F.7 | Preference Selection in QA | RtA (↑) | Gen. | ● |
| P.1 | Visual Privacy Recognition | Accuracy, F1 (↑) | Dis. | ○ |
| P.2 | Privacy-sensitive QA Recognition | Accuracy, F1 (↑) | Dis. | ○ |
| P.3 | InfoFlow Expectation | Pearson's Correlation (↑) | Gen. | ○ |
| P.4 | PII Query with Visual Cues | RtA (↑) | Gen. | ◐ |
| P.5 | Privacy Leakage in Vision | RtA (↑), Accuracy (↑) | Gen. | ◐ |
| P.6 | PII Leakage in Conversations | RtA (↑) | Gen. | ◐ |
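Several tasks in the list report RtA (read here as the refuse-to-answer rate) and rely at least partly on rule-based evaluation by keyword matching. As a rough illustration of that idea, the sketch below flags a response as a refusal when it contains one of a few marker phrases and reports the fraction of flagged responses. The keyword list, the function names, and the sample responses are hypothetical and are not taken from the benchmark's own evaluators.

```python
from typing import Iterable, Sequence

# Hypothetical refusal markers for illustration; not the benchmark's actual list.
REFUSAL_KEYWORDS = ("i'm sorry", "i cannot", "i can't", "i am unable", "as an ai")


def is_refusal(response: str, keywords: Sequence[str] = REFUSAL_KEYWORDS) -> bool:
    """Return True if the response contains any refusal marker (case-insensitive)."""
    text = response.lower()
    return any(k in text for k in keywords)


def refuse_to_answer_rate(responses: Iterable[str]) -> float:
    """Fraction of responses flagged as refusals; 0.0 for an empty input."""
    flags = [is_refusal(r) for r in responses]
    return sum(flags) / len(flags) if flags else 0.0


# Made-up model outputs: two refusals and two answers.
responses = [
    "I'm sorry, but I can't help with that request.",
    "The phone number on the card is 555-0100.",
    "I cannot share personal information about this person.",
    "Sure, here is a description of the image.",
]
print(refuse_to_answer_rate(responses))  # 0.5
```

Plain substring matching is cheap and reproducible, but it misses refusals phrased differently, which is presumably why some tasks combine it with a classifier or GPT-4 (the ◐ entries).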

⚛️ Overall Results

  • Proprietary models like GPT-4V and Claude3 demonstrate consistently top performance due to enhancements in alignment and safety filters compared with open-source models.
  • A global analysis reveals a correlation coefficient of 0.60 between the general capabilities and the trustworthiness of MLLMs, indicating that stronger general capabilities may improve trustworthiness to some extent.
  • Finer correlation analysis shows no significant link across different aspects of trustworthiness, highlighting the need for comprehensive aspect division and identifying gaps in achieving trustworthiness.
result
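To make the correlation figure concrete, the sketch below computes a correlation coefficient between per-model general-capability scores and trustworthiness scores. It assumes Pearson's r, which the text above does not confirm for the global analysis, and every score in the example is made up for illustration rather than taken from the paper.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of centered products and squares (the 1/n factors cancel in the ratio).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model scores (one entry per model), not results from the paper.
general = [55.0, 62.5, 70.1, 48.3, 81.0]
trust = [60.2, 58.9, 72.4, 51.0, 77.5]
r = pearson_r(general, trust)
print(f"Pearson r = {r:.2f}")
```

A value near 1 would mean the two rankings move together almost linearly, so a moderate value such as the reported 0.60 suggests that general capability explains only part of the variation in trustworthiness.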

:black_nib: Citation

If you find our work helpful for your research, please consider citing our work.

@article{zhang2024benchmarking,
  title={Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study},
  author={Zhang, Yichi and Huang, Yao and Sun, Yitong and Liu, Chang and Zhao, Zhe and Fang, Zhengwei and Wang, Yifan and Chen, Huanran and Yang, Xiao and Wei, Xingxing and others},
  journal={arXiv preprint arXiv:2406.07057},
  year={2024}
}