ColonPert: Benchmarking MLLMs on Reliability in Colonoscopy Tasks

December 9, 2025 ยท View on GitHub

Important

๐Ÿ“ข To test the MLLMs' performance under challenging types of human perturbation, we developed a subset of tests called ColonPert. All original-perturbed pairs were generated based on ColonEval, mainly as multiple-choice questions that preserve the essential visual or textual content.

1. Run ColonPert with Your MLLMs

If you want to use ColonPert to evaluate various models, please first refer to the official code of each model for inference. Here we provide two examples for demonstration: one ๐Ÿ“open-source and one ๐Ÿ“closed-source model. For the models we evaluated and the results in this article, please refer to ๐Ÿ“here.

We assume that you have followed the ๐Ÿ“instructions to download the ColonPert dataset and organized it as follows:

๐Ÿ“ cache/
โ””โ”€โ”€ ๐Ÿ“ data/
    โ”œโ”€โ”€ ๐Ÿ“ JSON/                                  # all annotation *.json files
        โ”œโ”€โ”€ ๐Ÿ“ ColonPert/                         # evaluation JSONs for benchmarking MLLM generalizability
            โ”œโ”€โ”€ TestA_on_image_text_masking.json
            โ”œโ”€โ”€ TestB_on_image_misleading_text.json
            โ”œโ”€โ”€ TestC_case_contradicting_instruction.json
            โ””โ”€โ”€ TestD_emotion_driven_decision_bias.json

1.1. open-source demo

We demonstrate the workflow using MedGemma. If you wish to test other models, you may adapt the script accordingly.

  • Firstly, download the MedGemma-4B checkpoints from ๐Ÿค—HuggingFace and place them in cache/exp/ColonPert.
  • Secondly, set EXP_MODEL_ID to the path of your model checkpoints, such as cache/exp/ColonPert/medgemma-4b-it.
  • Configure IMAGE_BASE_PATH and ROOT_PATH for the images and JSON files, respectively.
  • Prepare the environment following the ๐Ÿ”—MedGemma's instructions.
  • Finally, run inference bash ColonPert/infer_open_source_demo.sh

closed-source demo

Here we use ๐Ÿ”—o4-mini as an example.

  • Obtain the API key from OpenAI.
  • Set MODEL=o4-mini-2025-04-16 as the model name.
  • Then set IMAGE_PATH and ROOT_PATH for the images and JSON files.
  • Run inference: bash ColonPert/infer_closed_source_demo.sh

2. One-click Evaluation

To evaluate model reliability on ColonPert, simply modify a few parameters in ColonPert/eval_reliability.sh as follows:

  • Place your checkpoint under cache/checkpoints/pert-exp.

  • Set EXP_MODEL_ID to the path of the model you want to evaluate.

  • Start evaluation:

    conda activate coloneval
    bash ColonPert/eval_reliability.sh
    
  • An example case from our evaluation script is as follows:

    #!/bin/bash
    
    EXP_MODEL_ID=cache/exp/robust-exp/medgemma-4b-it
    EVAL_MODE=pert
    
    python ColonEval/eval_engine.py \
        --task_id A \
        --eval_mode $EVAL_MODE \
        --input_file $EXP_MODEL_ID/pred/TestA_on_image_text_masking.json \
        --output_file $EXP_MODEL_ID/pred/Task_A.txt > $EXP_MODEL_ID/pred/eval_task_A_log.txt 2>&1 &
    

3. Data statistics

Here we present the ColonPert statistics, including the categories and VQA entries for each task.

TestVQA entries
Test.A - On image text masking57
Test.B - On image misleading text100
Test.C - Case contradicting instruction20
Test.D - Emotion driven decision bias80

4. Benchmarking Results

ColonPert evaluates the reliability of MLLMs by analyzing accuracy variations under four human-induced perturbations. All prediction files and per-task scores are available on ๐Ÿ”—Google Drive. Table 1 reports the performance of six leading MLLMs.


Table 1: Reliability test of six multimodal large language models (MLLMs) on ColonPert.

5. Visual cases

Figure 1 illustrates representative examples from ColonPert and the model's performance under four human-induced perturbations, including on-image text in visual prompts (Test.A & Test.B) and explicitly through textual prompts (Test.C & Test.D).

  • Test.A: Obscure embedded texts (e.g., device information) in images to test text bias.
  • Test.B: Overlay erroneous texts in image corners to test resistance to visual textual interference.
  • Test.C: Inject case-contradicting descriptions (e.g., describing malignant as "benign") into textual prompts.
  • Test.D: Incorporate patient emotional states (e.g., anxiety, fear) into prompts to test for emotional bias.


Figure 1: Illustration of four types of human-induced perturbations.