ColonEval: Evaluating multimodal understanding behaviours of MLLM

April 13, 2026

📢 To facilitate rapid evaluation of leading multimodal large language models (MLLMs), we derived a subset, ColonEval, from the test set of ColonVQA. This subset encompasses 4,568 VQA entries across 16 distinct tasks, covering a wide range of clinical scenarios in colonoscopy. ColonEval is designed to assess the generalization capabilities of MLLMs in the context of colonoscopy.

1. Get Started with Inferring MLLMs on ColonEval

Important

📢 Recommended! Feel free to skip this inference step, because we've already done all the heavy lifting for you. Please check here to obtain all prediction results and the evaluation scores reported in our article.

To evaluate an MLLM using ColonEval, please first follow the inference instructions provided by the official repository of the respective MLLM. Here, we provide two example demos, one for an open-source model and one for a closed-source model, to illustrate the workflow.

Before running any MLLMs, please download the ColonEval dataset and organize it as follows:

πŸ“ cache/
└── πŸ“ data/
    β”œβ”€β”€ πŸ“ JSON/                                   # all annotation *.json files
        β”œβ”€β”€ πŸ“ ColonEval/                          # evaluation JSONs for benchmarking MLLM generalizability
            β”œβ”€β”€ Task1_predict_by_your_model.json   # testing split with task-specific JSON files
            └── ...

Then create and activate a conda environment. Our default setup uses CUDA 11.8; we have not tested other versions and cannot guarantee compatibility.

conda create -n coloneval python=3.10 -y
conda activate coloneval

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r ColonEval/requirements.txt

1.1. An open-source demo

We provide a MedGemma inference example to demonstrate how we perform this process. If you require other open-source models, please follow their respective inference protocols.

  • Firstly, download the MedGemma-4B checkpoints from 🤗HuggingFace and place them in the cache/exp/ColonEval path.
  • Secondly, set EXP_MODEL_ID to the path of your model checkpoints, such as cache/exp/ColonEval/medgemma-4b-it.
  • Then set IMAGE_BASE_PATH and ROOT_PATH to the root directories of the images and JSON files, respectively.
  • Next, please configure the environment according to 🔗MedGemma's instructions.
  • Finally, just run bash ColonEval/infer_open_source_demo.sh to start inference.

1.2. A closed-source demo

We provide an o4-mini inference example to demonstrate how we perform this process. If you require other closed-source models, please follow their respective inference protocols.

  • Firstly, please purchase the API quota from the model provider, for example, 🔗o4-mini.
  • Secondly, set MODEL as the model name. Use the officially published name, such as o4-mini-2025-04-16.
  • Then set IMAGE_BASE_PATH and ROOT_PATH to the root directories of the images and JSON files, respectively.
  • Finally, just run bash ColonEval/infer_closed_source_demo.sh to start inference.

2. One-command Evaluation Code

If you are using ColonEval to test the generalization ability of an MLLM, you only need to modify a few configurations in ColonEval/eval_generalizability.sh before running it.

  • If you performed the inference as described above, the prediction files should already be in the cache/exp/ColonEval/Model_Name/pred path.

  • Otherwise, place the prediction files you want to evaluate in the cache/exp/ColonEval/Model_Name path.

  • Set EXP_MODEL_ID to the path of the model you want to evaluate. See here for information on the models we evaluated.

  • Now just run bash ColonEval/eval_generalizability.sh to start the evaluation.

  • When the process finishes, the evaluation results will be written to $EXP_MODEL_ID/pred/.

  • An evaluation example is as follows:

    #!/bin/bash
    
    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    
    EXP_MODEL_ID=cache/exp/ColonEval/medgemma-4b-it
    EVAL_MODE=pilot
    
    python ColonEval/eval_engine.py \
        --task_id 1 \
        --eval_mode $EVAL_MODE \
        --input_file $EXP_MODEL_ID/pred/pred_Task_1_ColonEval.json \
        --output_file $EXP_MODEL_ID/pred/Task_1.txt > $EXP_MODEL_ID/pred/eval_task_1_log.txt 2>&1
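To run the same evaluation for every task, the command above can be generated per task. The following is a minimal Python sketch, not the repository's own tooling. It reuses the flags shown in the example, and it assumes (this is not confirmed by the source) that the file names for the other tasks follow the Task 1 pattern. The helper name build_eval_command is hypothetical.

```python
from pathlib import PurePosixPath

# Task IDs as listed in the data statistics table (13 is not listed there).
TASK_IDS = [*range(1, 13), *range(14, 18)]


def build_eval_command(exp_model_id: str, task_id: int, eval_mode: str = "pilot") -> list[str]:
    """Return the argument list for one eval_engine.py call (hypothetical helper)."""
    pred_dir = PurePosixPath(exp_model_id) / "pred"
    return [
        "python", "ColonEval/eval_engine.py",
        "--task_id", str(task_id),
        "--eval_mode", eval_mode,
        # Assumed naming pattern, generalized from the Task 1 example.
        "--input_file", str(pred_dir / f"pred_Task_{task_id}_ColonEval.json"),
        "--output_file", str(pred_dir / f"Task_{task_id}.txt"),
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them, so they can be reviewed first.
    for tid in TASK_IDS:
        print(" ".join(build_eval_command("cache/exp/ColonEval/medgemma-4b-it", tid)))
```

Each printed line can be pasted into a shell, or the argument list can be passed to subprocess.run once the file names have been checked against your pred directory.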
    

3. Benchmarking Results

The table below reports the generalizability of 22 MLLMs across four task categories and their integration within ColonEval. Accuracy is computed using a weighted arithmetic mean, with weights proportional to the sample count of each task category. The top three scores in each of the open-source and closed-source camps are highlighted using distinct colors (1st, 2nd, 3rd). Prediction results for each model on each task are available on 🔗Google Drive.


Table 1: Generalizability of 22 multimodal large language models (MLLMs) on ColonEval.
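As an illustration of the weighted arithmetic mean described above, here is a minimal Python sketch. The function name weighted_accuracy and the category names and numbers in the example are hypothetical; they are not results from the article.

```python
def weighted_accuracy(per_category: dict[str, tuple[float, int]]) -> float:
    """Weighted arithmetic mean of accuracies.

    per_category maps a category name to (accuracy, sample_count);
    each accuracy is weighted by its sample count.
    """
    total = sum(count for _, count in per_category.values())
    if total == 0:
        raise ValueError("no samples to average over")
    return sum(acc * count for acc, count in per_category.values()) / total


if __name__ == "__main__":
    # Illustrative numbers only.
    example = {"category_a": (0.80, 100), "category_b": (0.50, 300)}
    print(weighted_accuracy(example))
```

With these weights, a category with more samples pulls the overall score toward its own accuracy, which is why the result can differ from a plain average of the per-category accuracies.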

4. Data statistics

Here we present the ColonEval statistics, including the categories and VQA entries for each task.

Task                                                   Categories   VQA entries
1  Grading of Bowel Cleanliness                        6            50
2  Colonoscopy Completion Landmark Identification      3            50
3  Rectum Retroflexion Identification                  2            50
4  Operative Status Recognition                        4            50
5  Imaging quality assessment                          3            50
6  Imaging Modality Classification                     5            71
7  Instrument Recognition                              5            50
8  Bleeding Warning                                    5            50
9  Lesion diagnosis yes or no mode                     41           1258
10 Lesion diagnosis single choice mode                 41           629
11 Lesion diagnosis open vocabulary mode               41           629
12 Referring Expression Generation                     30           707
14 NICE Criteria                                       3            50
15 PARIS Criteria                                      4            411
16 Polyp Sizing                                        4            414
17 Grading of Ulcerative Colitis                       7            50
Total                                                  76           4568

5. Use ColonEval to evaluate your own model

Note

💡 Finally, we warmly welcome everyone to participate in the evaluation. If needed, you can also send us your results in a package so we can conduct the evaluation. Please contact us via email: 📧 gepengai.ji@gmail.com.

If you want to use ColonEval to evaluate your own model, please follow these steps.

  • Firstly, organize your prediction file structure as follows.

    πŸ“ cache/
    └── πŸ“ data/
        β”œβ”€β”€ πŸ“ JSON/                                   # all annotation *.json files
            β”œβ”€β”€ πŸ“ ColonEval/                          # evaluation JSONs for benchmarking MLLM generalizability
                β”œβ”€β”€ Task1_predict_by_your_model.json   # testing split with task-specific JSON files
                └── ...
    
  • Secondly, standardize the format of your prediction files as follows:

    • "id": Relative path pointing to the associated image. Commonly used by dataloaders to locate the visual input.

    • "image": Typically identical to id, as a backup.

    • "conversations": An ordered list representing a multi-turn dialogue. Each element includes:

      • "from": Indicates the speaker role, one of "human" (prompt), "gpt" (response), or "prediction" (prediction).
      • "value": Text content of that turn. "human" represents the question or instruction, "gpt" represents the reference, and "prediction" represents the model's prediction.
          {
              "id": "relative/path/to/an/image",
              "image": "relative/path/to/an/image",
              "conversations": [
                  {
                      "from": "human",
                      "value": "question/instruction"
                  },
                  {
                      "from": "gpt",
                      "value": "reference"
                  },
                  {
                      "from": "prediction",
                      "value": "your model prediction"
                  }
              ]
          }
          ......(more entries)
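
Before submitting predictions, it may help to check each entry against the format above. The following is a minimal Python sketch, not part of the ColonEval code. The names validate_entry and validate_file are hypothetical, and the sketch assumes that a prediction file is a JSON list of such entries, which the source does not state explicitly.

```python
import json

REQUIRED_KEYS = ("id", "image", "conversations")
REQUIRED_ROLES = ("human", "gpt", "prediction")


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one prediction entry (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    turns = entry.get("conversations", [])
    if not isinstance(turns, list):
        return problems + ["'conversations' is not a list"]
    roles = []
    for index, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {index} is not an object")
            continue
        roles.append(turn.get("from"))
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {index} has no string 'value'")
    problems += [f"missing turn from: {role}" for role in REQUIRED_ROLES if role not in roles]
    return problems


def validate_file(path: str) -> dict[int, list[str]]:
    """Map entry index to its problems. Assumes the file holds a JSON list of entries."""
    with open(path, encoding="utf-8") as handle:
        entries = json.load(handle)
    report = {}
    for index, entry in enumerate(entries):
        problems = validate_entry(entry)
        if problems:
            report[index] = problems
    return report
```

An empty result from validate_file means every entry has the three keys and one turn for each of the "human", "gpt", and "prediction" roles; it does not check that the image paths exist.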