How to Download and Prepare Our Dataset

December 9, 2025


Figure 1: Visualization of 76 clinical categories from our dataset.

1. Data Overview of Our Colon-X Project

Important

📌 Revisiting. Building upon the most comprehensive multimodal colonoscopy database ColonVQA, we propel a pivotal transition in intelligent colonoscopy, evolving from multimodal understanding (ColonEval & ColonPert) to clinical reasoning (ColonReason & ColonR1). These efforts collectively illuminate the path to neXt-generation advances in clinical COLONoscopy and broader medical applications.

This page provides a step-by-step guide to downloading and preparing the four data parts (ColonVQA, ColonEval, ColonPert, and ColonReason) used in all experiments presented in our research paper.

1.1. Data access

We first introduce ColonVQA, the main multimodal dataset in our Colon-X project. The colonoscopy images were sourced from 32 public datasets, and due to strict licensing, we cannot share the datasets or download links with you directly. Here are the steps to obtain the complete dataset:

  • Requesting access to raw images. We recommend first requesting the images from the providers of these datasets. Please follow the official links below to request access to the 32 public source datasets.

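    | Data ID  | Data Name           | Train  | Val   | Test   | URL              |
    |----------|---------------------|--------|-------|--------|------------------|
    | Data #1  | CAD-CAP             | 551    | 276   | 92     | request by email |
    | Data #2  | CVC-ClinicDB        | 550    | -     | 62     | Link             |
    | Data #3  | CVC-ColonDB         | -      | -     | 380    | Link             |
    | Data #4  | EDD2020             | 111    | 18    | 57     | Link             |
    | Data #5  | ETIS-Larib          | -      | -     | 196    | Link             |
    | Data #6  | PICCOLO             | 2,127  | 872   | 325    | Link             |
    | Data #7  | PolypGen            | 1,847  | 511   | 463    | Link             |
    | Data #8  | PS-NBI2K            | 1,343  | -     | 337    | Link             |
    | Data #9  | Kvasir              | 1,943  | 200   | 677    | Link             |
    | Data #10 | Hyper-Kvasir        | 3,031  | 507   | 1,515  | Link             |
    | Data #11 | ASEI                | 1,257  | 211   | 625    | Link             |
    | Data #12 | Kvasir-Capsule      | 4,606  | 767   | 2,305  | Link             |
    | Data #13 | GastroVision        | 2,024  | 338   | 1,014  | Link             |
    | Data #14 | SUN-SEG             | 19,544 | -     | 29,592 | Link             |
    | Data #15 | WCEBleedGen         | 398    | 66    | 199    | Link             |
    | Data #16 | Capsule Vision 2024 | 5,501  | -     | 2,362  | Link             |
    | Data #17 | KID1                | 44     | 9     | 20     | Link             |
    | Data #18 | KID2                | 223    | 37    | 111    | Link             |
    | Data #19 | in vivo             | 1,844  | 124   | 846    | Link             |
    | Data #20 | KUMC                | 27,048 | 4,214 | 4,719  | Link             |
    | Data #21 | CP-CHILD            | 1,100  | -     | 300    | Link             |
    | Data #22 | LIMUC               | 9,590  | -     | 1,686  | Link             |
    | Data #23 | SSL-CPCD            | 151    | 25    | 75     | Link             |
    | Data #24 | MedFMC              | 795    | 133   | 397    | Link             |
    | Data #25 | WCE Colon Disease   | 1,600  | 1,000 | 400    | Link             |
    | Data #26 | CPC-Paired          | 208    | 34    | 88     | Link             |
    | Data #27 | ColonoscopicDS      | 9,212  | 2,070 | 3,843  | Link             |
    | Data #28 | PolypDB             | 2,361  | 394   | 1,179  | Link             |
    | Data #29 | Kvasir-Instrument   | 452    | -     | 113    | Link             |
    | Data #30 | LDPolyVideo         | 19,876 | -     | 11,639 | Link             |
    | Data #31 | Endo4IE             | 2,741  | 132   | 1,140  | Link             |
    | Data #32 | Nerthus             | 3,315  | 552   | 1,658  | Link             |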
  • Accessing our JSON files. Once you have requested and downloaded the original images, download the VQA JSON files provided in our project, which cover four data parts: ColonVQA/ColonEval/ColonPert/ColonReason (🤗 Huggingface | Gdrive).

  • Reorganize file structure. If you downloaded the images from a public link, their folder layout may differ from the one used in our project. Please use the paths specified in the JSON files as a guide and arrange the images accordingly so that your directory structure aligns with what the Colon-X project expects.

  • ⭐️ Recommended -- Too much hassle for you? We also provide a fully organized version of the dataset. If you prefer not to restructure the files yourself, you can request access by filling out this 🈸 Google form. If you have any questions during the data application process, feel free to contact us at 📧 gepengai.ji@gmail.com & 📧 jingyi.liu2657@gmail.com.
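
If you reorganize the images yourself, a quick consistency check can show which files are still misplaced. The snippet below is only a minimal sketch under assumptions that are not confirmed by the project: each annotation file holds a JSON list of entries, each entry's "image" value is a path relative to the data root, and the data root is the `cache/data` folder described in Section 1.2. The function name `find_missing_images` is hypothetical.

```python
import json
from pathlib import Path


def find_missing_images(json_path, image_root):
    """Return the sorted, de-duplicated image paths that one annotation file
    references but that do not exist under `image_root`.

    Assumptions (not confirmed by the project): the file holds a JSON list of
    entries, and each entry's "image" value is relative to `image_root`.
    """
    with open(json_path, "r", encoding="utf-8") as f:
        entries = json.load(f)

    root = Path(image_root)
    missing = set()
    for entry in entries:
        rel_path = entry["image"]
        if not (root / rel_path).is_file():
            missing.add(rel_path)
    return sorted(missing)


if __name__ == "__main__":
    # Hypothetical locations that follow the layout in Section 1.2.
    data_root = Path("cache/data")
    train_dir = data_root / "JSON" / "ColonVQA" / "train"
    if train_dir.is_dir():
        for json_file in sorted(train_dir.glob("*.json")):
            missing = find_missing_images(json_file, data_root)
            print(f"{json_file.name}: {len(missing)} missing image(s)")
```

Any path reported as missing points to an image that still needs to be moved or renamed to match the annotation files.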

1.2. Directory Structure of Four Data Parts

At this point, we assume you have obtained the full data resources, which contain four parts:

  • ColonVQA (main part): The main multimodal dataset, containing 1.1M+ visual question answering pairs across 76 clinical findings and 18 task types. Designed for instruction tuning and benchmarking multimodal tasks in colonoscopy.
  • ColonEval: A dedicated evaluation suite with curated test sets covering diverse clinical tasks. Built to assess the generalization of MLLMs in real clinical scenarios.
  • ColonPert: A perturbation-based dataset consisting of origin–perturbation image pairs. Used to evaluate how reliably MLLMs handle human-induced perturbations.
  • ColonReason: A reasoning-focused dataset that provides step-by-step or chain-of-thought style annotations, enabling the study and training of explicit reasoning capabilities for colonoscopy decision-making.

As shown below, your final directory structure should look something like this:

📁 cache/
└── 📁 data/
    ├── 📁 JSON/                                   # all annotation *.json files
    │   ├── 📁 ColonVQA/                           # main multimodal dataset (1.1M+ VQA entries)
    │   │   ├── 📁 train/                          # training split with task-specific JSON files
    │   │   │   ├── 1_Grading_of_Bowel_Cleanliness_Train.json
    │   │   │   └── ...
    │   │   ├── 📁 val/                            # validation split
    │   │   │   └── ...
    │   │   └── 📁 test/                           # testing split
    │   │       └── ...
    │   ├── 📁 ColonEval/                          # evaluation JSONs for benchmarking MLLM generalizability
    │   ├── 📁 ColonPert/                          # origin–perturbation pairs for robustness evaluation
    │   └── 📁 ColonReason/                        # reasoning-style annotations (step-by-step / chain-of-thought)
    │
    ├── 📁 Positive-images/                        # images with positive clinical findings
    │   └── 📁 ASEI/
    │       ├── 📁 Train/
    │       │   └── 📁 polyp/
    │       │       ├── 2.jpg
    │       │       └── ...
    │       ├── 📁 Val/
    │       │   └── ...
    │       └── 📁 Test/
    │           └── ...
    │
    ├── 📁 Negative-images/                        # images without pathological findings
    │   └── ...
    │
    ├── 📁 Misleading-text-images/                 # perturbed images with misleading text (used in ColonPert)
    │   └── ...
    │
    └── 📁 Text-mask-images/                       # text-masked versions (used in ColonPert)
        └── ...
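
To confirm that a local copy matches this layout, the folder names from the tree can be checked programmatically. This is a minimal sketch, not an official tool: the helper name `missing_dirs` is hypothetical, and the list only covers the folders named explicitly in the tree (elided sub-folders shown as `...` are not checked).

```python
from pathlib import Path

# Folder names copied from the directory tree; paths are relative to cache/data/.
EXPECTED_DIRS = [
    "JSON/ColonVQA/train",
    "JSON/ColonVQA/val",
    "JSON/ColonVQA/test",
    "JSON/ColonEval",
    "JSON/ColonPert",
    "JSON/ColonReason",
    "Positive-images",
    "Negative-images",
    "Misleading-text-images",
    "Text-mask-images",
]


def missing_dirs(data_root):
    """Return the expected sub-folders that do not exist under `data_root`."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    absent = missing_dirs("cache/data")
    if absent:
        print("Missing folders:")
        for rel in absent:
            print(f"  {rel}")
    else:
        print("All expected folders are present.")
```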

2. ColonVQA

This section describes the general data format of ColonVQA, our main multimodal dataset. For the derived sub-datasets, please refer to the specific documents stored in ./docs/*.

2.1. Illustration of Data Format

Important

📌 Note. Our data format is compatible with most MLLM training frameworks that support conversational-style datasets. This modular design also makes it easy to extend -- whether by adding new tasks, introducing new annotation types, or incorporating additional imaging modalities in the future.

  • All JSON annotation files share a unified structure across all colonoscopy-related tasks (including diagnosis, quality assessment, detection, report generation, etc.). This unified design enables vision–language interaction and simplifies data loading for different tasks.
    • For complete task definitions, please refer to 🔗 task_card.pdf or inspect the JSON files directly.
  • Field Descriptions
    • "id": Relative path pointing to the associated image. Commonly used by dataloaders to locate the visual input.

    • "image": Typically identical to id, as a backup.

    • "conversations": An ordered list representing a multi-turn dialogue. Each element includes:

      • "from": Indicates the speaker role, either "human" (prompt) or "gpt" (response).
      • "value": Text content of that turn. "human" turns always start with "<image>", denoting that the visual input is provided to the model. Questions are randomly selected from predefined templates corresponding to different tasks, and "gpt" turns hold the reference answer.
      {
          "id": "relative/path/to/an/image",
          "image": "relative/path/to/an/image",
          "conversations": [
              {
                  "from": "human",
                  "value": "<image>\nA randomly selected question from 5 templates"
              },
              {
                  "from": "gpt",
                  "value": "The answer"
              }
          ]
      }
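
The field descriptions above translate into a simple structural check. The following is a minimal sketch rather than part of the project's tooling: the function name `validate_entry` is hypothetical, and it checks only what is stated above (the three top-level fields, the two speaker roles, and the "<image>" prefix on "human" turns).

```python
VALID_ROLES = {"human", "gpt"}


def validate_entry(entry):
    """Return a list of problems found in one annotation entry (empty if none)."""
    problems = [
        f"missing field: {key}"
        for key in ("id", "image", "conversations")
        if key not in entry
    ]
    if problems:
        return problems

    for index, turn in enumerate(entry["conversations"]):
        role = turn.get("from")
        text = turn.get("value", "")
        if role not in VALID_ROLES:
            problems.append(f"turn {index}: unexpected role {role!r}")
        elif role == "human" and not text.startswith("<image>"):
            problems.append(f"turn {index}: human turn does not start with <image>")
    return problems


if __name__ == "__main__":
    # Placeholder entry mirroring the example format; not real data.
    sample = {
        "id": "relative/path/to/an/image",
        "image": "relative/path/to/an/image",
        "conversations": [
            {"from": "human", "value": "<image>\nA randomly selected question"},
            {"from": "gpt", "value": "The answer"},
        ],
    }
    print(validate_entry(sample))  # prints []
```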
      

2.2. Data statistics

  • Data-Category Table: We detail the index number (DATA#) of each dataset, the counts of images in the training-validation-test splits, and the types of annotations: category tags (Cat.) and bounding boxes (Bbx.). Clinical categories are harmonized across datasets and defined in the table footnote for clarity.


Table 2: Overview of colonoscopy imaging data included in ColonVQA.

  • Category-Task Table: We report the count of VQA pairs for 76 categories and 18 tasks in our ColonVQA. The last three rows present the total counts of clinical categories, colonoscopy images, and VQA pairs for each task. The full names corresponding to each category abbreviation are provided in the footnote of Table 2.


Table 3: Overview of category-task statistics in ColonVQA.
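
To compare a local copy against the reported statistics, the per-file counts can be recomputed from the annotation files. This is a rough sketch under stated assumptions, none of which is confirmed by the project: each `*.json` file in a split folder corresponds to one task and holds a JSON list of entries in the format of Section 2.1, and every "gpt" turn closes one VQA pair. The function name `summarize_split` is hypothetical.

```python
import json
from pathlib import Path


def summarize_split(split_dir):
    """Count VQA pairs and distinct images for every annotation file in a folder.

    Assumptions: each *.json file is a list of entries with "image" and
    "conversations" fields, and every "gpt" turn closes one VQA pair.
    """
    summary = {}
    for json_file in sorted(Path(split_dir).glob("*.json")):
        with open(json_file, "r", encoding="utf-8") as f:
            entries = json.load(f)
        n_pairs = sum(
            1
            for entry in entries
            for turn in entry["conversations"]
            if turn["from"] == "gpt"
        )
        n_images = len({entry["image"] for entry in entries})
        summary[json_file.stem] = {"vqa_pairs": n_pairs, "images": n_images}
    return summary


if __name__ == "__main__":
    # Hypothetical location that follows the layout in Section 1.2.
    train_dir = Path("cache/data/JSON/ColonVQA/train")
    if train_dir.is_dir():
        for task, counts in summarize_split(train_dir).items():
            print(f"{task}: {counts['vqa_pairs']} VQA pairs, {counts['images']} images")
```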