How to train, infer, and evaluate ColonR1

January 2, 2026


Figure 1: Details of our colonoscopy-specific reasoning model, ColonR1.

🏁 Installation guide

Important

📌 Troubleshooting guide. If you encounter any issues during installation or execution, please refer to our Troubleshooting Guide for solutions to common problems.

  • First, clone the repository and enter its directory:

    git clone git@github.com:ai4colonoscopy/Colon-X.git
    cd Colon-X
    
  • Create and activate a Conda environment. Our default setup uses CUDA 11.8; other CUDA versions are not guaranteed to work.

    conda create -n colonr1 python=3.10 -y
    conda activate colonr1
    
    pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
    pip install flash-attn --no-build-isolation
    pip install -r ColonR1/requirements.txt
    
  • Download the pretrained weights for inference.

  • Prepare the data. For details, please refer to here. We assume you have already done this.

  • Finally, double-check that your directory has the following structure:

    📁 cache/                                   # all cached data, weights, and structured dataset files
    ├── 📁 checkpoints/                         # trained ColonR1 model checkpoints
    │   └── 📁 ColonR1-Qwen2.5-VL-GRPO-thinking-StageII
    │
    ├── 📁 data/                                # dataset root containing all images and annotations
    │   ├── 📁 Positive-images/                 # images with positive clinical findings (polyps, lesions, etc.)
    │   ├── 📁 Negative-images/                 # normal images without pathology
    │   ├── 📁 JSON/                            # annotation files for training / validation / testing
    │   │   ├── 📁 Train-Val-merge/             # combined training + validation JSONs
    │   │   └── 📁 Test/                        # test JSONs for inference and evaluation
    │
    ├── 📁 download-weights/                    # downloaded pretrained model weights
    │   ├── 📁 Qwen2.5-VL-3B-Instruct
    │   ├── 📁 gpt-oss-20b
    │   └── 📁 all-MiniLM-L6-v2
    │
    └── 📁 ColonR1/                             # main ColonR1 codebase for training, inference, and evaluation
    

🚅 Training

Before starting training, please update the configs as needed:

  • Set S1_OUTPUT_FILE and S1_OUTPUT_DIR: the output name and path for Stage-I.
  • Set IMAGE_ROOT and S1_JSON_FILE: typically cache/data and ColonReason_GRPO.json.
  • Set S1_BASE_MODEL: path to the Qwen2.5-VL-3B-Instruct weights.
  • Set S2_OUTPUT_FILE and S2_OUTPUT_DIR: the output name and path for Stage-II.
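
As an illustration, the settings above might look like the following inside the training script. This is a sketch under assumptions: the Stage-I and Stage-II output names and directories are hypothetical placeholders, `S1_BASE_MODEL` assumes the weights sit under `cache/download-weights/` as in the directory tree, and the real script may expect different path forms (for example, a full path for `S1_JSON_FILE`).

```shell
# Illustrative values only; adapt them to your own setup.
IMAGE_ROOT=cache/data
S1_JSON_FILE=ColonReason_GRPO.json
S1_BASE_MODEL=cache/download-weights/Qwen2.5-VL-3B-Instruct

# Hypothetical output names and locations (not taken from the repository).
S1_OUTPUT_FILE=my-colonr1-stage1
S1_OUTPUT_DIR=cache/checkpoints/my-exp
S2_OUTPUT_FILE=my-colonr1-stage2
S2_OUTPUT_DIR=cache/checkpoints/my-exp
```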

Then start training:

bash ColonR1/script/train/ColonR1_grpo_thinking.sh

💭 Inference

Single-image Inference

To chat with ColonR1 about a single image, follow these steps:

  • Set MODEL_PATH and IMAGE_PATH to the paths of the saved checkpoints and image you want to evaluate on, respectively.
  • Run bash ColonR1/script/infer_eval/infer_single.sh, then enter your instruction and the result will be printed on the screen.

Batch Inference

We provide one-click inference code. If you use ColonEval or follow the same data organization format, you only need to modify a few settings in ColonR1/script/infer_eval/infer.sh to run inference.

Alternatively, you can run inference on your own data:

  • Set IMAGE_BASE_PATH and ROOT_PATH to cache/data and cache/data/JSON/Test, respectively.

  • Set EXP_MODEL_ID to the path of the model weights you want to run inference with.

  • Then run bash ColonR1/script/infer_eval/infer.sh to start inference.

  • An example of an inference script is as follows:

    #!/bin/bash
    
    IMAGE_BASE_PATH=cache/data                 # dataset root with images
    ROOT_PATH=cache/data/JSON/Test             # test annotation JSONs
    EXP_MODEL_ID=cache/checkpoints/ft-exp/ColonR1-Qwen2.5-VL-GRPO-thinking-StageII
    
    # Predictions and logs are written next to the checkpoint
    mkdir -p "$EXP_MODEL_ID/pred"
    
    export CUDA_VISIBLE_DEVICES=0
    
    # Run in the background; stdout and stderr go to the nohup log
    nohup python ColonR1/serve/inference.py \
        --model_path "$EXP_MODEL_ID" \
        --image_dir "$IMAGE_BASE_PATH" \
        --json_file "$ROOT_PATH/ColonEval/Task_1_ColonEval.json" \
        --output_path "$EXP_MODEL_ID/pred/pred_Task_1_ColonEval.json" > "$EXP_MODEL_ID/pred/nohup-pred_task1.txt" 2>&1 &
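
If the ColonEval folder holds several task files, the same call can be repeated for each one. The sketch below is not part of the repository: it assumes every `*.json` file under `$ROOT_PATH/ColonEval` is a task file and that predictions follow the `pred_<file name>` naming used in the example above. The helper names `pred_path_for` and `infer_all` are hypothetical. Unlike the example, it runs the tasks one after another in the foreground and writes one log per task.

```shell
#!/bin/sh
IMAGE_BASE_PATH=cache/data
ROOT_PATH=cache/data/JSON/Test
EXP_MODEL_ID=cache/checkpoints/ft-exp/ColonR1-Qwen2.5-VL-GRPO-thinking-StageII

# Map a task JSON path to its prediction path: <model>/pred/pred_<file name>
pred_path_for() {
    echo "$1/pred/pred_$(basename "$2")"
}

# Run inference once per task file, sequentially, logging each run.
infer_all() {
    for json in "$ROOT_PATH"/ColonEval/*.json; do
        [ -e "$json" ] || continue   # the glob matched nothing
        mkdir -p "$EXP_MODEL_ID/pred"
        out=$(pred_path_for "$EXP_MODEL_ID" "$json")
        python ColonR1/serve/inference.py \
            --model_path "$EXP_MODEL_ID" \
            --image_dir "$IMAGE_BASE_PATH" \
            --json_file "$json" \
            --output_path "$out" > "${out%.json}.log" 2>&1
    done
}

infer_all
```

Running sequentially keeps a single GPU from being shared by several jobs at once; if you have more devices, you could instead launch one background job per task with a different `CUDA_VISIBLE_DEVICES` value.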
    

Gradio Web Demo Inference

Note

What is Gradio? Gradio is an open-source Python library that allows you to quickly create customizable web-based interfaces for machine learning models. It enables users to interact with models through a user-friendly graphical interface, making it easier to demonstrate and test model capabilities without requiring extensive coding knowledge.

To launch the Gradio web demo for ColonR1, follow these steps:

conda activate colonr1
# `--model_path` should point to your ColonR1 model checkpoint
python ColonR1/serve/inference_gradio_web_demo.py --model_path cache/checkpoints/ft-exp/ColonR1-Qwen2.5-VL-GRPO-thinking-StageII

This will start a local web server, and you can access the demo by navigating to http://localhost:7860 in your web browser. You can upload colonoscopy images and interact with the ColonR1 model through the web interface.
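
To confirm from another terminal that the demo is listening, you can probe the address above. This is a minimal sketch that assumes `curl` is installed; the function name `demo_is_up` is hypothetical.

```shell
#!/bin/sh
# Return success if an HTTP server answers at the given URL within 5 seconds.
demo_is_up() {
    curl -sSf --max-time 5 -o /dev/null "$1" 2>/dev/null
}

if demo_is_up http://localhost:7860; then
    echo "Gradio demo is reachable"
else
    echo "Gradio demo is not reachable yet"
fi
```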

The image below shows an example of interacting with ColonR1 through the Gradio UI demo.

image

💯 Evaluation

  • To perform the evaluation, set EXP_MODEL_ID to the path of the model you want to evaluate.

  • Then, if you wish to use ColonEval for evaluation, set EVAL_MODE to pilot.

  • Finally, run the following command to begin the evaluation. (For ColonEval's environment configuration, please refer to here.)

    conda activate coloneval
    bash ColonR1/script/infer_eval/eval.sh
    
  • An example of an evaluation script is as follows:

    #!/bin/bash
    
    EXP_MODEL_ID=cache/checkpoints/ft-exp/ColonR1-Qwen2.5-VL-GRPO-thinking-StageII
    EVAL_MODE=pilot
    
    # Score the Task 1 predictions; all console output goes to the log file
    python ColonR1/serve/understanding_eval.py \
        --task_id 1 \
        --data_type reasoning \
        --eval_mode "$EVAL_MODE" \
        --input_file "$EXP_MODEL_ID/pred/pred_Task_1_ColonEval.json" \
        --output_file "$EXP_MODEL_ID/pred/Task_1.txt" > "$EXP_MODEL_ID/pred/eval_task_1_log.txt" 2>&1
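
Because the example sends all output to a log file, a failed run can go unnoticed. The sketch below wraps the same call with two checks. It is an illustration, not part of the repository, and the helper name `pred_ready` is hypothetical. It only verifies that the prediction file exists and is non-empty, not that its contents are valid.

```shell
#!/bin/sh
EXP_MODEL_ID=cache/checkpoints/ft-exp/ColonR1-Qwen2.5-VL-GRPO-thinking-StageII
EVAL_MODE=pilot
PRED="$EXP_MODEL_ID/pred/pred_Task_1_ColonEval.json"
LOG="$EXP_MODEL_ID/pred/eval_task_1_log.txt"

# Succeed only if the file exists and is non-empty.
pred_ready() {
    [ -s "$1" ]
}

if ! pred_ready "$PRED"; then
    echo "no predictions at $PRED; run inference first"
elif python ColonR1/serve/understanding_eval.py \
        --task_id 1 \
        --data_type reasoning \
        --eval_mode "$EVAL_MODE" \
        --input_file "$PRED" \
        --output_file "$EXP_MODEL_ID/pred/Task_1.txt" > "$LOG" 2>&1
then
    echo "evaluation finished; output written to $EXP_MODEL_ID/pred/Task_1.txt"
else
    echo "evaluation failed; see $LOG"
fi
```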
    

Results

Here is a comparison of multimodal reasoning abilities under various fine-tuning methods. NS and SP denote the use of negative sampling and self-evolving prompting, respectively. The overall accuracy of ColonR1 on ColonEval is reported in the last column. All prediction results and evaluation scores for ColonR1 are available on 🔗 Google Drive.


Table 1: Comparison of multimodal reasoning abilities under various fine-tuning methods.


Figure 2: Qualitative comparison of ColonR1 with Med-R1 and Qwen-SFT.