Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning
January 30, 2026

BAGEL Training Zebra-CoT
This repository is adapted from the Bagel repository.
Setup
git clone https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT.git
cd Bagel-Zebra-CoT
conda create -n bagel python=3.10 -y
conda activate bagel
pip install -r requirements.txt
pip install flash_attn --no-build-isolation
Download checkpoint
Set HF_HOME in download_model.py to the directory where the checkpoint should be stored, then run:
python download_model.py
You can also download the checkpoint directly from Python if HF_HOME is already set:
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="multimodal-reasoning-lab/Bagel-Zebra-CoT",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
Inference

The inference script (infz_bf16.py) natively supports interleaved text and visual reasoning. To customize it for your use case:
1. Model Checkpoint Path
Update the checkpoint path to point to your model:
checkpoint_dir = "/path/to/your/HF_HOME/models/Bagel-Zebra-CoT"
For example, if the checkpoint was downloaded into the Hugging Face cache under HF_HOME, the path to the snapshot folder is:
checkpoint_dir = f"{HF_HOME}/models--multimodal-reasoning-lab--Bagel-Zebra-CoT/snapshots/c1ff3c56dd5909841523e3a6b554c77d919c2b28"
If you downloaded the checkpoint to a local directory instead, point to it directly:
checkpoint_dir = f"{HF_HOME}/models/Bagel-Zebra-CoT"
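Hardcoding the commit hash in the snapshot path is brittle if the checkpoint is ever re-downloaded at a newer revision. The sketch below looks up the snapshot folder instead, assuming the cache layout shown above (models--<org>--<name>/snapshots/<commit-hash>). The helper name find_snapshot_dir is hypothetical and is not part of this repository.

```python
import os
from pathlib import Path
from typing import Optional


def find_snapshot_dir(cache_root: str, repo_id: str) -> Optional[Path]:
    """Return the most recently modified snapshot folder for repo_id, or None.

    Hypothetical helper: assumes the layout
    <cache_root>/models--<org>--<name>/snapshots/<commit-hash>.
    """
    repo_folder = Path(cache_root) / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_folder / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = [p for p in snapshots.iterdir() if p.is_dir()]
    if not candidates:
        return None
    # If several revisions are cached, prefer the one touched most recently.
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # Falls back to the current directory when HF_HOME is unset.
    cache_root = os.environ.get("HF_HOME", ".")
    print(find_snapshot_dir(cache_root, "multimodal-reasoning-lab/Bagel-Zebra-CoT"))
```

If the function returns None, the checkpoint is not under that cache root, and the path should be set by hand as described above.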
2. Setting up prompt and images
Edit the prompt and image variables in infz_bf16.py (around lines 203-211):
For single image problems:
prompt = "Your question here"
image = Image.open('path/to/your/image.png')
For multiple image problems:
prompt = "Your question about multiple images"
image_1 = Image.open('path/to/image1.jpg')
image_2 = Image.open('path/to/image2.jpg')
image_3 = Image.open('path/to/image3.jpg')
image = [image_1, image_2, image_3] # List of images
For text-only problems:
prompt = "Your text-only question"
image = None
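The three cases above differ only in the shape of the image variable: None, a single image, or a list of images. If you wrap the script in your own code, it can help to reduce them to one shape first. The following is a minimal sketch with a hypothetical helper, normalize_images. It is not taken from infz_bf16.py, and how the script itself handles these cases is not shown here.

```python
from typing import Any, List


def normalize_images(image: Any) -> List[Any]:
    """Return the image input as a list (hypothetical helper).

    None        -> []        (text-only problem)
    one image   -> [image]   (single-image problem)
    list/tuple  -> list(...) (multi-image problem)
    """
    if image is None:
        return []
    if isinstance(image, (list, tuple)):
        return list(image)
    return [image]
```

Downstream code can then loop over the result without branching on the input type.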
3. Inference Parameters
You can adjust the generation parameters in the inference_hyper dictionary:
inference_hyper = dict(
    do_sample=True,
    text_temperature=0.3,
    cfg_text_scale=4.0,
    cfg_img_scale=2.0,
    cfg_interval=[0.0, 1.0],
    timestep_shift=3.0,
    num_timesteps=50,
    cfg_renorm_min=0.0,
    cfg_renorm_type="text_channel",
)
For details, refer to the Jupyter notebook in the original Bagel repository.
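When comparing several parameter settings, it can be safer to derive each variant from the defaults than to edit the dictionary in place. The sketch below copies the values listed above and rejects misspelled keys. The names DEFAULT_INFERENCE_HYPER and with_overrides are hypothetical and do not appear in infz_bf16.py.

```python
import copy
from typing import Any, Dict

# Defaults copied from the inference_hyper dictionary shown above.
DEFAULT_INFERENCE_HYPER: Dict[str, Any] = dict(
    do_sample=True,
    text_temperature=0.3,
    cfg_text_scale=4.0,
    cfg_img_scale=2.0,
    cfg_interval=[0.0, 1.0],
    timestep_shift=3.0,
    num_timesteps=50,
    cfg_renorm_min=0.0,
    cfg_renorm_type="text_channel",
)


def with_overrides(**overrides: Any) -> Dict[str, Any]:
    """Return a copy of the defaults with the given keys replaced.

    Hypothetical helper: raises KeyError for names that are not in the defaults,
    so a typo such as `temperature` is caught before inference starts.
    """
    unknown = set(overrides) - set(DEFAULT_INFERENCE_HYPER)
    if unknown:
        raise KeyError(f"Unknown inference parameter(s): {sorted(unknown)}")
    # Deep copy so the mutable cfg_interval list is not shared between variants.
    merged = copy.deepcopy(DEFAULT_INFERENCE_HYPER)
    merged.update(overrides)
    return merged
```

For example, inference_hyper = with_overrides(text_temperature=0.7, num_timesteps=25) leaves every other value at its default.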
Example Use Cases
prompt = "Subtract all cylinders. Add 1 red sphere. How many objects are left?"
image = Image.open('test_images/image.png')
Training
To train, run:
bash scripts/train.sh
For details, please refer to the original repo README.
The interleaved reasoning data customized for Zebra-CoT can be found in think_trace_dataset.py.
Cite
@inproceedings{li2026zebracot,
  title={Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning},
  author={Ang Li and Charles Wang and Deqing Fu and Kaiyu Yue and Zikui Cai
          and Wang Bill Zhu and Ollie Liu and Peng Guo and Willie Neiswanger
          and Furong Huang and Tom Goldstein and Micah Goldblum},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=c6XIVI3TiQ}
}