README.md
November 25, 2025
Data Preparation
Organize your data according to the following structure:
data/
├── sft/
│   ├── images.zip
│   └── json
│       ├── sft_part_0.json
│       ├── sft_part_1.json
│       ├── sft_part_2.json
│       ├── sft_part_3.json
│       └── sft_part_4.json
├── rl/
│   ├── perception_all_1.parquet
│   ├── perception_all_2.parquet
│   ├── perception_all_3.parquet
│   ├── perception_all_4.parquet
│   ├── perception_all_5.parquet
│   ├── reason.parquet
│   └── search.parquet
└── search_cache/
    ├── fvqa_train_image_search_results_cache.json
    └── cached_images
        └── train.zip
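To confirm that the downloads landed in the right places, the layout above can be checked with a short script. This is a minimal sketch, not part of the repository: the helper name `missing_data_files` is hypothetical, and it assumes the check runs before the zip archives are unpacked or removed.

```python
from pathlib import Path

# Relative paths copied from the directory layout above.
EXPECTED_FILES = (
    ["sft/images.zip"]
    + [f"sft/json/sft_part_{i}.json" for i in range(5)]
    + [f"rl/perception_all_{i}.parquet" for i in range(1, 6)]
    + [
        "rl/reason.parquet",
        "rl/search.parquet",
        "search_cache/fvqa_train_image_search_results_cache.json",
        "search_cache/cached_images/train.zip",
    ]
)


def missing_data_files(root):
    """Return the expected relative paths that are not files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains data/.
    missing = missing_data_files("data")
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected files are present.")
```

An empty result means every file named in the layout was found; anything listed should be downloaded again or moved into place.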
Cold Start Data
Please download the SFT data from here. First, unzip images.zip with the following commands.
cd sft
unzip images.zip
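On systems where the `unzip` command is not available, the same extraction can be done with Python's standard `zipfile` module. The sketch below is an alternative under that assumption, not a script shipped with the repository, and the helper name `extract_zip` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_zip(archive_path, dest_dir=None):
    """Extract a zip archive and return the names of its entries.

    By default the archive is unpacked into its own directory, which
    matches running `unzip` from inside that directory.
    """
    archive_path = Path(archive_path)
    dest_dir = Path(dest_dir) if dest_dir is not None else archive_path.parent
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()
```

For example, calling `extract_zip("sft/images.zip")` from the data/ directory would unpack the archive into sft/, mirroring the commands above.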
After that, run data_convert.py to convert the JSON data, replacing path_to_json_path and path_to_image_path with your local paths.
python ../cold_start/data_convert.py --input_path path_to_json_path --data_path path_to_image_path
Note that we do not provide the multimodal CoT SFT data due to policy reasons.
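Before running data_convert.py, it can help to confirm that each downloaded JSON shard parses cleanly, since a truncated download would otherwise fail partway through conversion. The sketch below is an illustration only: the helper name `summarize_json_shards` is hypothetical, and because the record schema is not described here, it reports just the top-level container type and its size.

```python
import json
from pathlib import Path


def summarize_json_shards(json_dir):
    """Map each sft_part_*.json file name to (top-level type name, length).

    json.load raises json.JSONDecodeError if a shard is corrupt or truncated.
    Length is None when the top-level value is neither a list nor a dict.
    """
    summary = {}
    for path in sorted(Path(json_dir).glob("sft_part_*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        size = len(data) if isinstance(data, (list, dict)) else None
        summary[path.name] = (type(data).__name__, size)
    return summary
```

Calling `summarize_json_shards("sft/json")` from the data/ directory should return one entry per shard listed in the layout, from sft_part_0.json to sft_part_4.json.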
Search Cache
Please download the search cache from here.
First, unzip train.zip.
cd search_cache/cached_images
unzip train.zip
Then run cache_convert.py to convert the JSON data, replacing path_to_json_path and path_to_image_path with your local paths.
python ../reinforcement_learning/cache_convert.py --input_json_path path_to_json_path --data_path path_to_image_path